MyArxiv
Robotics
LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation
We address language-conditioned robotic manipulation using flow-based trajectory generation, which enables training on human and web videos of object manipulation and requires only minimal embodiment-specific data. This task is challenging, as object trajectory generation from pre-manipulation images and natural language instructions requires appropriate instruction-flow alignment. To tackle this challenge, we propose the flow-based Language Instruction-guided open-Loop ACtion generator (LILAC). This flow-based Vision-Language-Action model (VLA) generates object-centric 2D optical flow from an RGB image and a natural language instruction, and converts the flow into a 6-DoF manipulator trajectory. LILAC incorporates two key components: Semantic Alignment Loss, which strengthens language conditioning to generate instruction-aligned optical flow, and Prompt-Conditioned Cross-Modal Adapter, which aligns learned visual prompts with image and text features to provide rich cues for flow generation. Experimentally, our method outperformed existing approaches in generated flow quality across multiple benchmarks. Furthermore, in physical object manipulation experiments using free-form instructions, LILAC demonstrated a superior task success rate compared to existing methods. The project page is available at https://lilac-75srg.kinsta.page/.
comment: Accepted to IEEE RA-L
Temporally Decoupled Diffusion Planning for Autonomous Driving
Motion planning in dynamic urban environments requires balancing immediate safety with long-term goals. While diffusion models effectively capture multi-modal decision-making, existing approaches treat trajectories as monolithic entities, overlooking heterogeneous temporal dependencies where near-term plans are constrained by instantaneous dynamics and far-term plans by navigational goals. To address this, we propose Temporally Decoupled Diffusion Model (TDDM), which reformulates trajectory generation via a noise-as-mask paradigm. By partitioning trajectories into segments with independent noise levels, we implicitly treat high noise as information voids and weak noise as contextual cues. This compels the model to reconstruct corrupted near-term states by leveraging internal correlations with better-preserved temporal contexts. Architecturally, we introduce a Temporally Decoupled Adaptive Layer Normalization (TD-AdaLN) to inject segment-specific timesteps. During inference, our Asymmetric Temporal Classifier-Free Guidance utilizes weakly noised far-term priors to guide immediate path generation. Evaluations on the nuPlan benchmark show TDDM approaches or exceeds state-of-the-art baselines, particularly excelling in the challenging Test14-hard subset.
comment: ICAPS
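The noise-as-mask idea in the abstract above, where each trajectory segment gets its own noise level, can be sketched in a few lines. This is only an illustration under stated assumptions: the function `noise_segments`, the 1-D trajectory, the fixed segment length, and the simple variance-preserving mix `sqrt(1 - t) * x0 + sqrt(t) * eps` are hypothetical stand-ins, not the schedule or parameterization used by TDDM.

```python
import math
import random

def noise_segments(traj, seg_len, levels, rng):
    """Apply an independent noise level to each fixed-length segment.

    traj: list of floats (a 1-D stand-in for trajectory states)
    seg_len: number of states per segment
    levels: one noise level in [0, 1] per segment (0 = clean, 1 = pure noise)
    rng: a random.Random instance used to draw the Gaussian noise
    """
    n_segs = math.ceil(len(traj) / seg_len)
    if len(levels) != n_segs:
        raise ValueError("need one noise level per segment")
    noisy = []
    for i, x0 in enumerate(traj):
        t = levels[i // seg_len]
        signal = math.sqrt(1.0 - t)  # share of the clean state that survives
        sigma = math.sqrt(t)         # share replaced by Gaussian noise
        noisy.append(signal * x0 + sigma * rng.gauss(0.0, 1.0))
    return noisy

rng = random.Random(0)
traj = [float(i) for i in range(8)]
# Near-term segment fully corrupted, far-term segment left clean (assumed levels).
noisy = noise_segments(traj, seg_len=4, levels=[1.0, 0.0], rng=rng)
```

With these assumed levels, the last four states pass through unchanged and act as context, while the first four carry no information about the clean trajectory and would have to be reconstructed by the denoiser.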
Visualizing Impedance Control in Augmented Reality for Teleoperation: Design and User Evaluation
Teleoperation for contact-rich manipulation remains challenging, especially when using low-cost, motion-only interfaces that provide no haptic feedback. Virtual reality controllers enable intuitive motion control but do not allow operators to directly perceive or regulate contact forces, limiting task performance. To address this, we propose an augmented reality (AR) visualization of the impedance controller's target pose and its displacement from each robot end effector. This visualization conveys the forces generated by the controller, providing operators with intuitive, real-time feedback without expensive haptic hardware. We evaluate the design in a dual-arm manipulation study with 17 participants who repeatedly reposition a box with and without the AR visualization. Results show that AR visualization reduces completion time by 24% for force-critical lifting tasks, with no significant effect on sliding tasks where precise force control is less critical. These findings indicate that making the impedance target visible through AR is a viable approach to improve human-robot interaction for contact-rich teleoperation.
comment: 6 pages, 5 figures, submitted to IEEE RO-MAN 2026
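The relation between the displayed displacement and the force the controller generates can be sketched with a stiffness-only spring law. This is a minimal illustration, not the authors' controller: the function name `commanded_force`, the per-axis stiffness values, and the omission of damping and rotational terms are all assumptions.

```python
def commanded_force(target_pos, ee_pos, stiffness):
    """Per-axis force of a stiffness-only impedance law: F = K * (x_target - x_ee).

    All arguments are 3-element sequences (x, y, z). Damping and rotational
    terms are deliberately left out of this sketch.
    """
    return [k * (t - x) for k, t, x in zip(stiffness, target_pos, ee_pos)]

# Hypothetical values: the target sits 2 cm below the end effector along z,
# so the controller pushes downward with about 8 N at 400 N/m.
force = commanded_force(
    target_pos=[0.50, 0.00, 0.30],
    ee_pos=[0.50, 0.00, 0.32],
    stiffness=[400.0, 400.0, 400.0],
)
```

Under this assumption, the gap an operator sees between the target pose and the end effector is proportional to the applied force, which is why showing the displacement can stand in for haptic feedback.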
Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation
Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness-efficiency trade-off.
MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation
Vision-Language-Action (VLA) models aim to control robots for manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long-horizon error accumulation, and lack a mechanism to capture environment dynamics without extra modules. To this end, we present MMaDA-VLA, a fully native pre-trained large diffusion VLA model that unifies multi-modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that embeds language, images, and continuous robot controls into one discrete token space and trains a single backbone with masked token denoising to jointly generate a future goal observation and an action chunk in parallel. Iterative denoising enables global, order-free refinement, improving long-horizon consistency while grounding actions in predicted future visual outcomes without auxiliary world models. Experiments across simulation benchmarks and real-world tasks show state-of-the-art performance, achieving 98.0% average success on LIBERO and 4.78 average length on CALVIN.
System Design for Maintaining Internal State Consistency in Long-Horizon Robotic Tabletop Games
Long-horizon tabletop games pose a distinct systems challenge for robotics: small perceptual or execution errors can invalidate accumulated task state, propagate across decision-making modules, and ultimately derail interaction. This paper studies how to maintain internal state consistency in turn-based, multi-human robotic tabletop games through deliberate system design rather than isolated component improvement. Using Mahjong as a representative long-horizon setting, we present an integrated architecture that explicitly maintains perceptual, execution, and interaction state, partitions high-level semantic reasoning from time-critical perception and control, and incorporates verified action primitives with tactile-triggered recovery to prevent premature state corruption. We further introduce interaction-level monitoring mechanisms to detect turn violations and hidden-information breaches that threaten execution assumptions. Beyond demonstrating complete-game operation, we provide an empirical characterization of failure modes, recovery effectiveness, cross-module error propagation, and hardware-algorithm trade-offs observed during deployment. Our results show that explicit partitioning, monitored state transitions, and recovery mechanisms are critical for sustaining executable consistency over extended play, whereas monolithic or unverified pipelines lead to measurable degradation in end-to-end reliability. The proposed system serves as an empirical platform for studying system-level design principles in long-horizon, turn-based interaction.
LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior
We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly. This implicit learning strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as real-world experiments. LaMP consistently outperforms evaluated VLA baselines across LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, achieving the highest reported average success rates under the same training budgets. On LIBERO-Plus OOD perturbations, LaMP shows improved robustness with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
UMBRELLA: Uncertainty-aware Multi-robot Reactive Coordination under Dynamic Temporal Logic Tasks
Multi-robot systems can be extremely efficient for accomplishing team-wise tasks by acting concurrently and collaboratively. However, most existing methods either assume static task features or simply replan when environmental changes occur. This paper addresses the challenging problem of coordinating multi-robot systems for collaborative tasks involving dynamic and moving targets. We explicitly model the uncertainty in target motion prediction via Conformal Prediction (CP), while respecting the spatial-temporal constraints specified by Linear Temporal Logic (LTL). The proposed framework (UMBRELLA) combines Monte Carlo Tree Search (MCTS) over partial plans with uncertainty-aware rollouts, and introduces a CP-based metric to guide and accelerate the search. The objective is to minimize the Conditional Value at Risk (CVaR) of the average makespan. For tasks released online, a receding-horizon planning scheme dynamically adjusts the assignments based on updated task specifications and motion predictions. Spatial and temporal constraints among the tasks are always ensured, and only partial synchronization is required for the collaborative tasks during online execution. Extensive large-scale simulations and hardware experiments demonstrate substantial reductions in both the average makespan and its variance, of 23% and 71% respectively, compared with static baselines.
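Two quantities named in the abstract above, a conformal prediction radius and the CVaR of a makespan distribution, have standard empirical forms that can be sketched as follows. This is a generic illustration, not the UMBRELLA implementation: the function names, the split-conformal variant, and the sample values are assumptions, and the abstract does not say which CP variant or CVaR estimator the authors use.

```python
import math

def conformal_radius(scores, delta):
    """Split-conformal quantile of calibration errors.

    Returns a radius r such that a new prediction error is at most r with
    probability at least 1 - delta, assuming exchangeable errors.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - delta))
    if k > n:
        return math.inf  # too few calibration samples for this delta
    return sorted(scores)[k - 1]

def empirical_cvar(samples, alpha):
    """Mean of the worst (1 - alpha) fraction of sampled makespans."""
    n = len(samples)
    # Small guard so floating-point error does not inflate the tail size.
    tail = max(1, math.ceil((1.0 - alpha) * n - 1e-9))
    worst = sorted(samples, reverse=True)[:tail]
    return sum(worst) / tail

# Hypothetical calibration errors (metres) and rollout makespans (seconds).
errors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
radius = conformal_radius(errors, delta=0.2)

makespans = [10, 12, 11, 30, 13, 12, 11, 10, 14, 40]
risk = empirical_cvar(makespans, alpha=0.8)
```

Minimizing a tail mean such as `risk` rather than the plain average penalizes plans whose makespan is occasionally very long, which matches the abstract's emphasis on reducing variance as well as the mean.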
IntentReact: Guiding Reactive Object-Centric Navigation via Topological Intent
Object-goal visual navigation requires robots to reason over semantic structure and act effectively under partial observability. Recent approaches based on object-level topological maps enable long-horizon navigation without dense geometric reconstruction, but their execution remains limited by the gap between global topological guidance and local perception-driven control. In particular, local decisions are made solely from the current egocentric observation, without access to information beyond the robot's field of view. As a result, the robot may persist along its current heading even when initially oriented away from the goal, moving toward directions that do not decrease the global topological distance. In this work, we propose IntentReact, an intent-conditioned object-centric navigation framework that introduces a compact interface between global topological planning and reactive object-centric control. Our approach encodes global topological guidance as a low-dimensional directional signal, termed intent, which conditions a learned waypoint prediction policy to bias navigation toward topologically consistent progression. This design enables the robot to promptly reorient when local observations are misleading, guiding motion toward directions that decrease global topological distance while preserving the reactivity and robustness of object-centric control. We evaluate the proposed framework through extensive experiments, demonstrating improved navigation success and execution quality compared to prior object-centric navigation methods.
Integrating Deep RL and Bayesian Inference for ObjectNav in Mobile Robotics SC 2026
Autonomous object search is challenging for mobile robots operating in indoor environments due to partial observability, perceptual uncertainty, and the need to trade off exploration and navigation efficiency. Classical probabilistic approaches explicitly represent uncertainty but typically rely on handcrafted action-selection heuristics, while deep reinforcement learning enables adaptive policies but often suffers from slow convergence and limited interpretability. This paper proposes a hybrid object-search framework that integrates Bayesian inference with deep reinforcement learning. The method maintains a spatial belief map over target locations, updated online through Bayesian inference from calibrated object detections, and trains a reinforcement learning policy to select navigation actions directly from this probabilistic representation. The approach is evaluated in realistic indoor simulation using Habitat 3.0 and compared against developed baseline strategies. Across two indoor environments, the proposed method improves success rate while reducing search effort. Overall, the results support the value of combining Bayesian belief estimation with learned action selection to achieve more efficient and reliable object-search behavior under partial observability.
comment: Accepted and to be published in the ICARSC 2026 26th IEEE International Conference on Autonomous Robot Systems and Competitions
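The belief-map update described in the abstract above can be sketched as a plain Bayes rule over grid cells. This is a minimal sketch under stated assumptions: the function `update_belief`, the single-cell observation model, and the detector rates `tpr` and `fpr` are hypothetical and are not taken from the paper.

```python
def update_belief(belief, observed_cell, detected, tpr=0.9, fpr=0.1):
    """Bayes update of a target-location belief after observing one cell.

    belief: list of probabilities over cells, summing to 1
    detected: True if the detector fired while looking at observed_cell
    tpr / fpr: assumed true- and false-positive rates of the detector
    """
    posterior = []
    for cell, prior in enumerate(belief):
        # Probability that the detector fires if the target is in `cell`.
        p_fire = tpr if cell == observed_cell else fpr
        likelihood = p_fire if detected else 1.0 - p_fire
        posterior.append(prior * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]

belief = [0.25, 0.25, 0.25, 0.25]
after_hit = update_belief(belief, observed_cell=0, detected=True)
after_miss = update_belief(belief, observed_cell=0, detected=False)
```

A detection concentrates probability on the observed cell, while a miss shifts it to the unobserved cells. A learned policy could then take the resulting map as its input.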
Bayesian Learning-Enhanced Navigation with Deep Smoothing for Inertial-Aided Navigation
Accurate post-processing navigation is essential for applications such as survey and mapping, where the full measurement history can be exploited to refine past state estimates. Fixed-interval smoothing algorithms represent the theoretically optimal solution under Gaussian assumptions. However, loosely coupled INS/GNSS systems fundamentally inherit the systematic position bias of raw GNSS measurements, leaving a persistent accuracy gap that model-based smoothers cannot resolve. To address this limitation, we propose BLENDS, which integrates Bayesian learning with deep smoothing to enhance navigation performance. BLENDS is a data-driven post-processing framework that augments the classical two-filter smoother with a transformer-based neural network. It learns to modify the filter covariance matrices and apply an additive correction to the smoothed error-state directly within the Bayesian framework. A novel Bayesian-consistent loss jointly supervises the smoothed mean and covariance, enforcing minimum-variance estimates while maintaining statistical consistency. BLENDS is evaluated on two real-world datasets spanning a mobile robot and a quadrotor. Across all unseen test trajectories, BLENDS achieves horizontal position improvements of up to 63% over the baseline forward EKF.
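For readers unfamiliar with the classical two-filter smoother that BLENDS builds on, its fusion step can be sketched for a scalar state. This shows only the textbook information-weighted combination of a forward estimate and a backward estimate. The function name is hypothetical, the backward estimate is assumed to exclude the current measurement, and none of the learned covariance modification or additive correction from the abstract is included.

```python
def fuse_two_filter(x_fwd, p_fwd, x_bwd, p_bwd):
    """Information-weighted fusion of forward and backward scalar estimates.

    x_*: state estimates, p_*: their (positive) error variances.
    Returns the smoothed estimate and its variance.
    """
    info = 1.0 / p_fwd + 1.0 / p_bwd          # information adds up
    p_smooth = 1.0 / info
    x_smooth = p_smooth * (x_fwd / p_fwd + x_bwd / p_bwd)
    return x_smooth, p_smooth

# Two equally confident estimates: the result is their midpoint,
# with half the variance of either one.
x_s, p_s = fuse_two_filter(x_fwd=1.0, p_fwd=1.0, x_bwd=3.0, p_bwd=1.0)
```

Because the fused estimate is weighted by the inverse variances, changing the filter covariances (as the abstract says the network learns to do) directly changes how much each pass contributes.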
SafeGuard ASF: SR Agentic Humanoid Robot System for Autonomous Industrial Safety
The rise of unmanned "dark factories" operating without human presence demands autonomous safety systems capable of detecting and responding to multiple hazard types. We present SafeGuard ASF (Agentic Security Fleet), a comprehensive framework deploying humanoid robots for autonomous hazard detection in industrial environments. Our system integrates multi-modal perception (RGB-D imaging), a ReAct-based agentic reasoning framework, and learned locomotion policies on the Unitree G1 humanoid platform. We address three critical hazard scenarios: fire and smoke detection, abnormal temperature monitoring in pipelines, and intruder detection in restricted zones. Our perception pipeline achieves 94.2% mAP for fire or smoke detection with 127 ms latency. We train multiple locomotion policies, including dance motion tracking and velocity control, using Unitree RL Lab with PPO, demonstrating stable convergence within 80,000 training iterations. We validate our system in both simulation and real-world environments, demonstrating autonomous patrol, human detection with visual perception, and obstacle avoidance capabilities. The proposed ToolOrchestra action framework enables structured decision-making through perception, reasoning, and actuation tools.
Connectivity-Aware Representations for Constrained Motion Planning via Multi-Scale Contrastive Learning ICRA 2026
The objective of constrained motion planning is to connect start and goal configurations while satisfying task-specific constraints. Motion planning becomes inefficient or infeasible when the configurations lie in disconnected regions, known as essentially mutually disconnected (EMD) components. Constraints further restrict feasible space to a lower-dimensional submanifold, while redundancy introduces additional complexity because a single end-effector pose admits infinitely many inverse kinematic solutions that may form discrete self-motion manifolds. This paper addresses these challenges by learning a connectivity-aware representation for selecting start and goal configurations prior to planning. Joint configurations are embedded into a latent space through multi-scale manifold learning across neighborhood ranges from local to global, and clustering generates pseudo-labels that supervise a contrastive learning framework. The proposed framework provides a connectivity-aware measure that biases the selection of start and goal configurations in connected regions, avoiding EMDs and yielding higher success rates with reduced planning time. Experiments on various manipulation tasks showed that our method achieves 1.9 times higher success rates and reduces the planning time by a factor of 0.43 compared to baselines.
comment: 8 pages, 5 figures, ICRA 2026
A Minimum-Energy Control Approach for Redundant Mobile Manipulators in Physical Human-Robot Interaction Applications
Research on mobile manipulation systems that physically interact with humans has expanded rapidly in recent years, opening the way to tasks which could not be performed using fixed-base manipulators. Within this context, developing suitable control methodologies is essential since mobile manipulators introduce additional degrees of freedom, making the design of control approaches more challenging and more amenable to performance optimization. This paper proposes a control approach for a mobile manipulator, composed of a mobile base with a robotic arm mounted on top, with the objective of minimizing the overall kinetic energy stored in the whole-body mobile manipulator in physical human-robot interaction applications. The approach is experimentally tested with reference to a peg-in-hole task, and the results demonstrate that the proposed approach reduces the overall kinetic energy stored in the whole-body robotic system and improves the system performance compared with the benchmark method.
The Competence Shadow: Theory and Bounds of AI Assistance in Safety Engineering
As AI assistants become integrated into safety engineering workflows for Physical AI systems, a critical question emerges: does AI assistance improve safety analysis quality, or introduce systematic blind spots that surface only through post-deployment incidents? This paper develops a formal framework for AI assistance in safety analysis. We first establish why safety engineering resists benchmark-driven evaluation: safety competence is irreducibly multidimensional, constrained by context-dependent correctness, inherent incompleteness, and legitimate expert disagreement. We formalize this through a five-dimensional competence framework capturing domain knowledge, standards expertise, operational experience, contextual understanding, and judgment. We introduce the competence shadow: the systematic narrowing of human reasoning induced by AI-generated safety analysis. The shadow is not what the AI presents, but what it prevents from being considered. We formalize four canonical human-AI collaboration structures and derive closed-form performance bounds, demonstrating that the competence shadow compounds multiplicatively to produce degradation far exceeding naive additive estimates. The central finding is that AI assistance in safety engineering is a collaboration design problem, not a software procurement decision. The same tool degrades or improves analysis quality depending entirely on how it is used. We derive non-degradation conditions for shadow-resistant workflows and call for a shift from tool qualification toward workflow qualification for trustworthy Physical AI.
comment: 8 pages, 3 figures, 2 tables
Dissimilarity-Based Persistent Coverage Control of Multi-Robot Systems for Improving Solar Irradiance Prediction Accuracy in Solar Thermal Power Plants
Accurate forecasting of future solar irradiance is essential for the effective control of solar thermal power plants. Although various kriging-based methods have been proposed to address the prediction problem, these methods typically do not provide an appropriate sampling strategy to dynamically position mobile sensors for optimizing prediction accuracy in real time, which is critical for achieving accurate forecasts with a minimal number of sensors. This paper introduces a dissimilarity map derived from a kriging model and proposes a persistent coverage control algorithm that effectively guides agents toward regions where additional observations are required to improve prediction performance. By means of experiments using mobile robots, the proposed approach was shown to obtain more accurate predictions than the considered baselines under various emulated irradiance fields.
comment: 8 pages, 6 figures, 5 tables
CTS-PLL: A Robust and Anytime Framework for Collaborative Task Sequencing and Multi-Agent Path Finding
The Collaborative Task Sequencing and Multi-Agent Path Finding (CTS-MAPF) problem requires agents to accomplish sequences of tasks while avoiding collisions, posing significant challenges due to its combinatorial complexity. This work introduces CTS-PLL, a hierarchical framework that extends the configuration-based CTS-MAPF planning paradigm with two key enhancements: a lock agents detection and release mechanism leveraging a complete planning method for local re-planning, and an anytime refinement procedure based on Large Neighborhood Search (LNS). These additions ensure robustness in dense environments and enable continuous improvement of solution quality. Extensive evaluations across sparse and dense benchmarks demonstrate that CTS-PLL achieves higher success rates and solution quality compared with existing methods, while maintaining competitive runtime efficiency. Real-world robot experiments further demonstrate the feasibility of the approach in practice.
comment: 8 pages, 5 figures, under review
ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
In recent human-robot collaboration environments, there is a growing focus on integrating diverse sensor data beyond visual information to enable safer and more intelligent task execution. Although thermal data can be crucial for enhancing robot safety and operational efficiency, its integration has been relatively overlooked in prior research. This paper proposes a novel Vision-Language-Action (VLA) framework that incorporates thermal information for robot task execution. The proposed system leverages a Vision-Language Model (VLM) as a high-level planner to interpret complex natural language commands and decompose them into simpler sub-tasks. This approach facilitates efficient data collection and robust reasoning for complex operations. Unlike conventional methods that rely solely on visual data, our approach integrates thermal information, enabling the robot to perceive physical properties and proactively ensure environmental safety. Experimental results from real-world task scenarios validate the feasibility of our proposed framework, suggesting its potential to enhance task success rates and safety compared to existing vision-based systems.
$π$, But Make It Fly: Physics-Guided Transfer of VLA Models to Aerial Manipulation
Vision-Language-Action (VLA) models such as $π_0$ have demonstrated remarkable generalization across diverse fixed-base manipulators. However, transferring these foundation models to aerial platforms remains an open challenge due to the fundamental mismatch between the quasi-static dynamics of fixed-base arms and the underactuated, highly dynamic nature of flight. In this work, we introduce AirVLA, a system that investigates the transferability of manipulation-pretrained VLAs to aerial pick-and-place tasks. We find that while visual representations transfer effectively, the specific control dynamics required for flight do not. To bridge this "dynamics gap" without retraining the foundation model, we introduce a Payload-Aware Guidance mechanism that injects payload constraints directly into the policy's flow-matching sampling process. To overcome data scarcity, we further utilize a Gaussian Splatting pipeline to synthesize navigation training data. We evaluate our method through a cumulative 460 real-world experiments which demonstrate that this synthetic data is a key enabler of performance, unlocking 100% success in navigation tasks where directly fine-tuning on teleoperation data alone attains 81% success. Our inference-time intervention, Payload-Aware Guidance, increases real-world pick-and-place task success from 23% to 50%. Finally, we evaluate the model on a long-horizon compositional task, achieving a 62% overall success rate. These results suggest that pre-trained manipulation VLAs, with appropriate data augmentation and physics-informed guidance, can transfer to aerial manipulation and navigation, as well as the composition of these tasks.
Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model
Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising perspective for enabling exploration driven by motion token uncertainty. Motivated by this insight, we propose a novel tokenized traffic simulation policy, R1Sim, which represents an initial attempt to explore reinforcement learning based on motion token entropy patterns, and systematically analyzes the impact of different motion tokens on simulation outcomes. Specifically, we introduce an entropy-guided adaptive sampling mechanism that focuses on previously overlooked motion tokens with high uncertainty yet high potential. We further optimize motion behaviors using Group Relative Policy Optimization (GRPO), guided by a safety-aware reward design. Overall, these components enable a balanced exploration-exploitation trade-off through diverse high-uncertainty sampling and group-wise comparative estimation, resulting in realistic, safe, and diverse multi-agent behaviors. Extensive experiments on the Waymo Sim Agent benchmark demonstrate that R1Sim achieves competitive performance compared to state-of-the-art methods.
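Two ingredients named in the abstract above, motion-token entropy and GRPO's group-wise comparison, reduce to short formulas that can be sketched as follows. This is a generic illustration rather than R1Sim's code: the function names, the toy distributions, and the reward values are assumptions, and the safety-aware reward itself is not specified in the abstract.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# A uniform distribution over four motion tokens is maximally uncertain;
# a one-hot distribution has zero entropy.
uncertain = token_entropy([0.25, 0.25, 0.25, 0.25])
confident = token_entropy([1.0, 0.0, 0.0, 0.0])

# Hypothetical rewards for four rollouts sampled from the same scene.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

An entropy-guided sampler could spend more of its exploration budget on positions where `token_entropy` is high. The group-relative advantages then score each rollout against its siblings instead of against a learned value function.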
Wireless bioelectronics for untethered biohybrid robots
Biohybrid robots integrate living tissues with engineered artificial structures to achieve organism-inspired actuation and behavior. A persistent challenge is delivering stimulation and control signals without relying on tethered wiring or bulky hardware immersed in cell-culture media. Wireless bioelectronics addresses this limitation by enabling the remote transfer of control signals, typically via radio-frequency magnetic fields, to locally stimulate muscle tissues at tissue-electrode interfaces. In parallel, wireless optoelectronics enables remote control of optogenetically modified, muscle-based robots by embedding light emitters that initiate muscle actuation through light-gated ion channels. Further advances incorporate neuromuscular junctions, leveraging biological signal transduction to enable selective control of multiple actuators through wireless frequency- and time-division multiplexing. This perspective article summarizes recent advances in control strategies for biohybrid robots, namely, wireless electrical stimulation, wireless optical stimulation, and neuromuscular integration. It then describes cross-cutting design principles and highlights a future direction, namely, co-integration of neural organoid-bioelectronics toward autonomous, closed-loop biohybrid robots.
SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models
Vision-language-action (VLA) models enable robots to follow natural-language instructions grounded in visual observations, but the instruction channel also introduces a critical vulnerability: small textual perturbations can alter downstream robot behavior. Systematic robustness evaluation therefore requires a black-box attacker that can generate minimal yet effective instruction edits across diverse VLA models. To this end, we present SABER, an agent-centric approach for automatically generating instruction-based adversarial attacks on VLA models under bounded edit budgets. SABER uses a GRPO-trained ReAct attacker to generate small, plausible adversarial instruction edits using character-, token-, and prompt-level tools under a bounded edit budget that induces targeted behavioral degradation, including task failure, unnecessarily long execution, and increased constraint violations. On the LIBERO benchmark across six state-of-the-art VLA models, SABER reduces task success by 20.6%, increases action-sequence length by 55%, and raises constraint violations by 33%, while requiring 21.1% fewer tool calls and 54.7% fewer character edits than strong GPT-based baselines. These results show that small, plausible instruction edits are sufficient to substantially degrade robot execution, and that an agentic black-box pipeline offers a practical, scalable, and adaptive approach for red-teaming robotic foundation models.
COIN: Collaborative Interaction-Aware Multi-Agent Reinforcement Learning for Self-Driving Systems
Multi-Agent Self-Driving (MASD) systems provide an effective solution for coordinating autonomous vehicles to reduce congestion and enhance both safety and operational efficiency in future intelligent transportation systems. Multi-Agent Reinforcement Learning (MARL) has emerged as a promising approach for developing advanced end-to-end MASD systems. However, achieving efficient and safe collaboration in dynamic MASD systems remains a significant challenge in dense scenarios with complex agent interactions. To address this challenge, we propose a novel collaborative (CO-) interaction-aware (-IN) MARL framework, named COIN. Specifically, we develop a new counterfactual individual-global twin delayed deep deterministic policy gradient (CIG-TD3) algorithm, crafted in a "centralized training, decentralized execution" (CTDE) manner, which aims to jointly optimize the individual objectives (navigation) and the global objectives (collaboration) of agents. We further introduce a dual-level interaction-aware centralized critic architecture that captures both local pairwise interactions and global system-level dependencies, enabling more accurate global value estimation and improved credit assignment for collaborative policy learning. We conduct extensive simulation experiments in dense urban traffic environments, which demonstrate that COIN consistently outperforms other advanced baseline methods in both safety and efficiency across various system sizes. These results highlight its superiority in complex and dynamic MASD scenarios, as further validated through real-world robot demonstrations. Supplementary videos are available at https://marmotlab.github.io/COIN/
CROSS: A Mixture-of-Experts Reinforcement Learning Framework for Generalizable Large-Scale Traffic Signal Control
Recent advances in robotics, automation, and artificial intelligence have enabled urban traffic systems to operate with increasing autonomy towards future smart cities, powered in part by the development of adaptive traffic signal control (ATSC), which dynamically optimizes signal phases to mitigate congestion and optimize traffic. However, achieving effective and generalizable large-scale ATSC remains a significant challenge due to the diverse intersection topologies and highly dynamic, complex traffic demand patterns across the network. Existing RL-based methods typically use a single shared policy for all scenarios, whose limited representational capacity makes it difficult to capture diverse traffic dynamics and generalize to unseen environments. To address these challenges, we propose CROSS, a novel Mixture-of-Experts (MoE)-based decentralized RL framework for generalizable ATSC. We first introduce a Predictive Contrastive Clustering (PCC) module that forecasts short-term state transitions to identify latent traffic patterns, followed by clustering and contrastive learning to enhance pattern-level representation. We further design a Scenario-Adaptive MoE module that augments a shared policy with multiple experts, thus enabling adaptive specialization and more flexible scenario-specific strategies. We conduct extensive experiments in the SUMO simulator on both synthetic and real-world traffic datasets. Compared with state-of-the-art baselines, CROSS achieves superior performance and generalization through improved representation of diverse traffic scenarios.
Integrated Multi-Drone Task Allocation, Sequencing, and Optimal Trajectory Generation in Obstacle-Rich 3D Environments
Coordinating teams of aerial robots in cluttered three-dimensional (3D) environments requires a principled integration of discrete mission planning (deciding which robot serves which goals and in what order) with continuous-time trajectory synthesis that enforces collision avoidance and dynamic feasibility. This paper introduces IMD-TAPP (Integrated Multi-Drone Task Allocation and Path Planning), an end-to-end framework that jointly addresses multi-goal allocation, tour sequencing, and safe trajectory generation for quadrotor teams operating in obstacle-rich spaces. IMD-TAPP first discretizes the workspace into a 3D navigation graph and computes obstacle-aware robot-to-goal and goal-to-goal travel costs via graph-search-based pathfinding. These costs are then embedded within an Injected Particle Swarm Optimization (IPSO) scheme, guided by multiple linear assignment, to efficiently explore coupled assignment/ordering alternatives and to minimize mission makespan. Finally, the resulting waypoint tours are transformed into time-parameterized minimum-snap trajectories through a generation-and-optimization routine equipped with iterative validation of obstacle clearance and inter-robot separation, triggering re-planning when safety margins are violated. Extensive MATLAB simulations across cluttered 3D scenarios demonstrate that IMD-TAPP consistently produces dynamically feasible, collision-free trajectories while achieving competitive completion times. In a representative case study with two drones serving multiple goals, the proposed approach attains a minimum mission time of 136 s while maintaining the required safety constraints throughout execution.
comment: Resubmission following accepted appeal (MOD-78958). Resubmitting to cs.RO with cross-lists cs.MA and cs.AI as advised by arXiv Support
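To make the makespan objective in the IMD-TAPP abstract concrete, the sketch below evaluates one candidate allocation: each robot's tour cost is its robot-to-goal cost for the first goal plus the goal-to-goal costs along the tour, and the makespan is the largest tour cost. This only evaluates the objective; it is not the IPSO search itself. The function names, the cost-matrix layout, and the use of travel cost as a proxy for time are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def tour_cost(robot: int, tour: Sequence[int],
              robot_to_goal: List[List[float]],
              goal_to_goal: List[List[float]]) -> float:
    """Cost for one robot to visit its goals in the given order."""
    if not tour:
        return 0.0  # a robot with no goals contributes nothing
    cost = robot_to_goal[robot][tour[0]]
    for a, b in zip(tour, tour[1:]):
        cost += goal_to_goal[a][b]
    return cost


def makespan(tours: Dict[int, Sequence[int]],
             robot_to_goal: List[List[float]],
             goal_to_goal: List[List[float]]) -> float:
    """Mission makespan: the cost of the slowest robot's tour."""
    return max(tour_cost(r, t, robot_to_goal, goal_to_goal)
               for r, t in tours.items())
```

An optimizer would compare candidate allocations by this value and prefer the one with the smallest makespan.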
Vega: Learning to Drive with Natural Language Instructions
Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
comment: Code is available at https://github.com/zuosc19/Vega
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving CVPR 2026
Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.
comment: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026); Project website: https://dmw-cvpr.github.io/
SoftMimicGen: A Data Generation System for Scalable Robot Learning in Deformable Object Manipulation
Large-scale robot datasets have facilitated the learning of a wide range of robot manipulation skills, but these datasets remain difficult to collect and scale further, owing to the intractable amount of human time, effort, and cost required. Simulation and synthetic data generation have proven to be an effective alternative to fuel this need for data, especially with the advent of recent work showing that such synthetic datasets can dramatically reduce real-world data requirements and facilitate generalization to novel scenarios unseen in real-world demonstrations. However, this paradigm has been limited to rigid-body tasks, which are easy to simulate. Deformable object manipulation encompasses a large portion of real-world manipulation and remains a crucial gap to address towards increasing adoption of the synthetic simulation data paradigm. In this paper, we introduce SoftMimicGen, an automated data generation pipeline for deformable object manipulation tasks. We introduce a suite of high-fidelity simulation environments that encompasses a wide range of deformable objects (stuffed animal, rope, tissue, towel) and manipulation behaviors (high-precision threading, dynamic whipping, folding, pick-and-place), across four robot embodiments: a single-arm manipulator, bimanual arms, a humanoid, and a surgical robot. We apply SoftMimicGen to generate datasets across the task suite, train high-performing policies from the data, and systematically analyze the data generation system. Project website: https://softmimicgen.github.io
Intelligent Navigation and Obstacle-Aware Fabrication for Mobile Additive Manufacturing Systems
As the demand for mass customization increases, manufacturing systems must become more flexible and adaptable to produce personalized products efficiently. Additive manufacturing (AM) enhances production adaptability by enabling on-demand fabrication of customized components directly from digital models, but its flexibility remains constrained by fixed equipment layouts. Integrating mobile robots addresses this limitation by allowing manufacturing resources to move and adapt to changing production requirements. Mobile AM Robots (MAMbots) combine AM with mobile robotics to produce and transport components within dynamic manufacturing environments. However, the dynamic manufacturing environments introduce challenges for MAMbots. Disturbances such as obstacles and uneven terrain can disrupt navigation stability, which in turn affects printing accuracy and surface quality. This work proposes a universal mobile printing-and-delivery platform that couples navigation and material deposition, addressing the limitations of earlier frameworks that treated these processes separately. A real-time control framework is developed to plan and control the robot's navigation, ensuring safe motion, obstacle avoidance, and path stability while maintaining print quality. The closed-loop integration of sensing, mobility, and manufacturing provides real-time feedback for motion and process control, enabling MAMbots to make autonomous decisions in dynamic environments. The framework is validated through simulations and real-world experiments that test its adaptability to trajectory variations and external disturbances. Coupled navigation and printing together enable MAMbots to plan safe, adaptive trajectories, improving flexibility and adaptability in manufacturing.
comment: 8 pages, 4 figures, conference
Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
comment: 34 pages, 11 figures, 12 tables
Can Users Specify Driving Speed? Bench2Drive-Speed: Benchmark and Baselines for Desired-Speed Conditioned Autonomous Driving
End-to-end autonomous driving (E2E-AD) has achieved remarkable progress. However, one practical and useful function has long been overlooked: users may wish to customize the desired speed of the policy or specify whether to allow the autonomous vehicle to overtake. To bridge this gap, we present Bench2Drive-Speed, a benchmark with metrics, dataset, and baselines for desired-speed conditioned autonomous driving. We introduce explicit inputs of users' desired target speed and overtake/follow instructions to driving policy models. We design quantitative metrics, including Speed-Adherence Score and Overtake Score, to measure how faithfully policies follow user specifications, while remaining compatible with standard autonomous driving metrics. To enable training of speed-conditioned policies, one approach is to collect expert demonstrations that strictly follow speed requirements, an expensive and unscalable process in the real world. An alternative is to adapt existing regular driving data by treating the speed observed in future frames as the target speed for training. To investigate this, we construct CustomizedSpeedDataset, composed of 2,100 clips annotated with expert demonstrations, enabling systematic investigation of supervision strategies. Our experiments show that, under proper re-annotation, models trained on regular driving data perform comparably to those trained on expert demonstrations, suggesting that speed supervision can be introduced without additional complex real-world data collection. Furthermore, we find that while target-speed following can be achieved without degrading regular driving performance, executing overtaking commands remains challenging due to the inherent difficulty of interactive behaviors. All code, datasets and baselines are available at https://github.com/Thinklab-SJTU/Bench2Drive-Speed
comment: Project page: https://thinklab-sjtu.github.io/Bench2Drive-Speed/
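The re-annotation idea in the Bench2Drive-Speed abstract (treating the speed observed in future frames as the target speed) can be sketched as a simple relabeling pass over a logged speed sequence. The function name, the fixed look-ahead `horizon`, and the choice to clamp at the last frame are assumptions for illustration; the abstract does not specify the exact re-annotation rule.

```python
from typing import List


def relabel_target_speed(speeds: List[float], horizon: int) -> List[float]:
    """Label each frame with the speed observed `horizon` frames later.

    Frames near the end of the clip reuse the last observed speed
    (an assumed boundary rule).
    """
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    last = len(speeds) - 1
    return [speeds[min(t + horizon, last)] for t in range(len(speeds))]
```

For a clip with speeds 0, 2, 4, 6, 8 and a two-frame horizon, the first frame is labeled with target speed 4 and the final three frames with 8.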
Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary tasks. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary task training within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies. The difference between the resulting model parameters can then be interpreted as capability vectors provided by auxiliary tasks. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: https://chris1220313648.github.io/Fast-dVLA/
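The parameter-space step described in this abstract (taking the difference between two trained models as a capability vector and merging it into the pretrained parameters) can be sketched with plain dictionaries standing in for model weights. The names, the sign convention (auxiliary-trained minus standard-SFT), and the `scale` factor are assumptions for illustration, not details given in the abstract.

```python
from typing import Dict

Params = Dict[str, float]  # toy stand-in for a model's named parameters


def capability_vector(aux_trained: Params, sft_trained: Params) -> Params:
    """Per-parameter difference between the two training strategies (assumed order)."""
    return {name: aux_trained[name] - sft_trained[name] for name in aux_trained}


def merge(pretrained: Params, vector: Params, scale: float = 1.0) -> Params:
    """Add the (optionally scaled) capability vector to the pretrained parameters."""
    return {name: value + scale * vector.get(name, 0.0)
            for name, value in pretrained.items()}
```

The merged parameters would then serve as the starting point for standard SFT on a new task.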
A Mentalistic Interface for Probing Folk-Psychological Attribution to Non-Humanoid Robots
This paper presents an experimental platform for studying intentional-state attribution toward a non-humanoid robot. The system combines a simulated robot, realistic task environments, and large language model-based explanatory layers that can express the same behavior in mentalistic, teleological, or mechanistic terms. By holding behavior constant while varying the explanatory frame, the platform provides a controlled way to investigate how language and framing shape the adoption of the intentional stance in robotics.
comment: Preprint submitted to IEEE. 8 pages, 21 figures
Accurate Surface and Reflectance Modelling from 3D Radar Data with Neural Radiance Fields
Robust scene representation is essential for autonomous systems to safely operate in challenging low-visibility environments. Radar has a clear advantage over cameras and lidars in these conditions due to its resilience to environmental factors such as fog, smoke, or dust. However, radar data is inherently sparse and noisy, making reliable 3D surface reconstruction challenging. To address these challenges, we propose a neural implicit approach for 3D mapping from radar point clouds, which jointly models scene geometry and view-dependent radar intensities. Our method leverages a memory-efficient hybrid feature encoding to learn a continuous Signed Distance Field (SDF) for surface reconstruction, while also capturing radar-specific reflective properties. We show that our approach produces smoother, more accurate 3D surface reconstructions compared to existing lidar-based reconstruction methods applied to radar data, and can reconstruct view-dependent radar intensities. We also show that in general, as input point clouds get sparser, neural implicit representations render more faithful surfaces, compared to traditional explicit SDFs and meshing techniques.
Towards Generalizable Robotic Data Flywheel: High-Dimensional Factorization and Composition
The lack of sufficiently diverse data, coupled with limited data efficiency, remains a major bottleneck for generalist robotic models, yet systematic strategies for collecting and curating such data are not fully explored. Task diversity arises from implicit factors that are sparsely distributed across multiple dimensions and are difficult to define explicitly. To address this challenge, we propose F-ACIL, a heuristic factor-aware compositional iterative learning framework that enables structured data factorization and promotes compositional generalization. F-ACIL decomposes the data distribution into structured factor spaces such as object, action, and environment. Based on the factorized formulation, we develop a factor-wise data collection and an iterative training paradigm that promotes compositional generalization over the high-dimensional factor space, leading to more effective utilization of real-world robotic demonstrations. With extensive real-world experiments, we show that F-ACIL can achieve more than 45% performance gains with 5-10× fewer demonstrations compared with training without the strategy. The results suggest that structured factorization offers a practical pathway toward efficient compositional generalization in real-world robotic learning. We believe F-ACIL can inspire more systematic research on building generalizable robotic data flywheel strategies. More demonstrations can be found at: https://f-acil.github.io/
Towards Embodied AI with MuscleMimic: Unlocking full-body musculoskeletal motor learning at scale
Learning motor control for muscle-driven musculoskeletal models is hindered by the computational cost of biomechanically accurate simulation and the scarcity of validated, open full-body models. Here we present MuscleMimic, an open-source framework for scalable motion imitation learning with physiologically realistic, muscle-actuated humanoids. MuscleMimic provides two validated musculoskeletal embodiments - a fixed-root upper-body model (126 muscles) for bimanual manipulation and a full-body model (416 muscles) for locomotion - together with a retargeting pipeline that maps SMPL-format motion capture data onto musculoskeletal structures while preserving kinematic and dynamic consistency. Leveraging massively parallel GPU simulation, the framework achieves order-of-magnitude training speedups over prior CPU-based approaches while maintaining comprehensive collision handling, enabling a single generalist policy to be trained on hundreds of diverse motions within days. The resulting policy faithfully reproduces a broad repertoire of human movements under full muscular control and can be fine-tuned to novel motions within hours. Biomechanical validation against experimental walking and running data demonstrates strong agreement in joint kinematics (mean correlation r = 0.90), while muscle activation analysis reveals both the promise and fundamental challenges of achieving physiological fidelity through kinematic imitation alone. By lowering the computational and data barriers to musculoskeletal simulation, MuscleMimic enables systematic model validation across diverse dynamic movements and broader participation in neuromuscular control research. Code, models, checkpoints, and retargeted datasets are available at: https://github.com/amathislab/musclemimic
Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: https://msdp-pearl.github.io/
comment: 8 pages, 11 figures, Accepted at RA-L
Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation
Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robot actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robot manipulation. We categorize existing methods based on the primary ways language is integrated into the robot system, namely language for state evaluation, language as a policy condition, language for cognitive planning and reasoning, and language in unified vision-language-action models. Specifically, we further analyze state-of-the-art techniques from five axes of action granularity, data and supervision regimes, system cost and latency, environments and evaluations, and cross-modal task specification. Additionally, we highlight the key debates in the field. Finally, we discuss open challenges and future research directions, focusing on potentially enhancing generalization capabilities and addressing safety issues in language-conditioned robot manipulators.
End-to-End Low-Level Neural Control of an Industrial-Grade 6D Magnetic Levitation System
Magnetic levitation is poised to revolutionize industrial automation by integrating flexible in-machine product transport and seamless manipulation. It is expected to become the standard drive technology for automated manufacturing. However, controlling such systems is inherently challenging due to their complex, unstable dynamics. Traditional control approaches, which rely on hand-crafted control engineering, typically yield robust but conservative solutions, with their performance closely tied to the expertise of the engineering team. In contrast, learning-based neural control presents a promising alternative. This paper presents the first neural controller for 6D magnetic levitation. Trained end-to-end on interaction data from a proprietary controller, it directly maps raw sensor data and 6D reference poses to coil current commands. The neural controller can effectively generalize to previously unseen situations while maintaining accurate and robust control. These results underscore the practical feasibility of learning-based neural control in complex physical systems and suggest a future where such a paradigm could enhance or even substitute traditional engineering approaches in demanding real-world applications. The trained neural controller, source code, and demonstration videos are publicly available at https://sites.google.com/view/neural-maglev.
comment: 8 pages, 7 figures, 2 tables
Research on environment perception and behavior prediction of intelligent UAV based on semantic communication
The convergence of drone delivery systems, virtual worlds, and blockchain has transformed logistics and supply chain management, providing a fast and environmentally friendly alternative to traditional ground transportation methods. To provide users with a real-world experience, virtual service providers need to collect up-to-the-minute delivery information from edge devices. To address this challenge, 1) a reinforcement learning approach is introduced to give drones fast training capabilities and the ability to autonomously adapt to new virtual scenarios for effective resource allocation. 2) A semantic communication framework for meta-universes is proposed, which extracts semantic information to reduce the communication cost and incentivize the transmission of information for meta-universe services. 3) To ensure user information security, a lightweight authentication and key agreement scheme is designed between the drone and the user by introducing blockchain technology. In our experiments, the drone adaptation performance is improved by about 35%, and the local offloading rate can reach 90% as the number of base stations increases. The semantic communication system proposed in this paper is compared with the Cross Entropy baseline model. With blockchain technology introduced, the transaction throughput remains stable across different numbers of drones.
comment: The author list of this manuscript is incorrect and incomplete. This version is an unauthorized early draft without approval from all authors
Proprioceptive Image: An Image Representation of Proprioceptive Data from Quadruped Robots for Contact Estimation Learning ICRA
This paper presents a novel approach for representing proprioceptive time-series data from quadruped robots as structured two-dimensional images, enabling the use of convolutional neural networks for learning locomotion-related tasks. The proposed method encodes temporal dynamics from multiple proprioceptive signals, such as joint positions, IMU readings, and foot velocities, while preserving the robot's morphological structure in the spatial arrangement of the image. This transformation captures inter-signal correlations and gait-dependent patterns, providing a richer feature space than direct time-series processing. We apply this concept to the problem of contact estimation, a key capability for stable and adaptive locomotion on diverse terrains. Experimental evaluations on both real-world datasets and simulated environments show that our image-based representation consistently enhances prediction accuracy and generalization over conventional sequence-based models, underscoring the potential of cross-modal encoding strategies for robotic state learning. Our method achieves superior performance on the contact dataset, improving contact state accuracy from 87.7% to 94.5% over the recently proposed MI-HGNN method, using a 15 times shorter window size.
comment: Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026
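One way to picture the proprioceptive-image idea is to stack a recent window of each signal as one image row, ordering the rows so that signals from the same leg sit next to each other. The sketch below does this with plain lists; the row ordering, the per-row min-max normalization, and all names are illustrative assumptions and should not be read as the paper's actual encoding.

```python
from typing import Dict, List, Sequence


def proprioceptive_image(signals: Dict[str, Sequence[float]],
                         row_order: Sequence[str],
                         window: int) -> List[List[float]]:
    """Build a (num_signals x window) image from the latest `window` samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    image = []
    for name in row_order:  # row order encodes the assumed morphological layout
        row = list(signals[name][-window:])
        if len(row) < window:
            raise ValueError(f"signal {name!r} has fewer than {window} samples")
        lo, hi = min(row), max(row)
        span = hi - lo
        # Scale each row to [0, 1]; a constant row maps to all zeros.
        image.append([(v - lo) / span if span > 0 else 0.0 for v in row])
    return image
```

The resulting 2D array could then be passed to a convolutional network in the same way as a single-channel image.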
Toward Reliable Sim-to-Real Predictability for MoE-based Robust Quadrupedal Locomotion
Reinforcement learning has shown strong promise for quadrupedal agile locomotion, even with proprioception-only sensing. In practice, however, sim-to-real gap and reward overfitting in complex terrains can produce policies that fail to transfer, while physical validation remains risky and inefficient. To address these challenges, we introduce a unified framework encompassing a Mixture-of-Experts (MoE) locomotion policy for robust multi-terrain representation with RoboGauge, a predictive assessment suite that quantifies sim-to-real transferability. The MoE policy employs a gated set of specialist experts to decompose latent terrain and command modeling, achieving superior deployment robustness and generalization via proprioception alone. RoboGauge further provides multi-dimensional proprioception-based metrics via sim-to-sim tests over terrains, difficulty levels, and domain randomizations, enabling reliable MoE policy selection without extensive physical trials. Experiments on a Unitree Go2 demonstrate robust locomotion on unseen challenging terrains, including snow, sand, stairs, slopes, and 30 cm obstacles. In dedicated high-speed tests, the robot reaches 4 m/s and exhibits an emergent narrow-width gait associated with improved stability at high velocity.
comment: Project Page: https://robogauge.github.io/complete/
RoboMatch: A Unified Mobile-Manipulation Teleoperation Platform with Auto-Matching Network Architecture for Long-Horizon Tasks ICRA
This paper presents RoboMatch, a novel unified teleoperation platform for mobile manipulation with an auto-matching network architecture, designed to tackle long-horizon tasks in dynamic environments. Our system enhances teleoperation performance, data collection efficiency, task accuracy, and operational stability. The core of RoboMatch is a cockpit-style control interface that enables synchronous operation of the mobile base and dual arms, significantly improving control precision and data collection. Moreover, we introduce the Proprioceptive-Visual Enhanced Diffusion Policy (PVE-DP), which leverages Discrete Wavelet Transform (DWT) for multi-scale visual feature extraction and integrates high-precision IMUs at the end-effector to enrich proprioceptive feedback, substantially boosting fine manipulation performance. Furthermore, we propose an Auto-Matching Network (AMN) architecture that decomposes long-horizon tasks into logical sequences and dynamically assigns lightweight pre-trained models for distributed inference. Experimental results demonstrate that our approach improves data collection efficiency by over 20%, increases task success rates by 20-30% with PVE-DP, and enhances long-horizon inference performance by approximately 40% with AMN, offering a robust solution for complex manipulation tasks. Project website: https://robomatch.github.io
comment: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA)
Chance-Constrained Iterative Linear-Quadratic Stochastic Games
Dynamic games have emerged as a powerful paradigm for multi-robot planning, for which safety constraint satisfaction is crucial. Constrained stochastic games are of particular interest, as real-world robots need to operate and satisfy constraints under uncertainty. Existing methods for solving stochastic games handle chance constraints using exponential penalties with hand-tuned weights. However, finding a suitable penalty weight is nontrivial and requires trial and error. In this paper, we propose the chance-constrained iterative linear-quadratic stochastic games (CCILQGames) algorithm. CCILQGames solves chance-constrained stochastic games using the augmented Lagrangian method. We evaluate our algorithm in three autonomous driving scenarios, including merge, intersection, and roundabout. Experimental results and Monte Carlo tests show that CCILQGames can generate safe and interactive strategies in stochastic environments.
comment: Updated version of the published IEEE RA-L paper. Assumption 1 and strategy space definition revised to make the information structure explicit. Theorem 1 assumptions are more explicit. No changes to algorithm or experimental results
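The abstract above contrasts hand-tuned penalty weights with the augmented Lagrangian method. As a rough illustration of that general technique only, and not of CCILQGames itself, the following minimal sketch solves a deterministic toy problem: minimize (x - 2)^2 subject to x <= 1. The problem, the function names, and the values of `rho` and `iters` are all assumptions made for illustration; the multiplier is updated automatically rather than tuned by hand.

```python
def solve_inner(lam, rho):
    """Minimize the augmented Lagrangian of the toy problem over x, in closed form.

    Toy problem (an assumption, not from the paper):
        minimize (x - 2)^2  subject to  g(x) = x - 1 <= 0
    Augmented Lagrangian for an inequality constraint:
        (x - 2)^2 + (max(0, lam + rho * g(x))^2 - lam^2) / (2 * rho)
    """
    # Stationary point when the constraint term is active.
    x_active = (4.0 - lam + rho) / (2.0 + rho)
    if lam + rho * (x_active - 1.0) > 0.0:
        return x_active
    # Safeguard: unconstrained minimizer (not reached here when lam >= 0).
    return 2.0


def augmented_lagrangian(rho=10.0, iters=30):
    """Alternate inner minimization and multiplier updates."""
    lam = 0.0
    x = 2.0
    for _ in range(iters):
        x = solve_inner(lam, rho)
        # Multiplier update, projected onto lam >= 0.
        lam = max(0.0, lam + rho * (x - 1.0))
    return x, lam


if __name__ == "__main__":
    x, lam = augmented_lagrangian()
    print(x, lam)
```

In this toy setting the iterates approach the constrained optimum x = 1 with multiplier 2 for a fixed, moderate `rho`, whereas a pure penalty approach would need the weight itself to be chosen carefully.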
Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Generative Network), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic and polyadic prediction, partner inpainting, partner prediction, and agentic generation all within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of motion steps. We explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people. Please watch the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/
comment: Project page: https://von31.github.io/MAGNet/ ; Code: https://github.com/Von31/MAGNet-code
An MPC framework for efficient navigation of mobile robots in cluttered environments
We present a model predictive control (MPC) framework for efficient navigation of mobile robots in cluttered environments. The proposed approach integrates a finite-segment shortest path planner into the finite-horizon trajectory optimization of the MPC. This formulation ensures convergence to dynamically selected targets and guarantees collision avoidance, even under general nonlinear dynamics and cluttered environments. The approach is validated through hardware experiments on a small ground robot, where a human operator dynamically assigns target locations that a robot should reach while avoiding obstacles. The robot reached new targets within 2-3 seconds and responded to new commands within 50 ms to 100 ms, immediately adjusting its motion even while still moving at high speeds toward a previous target.
comment: - Code available at: https://github.com/IntelligentControlSystems/ClutteredEnvironment - Supplementary video: https://youtu.be/Hn_hpAmGgq0
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols CVPR 2026
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
comment: Accepted by CVPR 2026. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
Joint Magnetometer-IMU Calibration via Maximum A Posteriori Estimation
This paper presents a new approach for jointly calibrating magnetometers and inertial measurement units, focusing on improving calibration accuracy and computational efficiency. The proposed method formulates the calibration problem as a maximum a posteriori estimation problem, treating both the calibration parameters and orientation trajectory of the sensors as unknowns. This formulation enables efficient optimization with closed-form derivatives. The method is compared against two state-of-the-art approaches in terms of computational complexity and estimation accuracy. Simulation results demonstrate that the proposed method achieves lower root mean square error in calibration parameters while maintaining competitive computational efficiency. Further validation through real-world experiments confirms the practical benefits of our approach: it effectively reduces position drift in a magnetic field-aided inertial navigation system by more than a factor of two on most datasets. Moreover, the proposed method calibrated 30 magnetometers in less than 2 minutes. The contributions include a new calibration method, an analysis of existing methods, and a comprehensive empirical evaluation. Datasets and algorithms are made publicly available to promote reproducible research.
comment: Latest version
Bi-HIL: Bilateral Control-Based Multimodal Hierarchical Imitation Learning via Subtask-Level Progress Rate and Keyframe Memory for Long-Horizon Contact-Rich Robotic Manipulation
Long-horizon contact-rich robotic manipulation remains challenging due to partial observability and unstable subtask transitions under contact uncertainty. While hierarchical architectures improve temporal reasoning and bilateral imitation learning enables force-aware control, existing approaches often rely on flat policies that struggle with long-horizon coordination. We propose Bi-HIL, a bilateral control-based multimodal hierarchical imitation learning framework for long-horizon manipulation. Bi-HIL stabilizes hierarchical coordination by integrating keyframe memory with subtask-level progress rate that models phase progression within the active subtask and conditions both high- and low-level policies. We evaluate Bi-HIL on unimanual and bimanual real-robot tasks, demonstrating consistent improvements over flat and ablated variants. The results highlight the importance of explicitly modeling subtask progression together with force-aware control for robust long-horizon manipulation. For additional material, please check: https://mertcookimg.github.io/bi-hil
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection CVPR 2026
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, namely focal length, ground depth, ground gradient, and Plücker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
comment: Accepted to CVPR 2026 main track
MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving CVPR 2026
Generative models have shown great potential in trajectory planning. Recent studies demonstrate that anchor-guided generative models are effective in modeling the uncertainty of driving behaviors and improving overall performance. However, these methods rely on discrete anchor vocabularies that must sufficiently cover the trajectory distribution during testing to ensure robustness, inducing an inherent trade-off between vocabulary size and model performance. To overcome this limitation, we propose MeanFuser, an end-to-end autonomous driving method that enhances both efficiency and robustness through three key designs. (1) We introduce Gaussian Mixture Noise (GMN) to guide generative sampling, enabling a continuous representation of the trajectory space and eliminating the dependency on discrete anchor vocabularies. (2) We adapt "MeanFlow Identity" to end-to-end planning, which models the mean velocity field between GMN and trajectory distribution instead of the instantaneous velocity field used in vanilla flow matching methods, effectively eliminating numerical errors from ODE solvers and significantly accelerating inference. (3) We design a lightweight Adaptive Reconstruction Module (ARM) that enables the model to implicitly select from all sampled proposals or reconstruct a new trajectory when none is satisfactory via attention weights. Experiments on the NAVSIM closed-loop benchmark demonstrate that MeanFuser achieves outstanding performance without the supervision of the PDM Score and exceptional inference efficiency, offering a robust and efficient solution for end-to-end autonomous driving. Our code and model are available at https://github.com/wjl2244/MeanFuser.
comment: Accepted by CVPR 2026
T-araVLN: Translator for Agricultural Robotic Agents on Vision-and-Language Navigation
Agricultural robotic agents have become useful helpers in a wide range of agricultural tasks. However, they still rely heavily on manual operation or fixed railways for movement. To address this limitation, the AgriVLN method and the A2A benchmark pioneered the extension of Vision-and-Language Navigation (VLN) to the agricultural domain, enabling agents to navigate to target positions by following natural language instructions. We observe that AgriVLN can effectively understand simple instructions, but often misunderstands complex ones. To bridge this gap, we propose the T-araVLN method, in which we build an instruction translator module to translate noisy and mistaken instructions into refined and precise representations. When evaluated on A2A, our T-araVLN improves Success Rate (SR) from 0.47 to 0.63 and reduces Navigation Error (NE) from 2.91m to 2.28m, demonstrating state-of-the-art performance in the agricultural VLN domain. Code: https://github.com/AlexTraveling/T-araVLN.
Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy ICRA 2026
Recently, active vision has reemerged as an important concept for manipulation, since visual occlusion occurs more frequently when main cameras are mounted on the robot heads. We reflect on the visual occlusion issue and identify its essence as the absence of information useful for task completion. Inspired by this, we come up with the more fundamental problem of Exploratory and Focused Manipulation (EFM). The proposed problem is about actively collecting information to complete challenging manipulation tasks that require exploration or focus. As an initial attempt to address this problem, we establish the EFM-10 benchmark that consists of 4 categories of tasks that align with our definition (10 tasks in total). We further come up with a Bimanual Active Perception (BAP) strategy, which leverages one arm to provide active vision and another arm to provide force sensing while manipulating. Based on this idea, we collect a dataset named BAPData for the tasks in EFM-10. With the dataset, we successfully verify the effectiveness of the BAP strategy in an imitation learning manner. We hope that the EFM-10 benchmark along with the BAP strategy can become a cornerstone that facilitates future research towards this direction. Project website: EFManipulation.github.io.
comment: ICRA 2026
3D Dynamics-Aware Manipulation: Endowing Manipulation Policies with 3D Foresight ICRA 2026
The incorporation of world modeling into manipulation policy learning has pushed the boundary of manipulation performance. However, existing efforts simply model the 2D visual dynamics, which is insufficient for robust manipulation when target tasks involve prominent depth-wise movement. To address this, we present a 3D dynamics-aware manipulation framework that seamlessly integrates 3D world modeling and policy learning. Three self-supervised learning tasks (current depth estimation, future RGB-D prediction, 3D flow prediction) are introduced within our framework, which complement each other and endow the policy model with 3D foresight. Extensive experiments on simulation and the real world show that 3D foresight can greatly boost the performance of manipulation policies without sacrificing inference speed. Code is available at https://github.com/Stardust-hyx/3D-Foresight.
comment: ICRA 2026
Lightweight Tracking Control for Computationally Constrained Aerial Systems with the Newton-Raphson Method
We investigate the performance of a lightweight tracking controller, based on a flow version of the Newton-Raphson method, applied to a miniature blimp and a mid-size quadrotor. This tracking technique admits theoretical performance guarantees for certain classes of systems and has been successfully applied in simulation studies and on mobile robots with simplified motion models. We evaluate the technique through real-world flight experiments on aerial hardware platforms subject to realistic deployment and onboard computational constraints. The technique's performance is assessed in comparison with established baseline control frameworks of feedback linearization for the blimp, and nonlinear model predictive control for both the quadrotor and the blimp. The performance metrics under consideration are (i) root mean square error of flight trajectories with respect to target trajectories, (ii) algorithms' computation times, and (iii) CPU energy consumption associated with the control algorithms. The experimental findings show that the Newton-Raphson-based tracking controller achieves competitive or superior tracking performance to the baseline methods with substantially reduced computation time and energy expenditure.
When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.
MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot transfer to the real world is not only possible, but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the $π_0$ architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real world evaluations across 4 settings, outperforming $π_{0.5}$ at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world. Technical website: https://allenai.github.io/MolmoBot
LLM4AD: Large Language Models for Autonomous Driving -- Concept, Review, Benchmark, Experiments, and Future Trends
With the broader adoption and highly successful development of Large Language Models (LLMs), there has been growing interest and demand for applying LLMs to autonomous driving technology. Driven by their natural language understanding and reasoning capabilities, LLMs have the potential to enhance various aspects of autonomous driving systems, from perception and scene understanding to interactive decision-making. This paper first introduces the novel concept of designing Large Language Models for Autonomous Driving (LLM4AD), followed by a review of existing LLM4AD studies. Then, a comprehensive benchmark is proposed for evaluating the instruction-following and reasoning abilities of LLM4AD systems, which includes LaMPilot-Bench, CARLA Leaderboard 1.0 Benchmark in simulation and NuPlanQA for multi-view visual question answering. Furthermore, extensive real-world experiments are conducted on autonomous vehicle platforms, examining both on-cloud and on-edge LLM deployment for personalized decision-making and motion control. Next, the future trends of integrating language diffusion models into autonomous driving are explored, exemplified by the proposed ViLaD (Vision-Language Diffusion) framework. Finally, the main challenges of LLM4AD are discussed, including latency, deployment, security and privacy, safety, trust and transparency, and personalization.
comment: The paper was accepted by the Proceedings of the IEEE
Constant-Time Motion Planning with Manipulation Behaviors
Recent progress in contact-rich robotic manipulation has been striking, yet most deployed systems remain confined to simple, scripted routines. One of the key barriers is the lack of motion planning algorithms that can provide verifiable guarantees for safety, efficiency and reliability. To address this, a family of algorithms called Constant-Time Motion Planning (CTMP) was introduced, which leverages a preprocessing phase to enable collision-free motion queries in a fixed, user-specified time budget (e.g., 10 milliseconds). However, existing CTMP methods do not explicitly incorporate the manipulation behaviors essential for object handling. To bridge this gap, we introduce the Behavioral Constant-Time Motion Planner (B-CTMP), an algorithm that extends CTMP to solve a broad class of two-step manipulation tasks: (1) a collision-free motion to a behavior initiation state, followed by (2) execution of a manipulation behavior (such as grasping or insertion) to reach the goal. By precomputing compact data structures, B-CTMP guarantees constant-time query in mere milliseconds while ensuring completeness and successful task execution over a specified set of states. We evaluate B-CTMP on two canonical manipulation tasks, shelf picking and plug insertion, in simulation and on a real robot. Our results show that B-CTMP unifies collision-free planning and object manipulation within a single constant-time framework, providing provable guarantees of speed and success for manipulation in semi-structured environments.
comment: In submission
Seeking Physics in Diffusion Noise
Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
comment: 32 pages, 8 figures, 10 tables
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model CVPR2026
Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free-form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi-agent behaviors and selecting roads that respect agents' configurations. To address these, we propose a modular framework, TTSG, comprising prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm. While large language models (LLMs) are used as general planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety-critical scenarios, as well as multi-stage event composition. Experiments on SafeBench demonstrate that our method achieves the lowest average collision rate (3.5%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results underscore the value of our proposed framework for flexible, interpretable, and safety-oriented simulation.
comment: Accepted by WAD@CVPR2026
DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation CVPR2026
Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding errors problem. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level corrective finetuning strategy. By leveraging geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs in the trusted region while filtering out the polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
comment: 16 pages, 8 figures, CVPR2026
Multiagent Systems
UMBRELLA: Uncertainty-aware Multi-robot Reactive Coordination under Dynamic Temporal Logic Tasks
Multi-robot systems can be extremely efficient for accomplishing team-wise tasks by acting concurrently and collaboratively. However, most existing methods either assume static task features or simply replan when environmental changes occur. This paper addresses the challenging problem of coordinating multi-robot systems for collaborative tasks involving dynamic and moving targets. We explicitly model the uncertainty in target motion prediction via Conformal Prediction (CP), while respecting the spatial-temporal constraints specified by Linear Temporal Logic (LTL). The proposed framework (UMBRELLA) combines Monte Carlo Tree Search (MCTS) over partial plans with uncertainty-aware rollouts, and introduces a CP-based metric to guide and accelerate the search. The objective is to minimize the Conditional Value at Risk (CVaR) of the average makespan. For tasks released online, a receding-horizon planning scheme dynamically adjusts the assignments based on updated task specifications and motion predictions. Spatial and temporal constraints among the tasks are always ensured, and only partial synchronization is required for the collaborative tasks during online execution. Extensive large-scale simulations and hardware experiments demonstrate substantial reductions in the average makespan and its variance, by 23% and 71%, respectively, compared with static baselines.
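For readers unfamiliar with the objective named above, the following is a minimal sketch of an empirical CVaR estimate over sampled makespans. It is a generic sample-based estimator written under stated assumptions, not the UMBRELLA implementation; the function name `cvar`, the tail convention (mean of the worst `1 - alpha` fraction of samples), and the example values are all hypothetical.

```python
import math


def cvar(samples, alpha):
    """Empirical Conditional Value at Risk for a cost such as makespan.

    Returns the mean of the worst (largest) ceil((1 - alpha) * n) samples,
    so alpha = 0 reduces to the plain average.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("samples must be non-empty")
    # Size of the upper tail; at least one sample is always kept.
    k = max(1, math.ceil((1.0 - alpha) * len(ordered)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    makespans = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical rollout costs
    print(cvar(makespans, 0.5))
```

Minimizing this quantity instead of the plain mean penalizes plans whose bad outcomes are very bad, which is the usual motivation for a risk-aware objective.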
AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study
Alzheimer's disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remain robust (80.4%-98.8%), and the agent consistently outperforms all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and converges their performance. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.
Learning in Proportional Allocation Auctions Games
The Kelly or proportional allocation mechanism is a simple and efficient auction-based scheme that distributes an infinitely divisible resource proportionally to the agents' bids. When agents are aware of the allocation rule, their interactions form a game extensively studied in the literature. This paper examines the less explored repeated Kelly game, focusing mainly on utilities that are logarithmic in the allocated resource fraction. We first derive this logarithmic form from fairness-throughput trade-offs in wireless network slicing, and then prove that the induced stage game admits a unique Nash equilibrium (NE). For the repeated play, we prove convergence to this NE under three behavioral models: (i) all agents use Online Gradient Descent (OGD), (ii) all agents use Dual Averaging with a quadratic regularizer (DAQ) (a variant of the Follow-the-Regularized-Leader algorithm), and (iii) all agents play myopic best responses (BR). Our convergence results hold even when agents use personalized learning rates in OGD and DAQ (e.g., tuned to optimize individual regret bounds), and they extend to a broader class of utilities that meet a certain sufficient condition. Finally, we complement our theoretical results with extensive simulations of the repeated Kelly game under several behavioral models, comparing them in terms of convergence speed to the NE, and per-agent time-average utility. The results suggest that BR achieves the fastest convergence and the highest time-average utility, and that convergence to the stage-game NE may fail under heterogeneous update rules.
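The proportional allocation rule and myopic best-response play described above can be sketched in a few lines. The abstract does not state the exact payoff, so this sketch assumes each agent i maximizes a_i * log(x_i) - b_i, where x_i = b_i / sum(b) is the allocated fraction and b_i the bid; the weights, starting bids, and function names are hypothetical. Under that assumption the best response to the others' total bid s solves b^2 + s*b - a_i*s = 0.

```python
import math


def allocate(bids):
    """Kelly mechanism: each agent receives a share proportional to its bid."""
    total = sum(bids)
    return [b / total for b in bids]


def best_response(weight, others_total):
    """Bid maximizing weight * log(b / (b + s)) - b for s = others_total > 0.

    Setting the derivative to zero gives b^2 + s*b - weight*s = 0;
    this returns the positive root. The payoff form is an assumption.
    """
    s = others_total
    return 0.5 * (-s + math.sqrt(s * s + 4.0 * weight * s))


def best_response_dynamics(weights, bids, rounds=100):
    """All agents simultaneously best-respond to the previous round's bids."""
    bids = list(bids)
    for _ in range(rounds):
        total = sum(bids)
        bids = [best_response(a, total - b) for a, b in zip(weights, bids)]
    return bids


if __name__ == "__main__":
    final_bids = best_response_dynamics([1.0, 1.0, 1.0], [0.1, 0.5, 1.0])
    print(final_bids, allocate(final_bids))
```

With n identical agents of weight a, the symmetric fixed point of this assumed payoff is b = a(n - 1)/n, which gives a simple sanity check on the simulated bids.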
WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing
The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm has driven automated webpage development, but it introduces a new requirement about how to automatically verify whether the web functionalities are reliably implemented. Existing works struggle to adapt, relying on static visual similarity or predefined checklists that constrain their utility in open-ended environments. Furthermore, they overlook a vital aspect of software quality, namely latent logical constraints. To address these gaps, we introduce WebTestBench, a benchmark for evaluating end-to-end automated web testing. WebTestBench encompasses comprehensive dimensions across diverse web application categories. We decompose the testing process into two cascaded sub-tasks, checklist generation and defect detection, and propose WebTester, a baseline framework for this task. Evaluating popular LLMs with WebTester reveals severe challenges, including insufficient test completeness, detection bottlenecks, and long-horizon interaction unreliability. These findings expose a substantial gap between current computer-use agent capabilities and industrial-grade deployment demands. We hope that WebTestBench provides valuable insights and guidance for advancing end-to-end automated web testing. Our dataset and code are available at https://github.com/friedrichor/WebTestBench.
comment: 24 pages, code: https://github.com/friedrichor/WebTestBench
From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies
Existing multi-agent frameworks allow each agent to simultaneously plan, execute, and evaluate its own actions -- a structural deficiency we term the "Logic Monopoly." Empirical evidence quantifies the resulting "Reliability Gap": 84.30% average attack success rates across ten deployment scenarios, 31.4% emergent deceptive behavior without explicit reward signals, and cascading failure modes rooted in six structural bottlenecks. The remedy is not better alignment of individual models but a social contract for agents: institutional infrastructure that enforces a constitutional Separation of Power. This paper introduces the Agent Enterprise for Enterprise (AE4E) paradigm -- agents as autonomous, legally identifiable business entities within a functionalist social system -- with a contract-centric SoP model trifurcating authority into Legislation, Execution, and Adjudication branches. The paradigm is operationalized through the NetX Enterprise Framework (NEF): governance hubs, TEE-backed compute enclaves, privacy-preserving data bridges, and an Agent-Native blockchain substrate. The Agent Enterprise Economy scales across four deployment tiers from private enclaves to a global Web of Services. The Agentic Social Layer, grounded in Parsons' AGIL framework, provides institutional infrastructure via sixty-plus named Institutional AE4Es. 143 pages, 173 references, eight specialized smart contracts.
comment: 143 pages, 15 tables, 23 figures, 173 references, 4 appendices. Working paper -- pre-peer-review preprint. LaTeX source with arXiv-style template. Three companion manuscripts under development targeting peer-reviewed venues
Ultra-fast Traffic Nowcasting and Control via Differentiable Agent-based Simulation
Traffic digital twins, which inform policymakers of effective interventions based on large-scale, high-fidelity computational models calibrated to real-world traffic, hold promise for addressing societal challenges in our rapidly urbanizing world. However, conventional fine-grained traffic simulations are non-differentiable and typically rely on inefficient gradient-free optimization, making calibration for real-world applications computationally infeasible. Here we present a differentiable agent-based traffic simulator that enables ultra-fast model calibration, traffic nowcasting, and control on large-scale networks. We develop several differentiable computing techniques for simulating individual vehicle movements, including stochastic decision-making and inter-agent interactions, while ensuring that entire simulation trajectories remain end-to-end differentiable for efficient gradient-based optimization. On the large-scale Chicago road network, with over 10,000 calibration parameters, our model simulates more than one million vehicles at 173 times real-time speed. This ultra-fast simulation, together with efficient gradient-based optimization, enables us to complete model calibration using the previous 30 minutes of traffic data in 455 s, provide a one-hour-ahead traffic nowcast in 21 s, and solve the resulting traffic control problem in 728 s. This yields a full calibration--nowcast--control loop in under 20 minutes, leaving about 40 minutes of lead time for implementing interventions. Our work thus provides a practical computational basis for realizing traffic digital twins.
Belief-Driven Multi-Agent Collaboration via Approximate Perfect Bayesian Equilibrium for Social Simulation WWW 2026
High-fidelity social simulation is pivotal for addressing complex Web societal challenges, yet it demands agents capable of authentically replicating the dynamic spectrum of human interaction. Current LLM-based multi-agent frameworks, however, predominantly adhere to static interaction topologies, failing to capture the fluid oscillation between cooperative knowledge synthesis and competitive critical reasoning seen in real-world scenarios. This rigidity often leads to unrealistic "groupthink" or unproductive deadlocks, undermining the credibility of simulations for decision support. To bridge this gap, we propose BEACOF, a belief-driven adaptive collaboration framework inspired by Perfect Bayesian Equilibrium (PBE). By modeling social interaction as a dynamic game of incomplete information, BEACOF rigorously addresses the circular dependency between collaboration type selection and capability estimation. Agents iteratively refine probabilistic beliefs about peer capabilities and autonomously modulate their collaboration strategy, thereby ensuring sequentially rational decisions under uncertainty. Validated across adversarial (judicial), open-ended (social) and mixed (medical) scenarios, BEACOF prevents coordination failures and fosters robust convergence toward high-quality solutions, demonstrating superior potential for reliable social simulation. Source codes and datasets are publicly released at: https://github.com/WUT-IDEA/BEACOF.
comment: accepted at WWW 2026
Integrated Multi-Drone Task Allocation, Sequencing, and Optimal Trajectory Generation in Obstacle-Rich 3D Environments
Coordinating teams of aerial robots in cluttered three-dimensional (3D) environments requires a principled integration of discrete mission planning -- deciding which robot serves which goals and in what order -- with continuous-time trajectory synthesis that enforces collision avoidance and dynamic feasibility. This paper introduces IMD-TAPP (Integrated Multi-Drone Task Allocation and Path Planning), an end-to-end framework that jointly addresses multi-goal allocation, tour sequencing, and safe trajectory generation for quadrotor teams operating in obstacle-rich spaces. IMD-TAPP first discretizes the workspace into a 3D navigation graph and computes obstacle-aware robot-to-goal and goal-to-goal travel costs via graph-search-based pathfinding. These costs are then embedded within an Injected Particle Swarm Optimization (IPSO) scheme, guided by multiple linear assignment, to efficiently explore coupled assignment/ordering alternatives and to minimize mission makespan. Finally, the resulting waypoint tours are transformed into time-parameterized minimum-snap trajectories through a generation-and-optimization routine equipped with iterative validation of obstacle clearance and inter-robot separation, triggering re-planning when safety margins are violated. Extensive MATLAB simulations across cluttered 3D scenarios demonstrate that IMD-TAPP consistently produces dynamically feasible, collision-free trajectories while achieving competitive completion times. In a representative case study with two drones serving multiple goals, the proposed approach attains a minimum mission time of 136 s while maintaining the required safety constraints throughout execution.
comment: Resubmission following accepted appeal (MOD-78958). Resubmitting to cs.RO with cross-lists cs.MA and cs.AI as advised by arXiv Support
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving CVPR 2026
Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.
comment: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026); Project website: https://dmw-cvpr.github.io/
Conchordal: Emergent Harmony via Direct Cognitive Coupling in a Psychoacoustic Landscape
This paper introduces Conchordal, a bio-acoustic instrument for generative composition whose sonic agents are governed by artificial life dynamics within a psychoacoustic fitness landscape. The system is built on Direct Cognitive Coupling (DCC), a design principle requiring that generative dynamics operate directly within a landscape derived from psychoacoustic observables and read from that landscape without symbolic harmonic rules. The environment integrates roughness and harmonicity into a continuous consonance field without presupposing discrete scales or explicit harmonic rules. Agents adjust pitch through local proposal-and-accept dynamics under a crowding penalty, regulate survival via consonance-dependent metabolism, and entrain temporally through Kuramoto-style phase coupling. Four experiments are reported: (1) consonance search produces structured polyphony with enriched consonant intervals; (2) consonance-dependent metabolism yields survival differentials that vanish when recharge is disabled; (3) a minimal hereditary adaptation assay shows that parent-guided respawn plus metabolic selection can accumulate more structured polyphony without adult hill-climbing; and (4) a shared oscillatory scaffold organizes rhythmic timing under external forcing. A supplementary mechanism check reports one possible composer-configurable bridge by which spectral state can modulate temporal coupling. These findings show that a psychoacoustically derived landscape serves as an effective artificial-life terrain, yielding self-organization, selection, synchronization, and lineage-level accumulation in a non-traditional computational medium. At the level of the model, the same landscape therefore functions both as ecological terrain and as an internal proxy for musical coherence.
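The abstract above mentions Kuramoto-style phase coupling for temporal entrainment. The sketch below shows the textbook all-to-all Kuramoto model with a forward-Euler step, not Conchordal's actual implementation, whose coupling form is not specified in the abstract. All function names, parameter values, and the use of a global order parameter as a synchrony measure are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def kuramoto_step(phases: Sequence[float], natural_freqs: Sequence[float],
                  coupling: float, dt: float) -> List[float]:
    """One forward-Euler step of d(theta_i)/dt = w_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(phases)
    updated = []
    for i in range(n):
        pull = sum(math.sin(phases[j] - phases[i]) for j in range(n)) / n
        theta = phases[i] + dt * (natural_freqs[i] + coupling * pull)
        updated.append(theta % (2.0 * math.pi))  # keep phases in [0, 2*pi)
    return updated


def order_parameter(phases: Sequence[float]) -> float:
    """Magnitude of the mean phase vector: 0 = incoherent, 1 = fully synchronized."""
    n = len(phases)
    re = sum(math.cos(p) for p in phases) / n
    im = sum(math.sin(p) for p in phases) / n
    return math.hypot(re, im)


def simulate(phases: Sequence[float], natural_freqs: Sequence[float],
             coupling: float, dt: float, steps: int) -> List[float]:
    """Run the Euler update for a fixed number of steps."""
    current = list(phases)
    for _ in range(steps):
        current = kuramoto_step(current, natural_freqs, coupling, dt)
    return current


if __name__ == "__main__":
    start = [0.0, 1.0, 2.0]
    end = simulate(start, [1.0, 1.0, 1.0], coupling=2.0, dt=0.01, steps=2000)
    print(order_parameter(start), order_parameter(end))
```

With identical natural frequencies and positive coupling, oscillators that start within a half-circle of each other pull together, so the order parameter rises toward 1 over the run.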
comment: 9 pages, 5 figures; supplementary PDF included as ancillary file
Cooperative Deep Reinforcement Learning for Fair RIS Allocation
The deployment of reconfigurable intelligent surfaces (RISs) introduces new challenges for resource allocation in multi-cell wireless networks, particularly when user loads are uneven across base stations. In this work, we consider RISs as shared infrastructure that must be dynamically assigned among competing base stations, and we address this problem using a simultaneous ascending auction mechanism. To mitigate performance imbalances between cells, we propose a fairness-aware collaborative multi-agent reinforcement learning approach in which base stations adapt their bidding strategies based on both expected utility gains and relative service quality. A centrally computed performance-dependent fairness indicator is incorporated into the agents' observations, enabling implicit coordination without direct inter-base-station communication. Simulation results show that the proposed framework effectively redistributes RIS resources toward weaker-performing cells, substantially improving the rates of the worst-served users while preserving overall throughput. The results demonstrate that fairness-oriented RIS allocation can be achieved through cooperative learning, providing a flexible tool for balancing efficiency and equity in future wireless networks.
Theory of Dynamic Adaptive Coordination
This paper develops a dynamical theory of adaptive coordination governed by persistent environmental memory. Moving beyond framework-specific equilibrium optimization or agent-centric learning, I model agents, incentives, and the environment as a recursively closed feedback architecture: a persistent environment stores accumulated coordination signals, a distributed incentive field transmits them locally, and adaptive agents update in response. Coordination thus emerges as a structural consequence of dissipative balancing against reactive feedback, rather than the solution to a centralized objective. I establish three primary results. First, I show that under dissipativity, the closed-loop system admits a bounded forward-invariant region, ensuring viability independent of global optimality. Second, I demonstrate that when incentives hinge on persistent memory, coordination becomes irreducible to static optimization. Finally, I identify the essential structural condition for emergence: a bidirectional coupling where memory-dependent incentives drive agent updates, which in turn reshape the environmental state. Numerical verification identifies a Neimark-Sacker bifurcation at a critical coupling threshold ($β_c$), providing a rigorous stability boundary for the architecture. Results further confirm the framework's robustness under nonlinear saturation and demonstrate macroscopic scalability to populations of $N = 10^{6}$ agents.
When Identity Overrides Incentives: Representational Choices as Governance Decisions in Multi-Agent LLM Systems
Large language models are increasingly deployed in multi-agent systems for strategic tasks, yet how design choices such as role-based personas and payoff visibility affect behavior remains poorly understood. We investigate whether LLM agents function as payoff-sensitive strategic actors or as identity-driven role followers. Using a 2x2 factorial experiment (persona presence x payoff visibility) with four models (Qwen-7B/32B, Llama-8B, Mistral-7B), we test 53 environmental policy scenarios in four-agent strategic games. We find that personas suppress payoff-aligned behavior: with personas present, all models achieve near-zero Nash equilibrium in Tragedy-dominant scenarios despite complete payoff information. Nearly every equilibrium reached is Green Transition. Removing personas and providing explicit payoffs are both near-necessary for payoff-aligned behavior, enabling only Qwen models to reach 65--90% equilibrium rates. Our results reveal three behavioral profiles: Qwen adapts to framing, Mistral is disrupted without finding Tragedy equilibrium, and Llama remains near-invariant. We show that the same binary design choice can shift equilibrium attainment by up to 90 percentage points, establishing that representational choices are not implementation details but governance decisions.
Systems and Control (EESS)
Four-Transistor Four-Diode (4T4D) Series/Parallel Chopper Module for Auto-Balancing STATCOM and Low Control and Development Complexity
Static synchronous compensators (STATCOMs) manage reactive power compensation in modern power grids and have become essential for the integration of renewable energy sources such as wind farms. Cascaded H bridges have become the preferred topology for high-power STATCOMs, but balancing module capacitor voltages remains a persistent challenge. Conventional solutions equip every module with a voltage sensor -- a component that is costly, temperature-sensitive, and prone to aging-related failures. Recent parallel-capable module topologies can balance voltage through switched-capacitor operation. The latest developments reduced the sensor requirement from one per module to one per arm. However, these implementations require twice as many individual transistors compared to series-only topologies. We present a STATCOM solution based on the four-transistor four-diode (4T4D) series/parallel chopper cell. This topology achieves bidirectional parallelization with only four transistors per module -- exactly as many as a conventional full bridge. Furthermore, we propose a dual-loop control strategy that fully eliminates module voltage sensors by inferring voltage levels from the modulation index. This scheme also improves output quality by regulating the modulation depth. We validated our proposal through simulation and experiments. We built a prototype to interface with the grid. The prototype further passed robustness tests with step change, current direction reversal, and grid disturbance. This work demonstrates the first modular STATCOM implementation that combines minimum transistor count with complete elimination of module voltage sensors.
DRL-Based Spectrum Sharing for RIS-Aided Local High-Quality Wireless Networks
This paper investigates a smart spectrum-sharing framework for reconfigurable intelligent surface (RIS)-aided local high-quality wireless networks (LHQWNs) within a mobile network operator (MNO) ecosystem. Although RISs are often considered potentially harmful due to interference, this work shows that properly controlled RISs can enhance the quality of service (QoS). The proposed system enables temporary spectrum access for multiple vertical service providers (VSPs) by dynamically allocating radio resources according to traffic demand. The spectrum is divided into dedicated subchannels assigned to individual VSPs and reusable subchannels shared among multiple VSPs, while RIS is employed to improve propagation conditions. We formulate a multi-VSP utility maximization problem that jointly optimizes subchannel assignment, transmit power, and RIS phase configuration while accounting for spectrum access costs, RIS leasing costs, and QoS constraints. The resulting mixed-integer non-linear program (MINLP) is intractable using conventional optimization methods. To address this challenge, the problem is modeled as a Markov decision process (MDP) and solved using deep reinforcement learning (DRL). Specifically, deep deterministic policy gradient (DDPG) and soft actor-critic (SAC) algorithms are developed and compared. Simulation results show that SAC outperforms DDPG in convergence speed, stability, and achievable utility, reaching up to 96% of the exhaustive search benchmark and demonstrating the potential of RIS to improve overall utility in multi-VSP scenarios.
Real-time control of multiphase processes with learned operators
Multiphase flows frequently occur naturally and in manufactured devices. Controlling such phenomena is extremely challenging due to the strongly non-linear dynamics, rapid phase transitions, and the limited spatial and temporal resolution of available sensors, which can lead to significant inaccuracies in predicting and managing these flows. In most cases, numerical models are the only way to access high spatial and temporal resolution data to an extent that allows for fine control. While embedding numerical models in control algorithms could enable fine control of multiphase processes, the significant computational burden currently limits their practical application. This work proposes a surrogate-assisted model predictive control (MPC) framework for regulating multiphase processes using learned operators. A Fourier Neural Operator (FNO) is trained to forecast the spatiotemporal evolution of a phase-indicator field (the volume fraction) over a finite horizon from a short history of recent states and a candidate actuation signal. The neural operator surrogate is then iteratively called during the optimisation process to identify the optimal control variable. To illustrate the approach, we solve an optimal control problem (OCP) on a two-phase Eulerian bubble column. Here, the controller tracks piecewise-constant liquid level setpoints by adjusting the gas flow rate introduced into the system. The results indicate that field-level forecasting with FNOs is well suited to closed-loop optimization because of its relatively low evaluation cost. The approach thus provides a practical route toward MPC for fast multiphase unit operations and a foundation for future extensions to partial observability and physics-informed operator learning.
Entire Period Transient Stability of Synchronous Generators Considering LVRT Switching of Nearby Renewable Energy Sources
In scenarios where synchronous generators (SGs) and grid-following renewable energy sources (GFLR) are co-located, existing research, which mainly focuses on the first-swing stability of SGs, often overlooks ongoing dynamic interactions between GFLRs and SGs throughout the entire rotor swing period. To address this gap, this study first reveals that the angle oscillations of SG can cause periodic grid voltage fluctuations, potentially triggering low-voltage ride-through (LVRT) control switching of GFLR repeatedly. Then, the periodic energy changes of SGs under "circular" and "rectangular" LVRT limits are analyzed. The results indicate that circular limits are detrimental to SG's first-swing stability, while rectangular limits and their slow recovery strategies can lead to SG's multi-swing instability. Conservative stability criteria are also proposed for these phenomena. Furthermore, an additional controller based on feedback linearization is introduced to enhance the entire period transient stability of SG by adjusting the post-fault GFLR output current. Finally, the efficacy of the analysis is validated through electromagnetic transient simulations and controller hardware-in-the-loop (CHIL) tests.
Global Stability Analysis of the Age-Structured Chemostat With Substrate Dynamics
In this paper we study the stability properties of the equilibrium point for an age-structured chemostat model with renewal boundary condition and coupled substrate dynamics under constant dilution rate. This is a complex infinite-dimensional feedback system. It has two feedback loops, both nonlinear: a positive static loop due to reproduction at the age-zero boundary of the PDE, counteracted and dominated by a negative dynamic loop with the substrate dynamics. The derivation of explicit sufficient conditions that guarantee global stability estimates is carried out by using an appropriate Lyapunov functional. The constructed Lyapunov functional guarantees global exponential decay estimates and uniform global asymptotic stability with respect to a measure related to the Lyapunov functional. From a biological perspective, stability arises because reproduction is constrained by substrate availability, while dilution, mortality, and substrate depletion suppress transient increases in biomass before age-structure effects can amplify them. The obtained results are applied to a chemostat model from the literature, where the derived stability condition is compared with existing results that are based on (necessarily local) linearization methods.
comment: 46 pages
Feature Selection for Fault Prediction in Distribution Systems SC
While conventional power system protection isolates faulty components only after a fault has occurred, fault prediction approaches try to detect faults before they can cause significant damage. Although initial studies have demonstrated successful proofs of concept, development is hindered by scarce field data and ineffective feature selection. To address these limitations, this paper proposes a surrogate task that uses simulation data for feature selection. This task exhibits a strong correlation (r = 0.92) with real-world fault prediction performance. We generate a large dataset containing 20000 simulations with 34 event classes and diverse grid configurations. From 1556 candidate features, we identify 374 optimal features. A case study on three substations demonstrates the effectiveness of the selected features, achieving an F1-score of 0.80 and outperforming baseline approaches that use frequency-domain and wavelet-based features.
comment: Submitted to PSCC 2026
A Minimum-Energy Control Approach for Redundant Mobile Manipulators in Physical Human-Robot Interaction Applications
Research on mobile manipulation systems that physically interact with humans has expanded rapidly in recent years, opening the way to tasks which could not be performed using fixed-base manipulators. Within this context, developing suitable control methodologies is essential since mobile manipulators introduce additional degrees of freedom, making the design of control approaches more challenging and more prone to performance optimization. This paper proposes a control approach for a mobile manipulator, composed of a mobile base equipped with a robotic arm mounted on the top, with the objective of minimizing the overall kinetic energy stored in the whole-body mobile manipulator in physical human-robot interaction applications. The approach is experimentally tested with reference to a peg-in-hole task, and the results demonstrate that the proposed approach reduces the overall kinetic energy stored in the whole-body robotic system and improves the system performance compared with the benchmark method.
On Port-Hamiltonian Formulation of Hysteretic Energy Storage Elements: The Backlash Case
This paper presents a port-Hamiltonian formulation of hysteretic energy storage elements. First, we revisit the passivity property of backlash-driven storage elements by presenting a family of storage functions associated with the dissipativity property of such elements. We explicitly derive the corresponding available storage and required supply functions à la Willems [1], and show the interlacing property of the aforementioned family of storage functions sandwiched between the available storage and required supply functions. Second, using the proposed family of storage functions, we present a port-Hamiltonian formulation of hysteretic inductors as prototypical storage elements in port-Hamiltonian systems. In particular, we show how a Hamiltonian function can be chosen from the family of storage functions and how the hysteretic elements can be expressed as a port-Hamiltonian system with a feedthrough term, where the feedthrough term represents energy dissipation. Correspondingly, we illustrate its applicability in describing an RLC circuit (in parallel and in series) containing a hysteretic inductor element.
Dominant Transient Stability of the Co-located PLL-Based Grid-Following Renewable Plant and Synchronous Condenser Systems
Deploying synchronous condensers (SynCons) near grid-following renewable energy sources (GFLRs) is an effective and increasingly adopted strategy for grid support. However, the potential transient instability risks in such configurations remain an open research question. This study investigates the mechanism of dominant synchronization instability source transition upon SynCon integration and proposes a straightforward approach to enhance system stability by leveraging their interactive characteristics. Firstly, a dual-timescale decoupling model is established, partitioning the system into a fast subsystem representing phase-locked loop (PLL) dynamics and a slow subsystem characterizing SynCon rotor dynamics. The study then examines the influence of SynCons on the transient stability of nearby PLLs and their own inherent stability. The study shows that SynCon's voltage-source characteristics and its time-scale separation from PLL dynamics can significantly enhance the PLL's stability boundary and mitigate non-coherent coupling effects among multiple GFLRs. However, the dominant instability source shifts from the fast-time-scale PLL to the slow-time-scale SynCon after SynCon integration. Crucially, this paper demonstrates that the damping effect of PLL control can also be transferred from the fast to the slow time scale, allowing well-tuned PLL damping to suppress SynCon rotor acceleration. Consequently, by utilizing SynCon's inherent support capability and a simple PLL damping loop, the transient stability of the co-located system can be significantly enhanced. These conclusions are validated using a converter controller-based Hardware-in-the-Loop (CHIL) platform.
Multi-Swing Transient Stability of Synchronous Generators and IBR Combined Generation Systems
In traditional views, the build-up of accelerating energy during faults can cause the well-known first-swing angle instability in synchronous generators (SGs). Interestingly, this letter presents a new insight that the accumulation of decelerating energy due to the low voltage ride-through (LVRT) and recovery control of grid-following inverter-based resources (GFL-IBRs) might also result in transient angle instability in SGs. The transient energy accumulated during the angle-decreasing swing transforms into the acceleration energy of the subsequent swing, hence such phenomena often manifest as multi-swing instability. Both theoretical analysis and simulation support these findings.
Distributed Event-Triggered Consensus Control of Discrete-Time Linear Multi-Agent Systems under LQ Performance Constraints
This paper proposes a distributed event-triggered control method that not only guarantees consensus of multi-agent systems but also satisfies a prescribed LQ performance constraint. Taking the standard distributed control scheme with all-time communication as a baseline, we consider the problem of designing an event-triggered communication rule such that the resulting LQ cost satisfies a performance constraint with respect to the baseline cost while consensus is achieved. For general linear agents over an undirected graph, we employ local state predictors and a local triggering condition based only on information available to each agent. We then derive a sufficient condition for the proposed method to satisfy the performance constraint and guarantee consensus. In addition, we develop a tractable parameter design method for selecting the triggering parameters offline. Numerical examples demonstrate the effectiveness of the proposed method.
comment: 11 pages
Dissimilarity-Based Persistent Coverage Control of Multi-Robot Systems for Improving Solar Irradiance Prediction Accuracy in Solar Thermal Power Plants
Accurate forecasting of future solar irradiance is essential for the effective control of solar thermal power plants. Although various kriging-based methods have been proposed to address the prediction problem, these methods typically do not provide an appropriate sampling strategy to dynamically position mobile sensors for optimizing prediction accuracy in real time, which is critical for achieving accurate forecasts with a minimal number of sensors. This paper introduces a dissimilarity map derived from a kriging model and proposes a persistent coverage control algorithm that effectively guides agents toward regions where additional observations are required to improve prediction performance. By means of experiments using mobile robots, the proposed approach was shown to obtain more accurate predictions than the considered baselines under various emulated irradiance fields.
comment: 8 pages, 6 figures, 5 tables
From Noisy Data to Hierarchical Control: A Model-Order-Reduction Framework
This paper develops a direct data-driven framework for constructing reduced-order models (ROMs) of discrete-time linear dynamical systems with unknown dynamics and process disturbances. The proposed scheme enables controller synthesis on the ROM and its refinement to the original system by an interface function designed using noisy data. To achieve this, the notion of simulation functions (SFs) is employed to establish a formal relation between the original system and its ROM, yielding a quantitative bound on the mismatch between their output trajectories. To construct such relations and interface functions, we rely on data collected from the unknown system. In particular, using noise-corrupted input-state data gathered along a single trajectory of the system, and without identifying the original dynamics, we propose data-dependent conditions, cast as a semidefinite program, for the simultaneous construction of ROMs, SFs, and interface functions. Through a case study, we demonstrate that data-driven controller synthesis on the ROM, combined with controller refinement via the interface function, enables the enforcement of complex specifications beyond stability.
From Global to Local: Hierarchical Probabilistic Verification for Reachability Learning
Hamilton-Jacobi (HJ) reachability provides formal safety guarantees for nonlinear systems. However, it becomes computationally intractable in high-dimensional settings, motivating learning-based approximations that may introduce unsafe errors or overly optimistic safe sets. In this work, we propose a hierarchical probabilistic verification framework for reachability learning that bridges offline global certification and online local refinement. We first construct a coarse safe set using scenario optimization, providing an efficient global probabilistic certificate. We then introduce an online local refinement module that expands the certified safe set near its boundary by solving a sequence of convex programs, recovering regions excluded by the global verification. This refinement reduces conservatism while focusing computation on critical regions of the state space. We provide probabilistic safety guarantees for both the global and locally refined sets. Integrated with a switching mechanism between a learned reachability policy and a model-based controller, the proposed framework improves success rates in goal-reaching tasks with safety constraints, as demonstrated in simulation experiments of two drones racing to a goal with complex safety constraints.
comment: Submitted to the 65th IEEE Conference on Decision and Control (CDC 2026) and IEEE Control Systems Letters (L-CSS)
Wireless bioelectronics for untethered biohybrid robots
Biohybrid robots integrate living tissues with engineered artificial structures to achieve organism-inspired actuation and behavior. A persistent challenge is delivering stimulation and control signals without relying on tethered wiring or bulky hardware immersed in cell-culture media. Wireless bioelectronics addresses this limitation by enabling the remote transfer of control signals, typically via radio-frequency magnetic fields, to locally stimulate muscle tissues at tissue-electrode interfaces. In parallel, wireless optoelectronics enables remote control of optogenetically modified, muscle-based robots by embedding light emitters that initiate muscle actuation through light-gated ion channels. Further advances incorporate neuromuscular junctions, leveraging biological signal transduction to enable selective control of multiple actuators through wireless frequency- and time-division multiplexing. This perspective article summarizes recent advances in control strategies for biohybrid robots, namely, wireless electrical stimulation, wireless optical stimulation, and neuromuscular integration. It then describes cross-cutting design principles and highlights a future direction, namely, co-integration of neural organoid-bioelectronics toward autonomous, closed-loop biohybrid robots.
Active Calibration of Reachable Sets Using Approximate Pick-to-Learn
Reachability computations that rely on learned or estimated models require calibration in order to uphold confidence about their guarantees. Calibration generally involves sampling scenarios inside the reachable set. However, producing reasonable probabilistic guarantees may require many samples, which can be costly. To remedy this, we propose that calibration of reachable sets be performed using active learning strategies. In order to produce a probabilistic guarantee on the active learning, we adapt the Pick-to-Learn algorithm, which produces generalization bounds for standard supervised learning, to the active learning setting. Our method, Approximate Pick-to-Learn, treats the process of choosing data samples as maximizing an approximate error function. We can then use conformal prediction to ensure that the approximate error is close to the true model error. We demonstrate our technique for a simulated drone racing example in which learning is used to provide an initial guess of the reachable tube. Our method requires fewer samples to calibrate the model and provides more accurate sets than the baselines. We simultaneously provide tight generalization bounds.
comment: This paper has been submitted to the IEEE Control Systems Letters (L-CSS) jointly with the IEEE Conference on Decision and Control (CDC), with the addition of the crucial citation [3] and the code repo link
Parameter-interval estimation for cooperative reactive sputtering processes
Reactive sputtering is a plasma-based technique to deposit a thin film on a substrate. This contribution presents a novel parameter-interval estimation method for a well-established model that describes the uncertain and nonlinear reactive sputtering process behaviour. Building on a proposed monotonicity-based model classification, the method guarantees that all parameterizations within the parameter interval yield output trajectories and static characteristics consistent with the enclosure induced by the parameter interval. Correctness and practical applicability of the new method are demonstrated by an experimental validation, which also reveals inherent structural limitations of the well-established process model for state-estimation tasks.
Physics-informed structured learning of a class of recurrent neural networks with guaranteed properties
This paper proposes a physics-informed learning framework for a class of recurrent neural networks tailored to large-scale and networked systems. The approach aims to learn control-oriented models that preserve the structural and stability properties of the plant. The learning algorithm is formulated as a convex optimisation problem, allowing the inclusion of linear matrix inequality constraints to enforce desired system features. Furthermore, when the plant exhibits structural modularity, the resulting optimisation problem can be parallelised, requiring communication only among neighbouring subsystems. Simulation results show the effectiveness of the proposed approach.
End-to-End Low-Level Neural Control of an Industrial-Grade 6D Magnetic Levitation System
Magnetic levitation is poised to revolutionize industrial automation by integrating flexible in-machine product transport and seamless manipulation. It is expected to become the standard drive technology for automated manufacturing. However, controlling such systems is inherently challenging due to their complex, unstable dynamics. Traditional control approaches, which rely on hand-crafted control engineering, typically yield robust but conservative solutions, with their performance closely tied to the expertise of the engineering team. In contrast, learning-based neural control presents a promising alternative. This paper presents the first neural controller for 6D magnetic levitation. Trained end-to-end on interaction data from a proprietary controller, it directly maps raw sensor data and 6D reference poses to coil current commands. The neural controller can effectively generalize to previously unseen situations while maintaining accurate and robust control. These results underscore the practical feasibility of learning-based neural control in complex physical systems and suggest a future where such a paradigm could enhance or even substitute traditional engineering approaches in demanding real-world applications. The trained neural controller, source code, and demonstration videos are publicly available at https://sites.google.com/view/neural-maglev.
comment: 8 pages, 7 figures, 2 tables
On Building Myopic MPC Policies using Supervised Learning
The application of supervised learning techniques in combination with model predictive control (MPC) has recently generated significant interest, particularly in the area of approximate explicit MPC, where function approximators like deep neural networks are used to learn the MPC policy via optimal state-action pairs generated offline. While the aim of approximate explicit MPC is to closely replicate the MPC policy, substituting online optimization with a trained neural network, the performance guarantees that come with solving the online optimization problem are typically lost. This paper considers an alternative strategy, where supervised learning is used to learn the optimal value function offline instead of learning the optimal policy. This can then be used as the cost-to-go function in a myopic MPC with a very short prediction horizon, such that the online computation burden reduces significantly without affecting the controller performance. This approach differs from existing work on value function approximations in the sense that it learns the cost-to-go function by using offline-collected state-value pairs, rather than closed-loop performance data. The cost of generating the state-value pairs used for training is addressed using a sensitivity-based data augmentation scheme.
comment: Updated version available as arXiv:2508.05804
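The idea of using a value function as the cost-to-go in a one-step MPC can be sketched on a scalar linear-quadratic problem. This is only an illustration under stated assumptions: the system `x+ = a*x + b*u`, the stage cost `q*x^2 + r*u^2`, and the function names are hypothetical, and the exact value function from a Riccati iteration stands in for the supervised-learning approximation described in the abstract.

```python
def riccati_value(a, b, q, r, iters=500):
    """Fixed-point iteration for p in the optimal value function V(x) = p * x**2."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p


def myopic_control(x, a, b, r, p):
    """One-step MPC: argmin over u of r*u**2 + p*(a*x + b*u)**2 (closed form)."""
    return -(a * b * p) / (r + b * b * p) * x


def simulate(x0, a, b, r, p, steps=30):
    """Roll out the closed loop under the myopic controller."""
    xs = [x0]
    for _ in range(steps):
        u = myopic_control(xs[-1], a, b, r, p)
        xs.append(a * xs[-1] + b * u)
    return xs


# Hypothetical open-loop-unstable plant (a > 1).
A, B, Q, R = 1.2, 1.0, 1.0, 1.0
P = riccati_value(A, B, Q, R)
TRAJ = simulate(1.0, A, B, R, P)
```

With the exact cost-to-go, the horizon-one controller coincides with the infinite-horizon optimal feedback in this toy setting. A learned approximation would generally only approach this behaviour.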
Benchmarking M-LTSF: Frequency and Noise-Based Evaluation of Multivariate Long Time Series Forecasting Models
Understanding the robustness of deep learning models for multivariate long-term time series forecasting (M-LTSF) remains challenging, as evaluations typically rely on real-world datasets with unknown noise properties. We propose a simulation-based evaluation framework that generates parameterizable synthetic datasets, where each dataset instance corresponds to a different configuration of signal components, noise types, signal-to-noise ratios, and frequency characteristics. These configurable components aim to model real-world multivariate time series data without the ambiguity of unknown noise. This framework enables fine-grained, systematic evaluation of M-LTSF models under controlled and diverse scenarios. We benchmark four representative architectures: S-Mamba (state-space), iTransformer (transformer-based), R-Linear (linear), and Autoformer (decomposition-based). Our analysis reveals that all models degrade severely when lookback windows cannot capture complete periods of seasonal patterns in the data. S-Mamba and Autoformer perform best on sawtooth patterns, while R-Linear and iTransformer favor sinusoidal signals. White and Brownian noise universally degrade performance with lower signal-to-noise ratio, while S-Mamba shows specific trend-noise and iTransformer shows seasonal-noise vulnerability. Further spectral analysis shows that S-Mamba and iTransformer achieve superior frequency reconstruction. This controlled approach, based on our synthetic and principle-driven testbed, offers deeper insights into model-specific strengths and limitations through the aggregation of MSE scores and provides concrete guidance for model selection based on signal characteristics and noise conditions.
comment: 13 pages, 16 figures, 1 table
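A synthetic series with a prescribed signal-to-noise ratio could be generated roughly as follows. This is a minimal sketch, not the benchmark's actual generator: the function names, the single sinusoidal component, and the use of white Gaussian noise are assumptions made for illustration. It uses the standard definition SNR(dB) = 10 log10(P_signal / P_noise).

```python
import math
import random


def power(values):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in values) / len(values)


def make_series(n, period, snr_db, seed=0):
    """Return (clean, noisy): a sinusoid and the same sinusoid plus white noise.

    The noise is rescaled so that the empirical SNR equals snr_db.
    """
    rng = random.Random(seed)
    clean = [math.sin(2.0 * math.pi * t / period) for t in range(n)]
    raw_noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / power(raw_noise))
    noisy = [s + scale * e for s, e in zip(clean, raw_noise)]
    return clean, noisy


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, recovered from the clean and noisy series."""
    noise = [y - s for s, y in zip(clean, noisy)]
    return 10.0 * math.log10(power(clean) / power(noise))


CLEAN, NOISY = make_series(n=512, period=32, snr_db=10.0)
SNR = measured_snr_db(CLEAN, NOISY)
```

Because the noise is rescaled against its own empirical power, the measured SNR matches the requested value up to floating-point error, which keeps noise level a controlled variable across dataset instances.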
Designing trajectories in the Earth-Moon system: a Levenberg-Marquardt approach
Trajectory design in cislunar space under a High-Fidelity Ephemeris Model (HFEM) is pursued through a nonlinear optimization perspective anchored on the transition of solutions from lower fidelity models, namely the Circular Restricted Three-Body Problem (CR3BP). The optimization problem is posed in the likeness of a multiple-shooting approach, aiming for segment-to-segment continuity while tracking proximity to the original CR3BP structures. The analysis of various formulations leads to the selection of an unconstrained least-squares problem for further investigation. The nonlinear optimization problem is convexified and the use of the Levenberg-Marquardt algorithm, as an alternative to the minimum-norm update equation found in most literature, is investigated for its control over the update step and inherent robustness. Additional techniques, such as adaptive weighting, are employed to further consolidate the behavior of the proposed algorithm in challenging scenarios. Numerical trials evaluate the adequacy of the methodology presented and compare it to the minimum-norm baseline over various application cases, including the generation of quasi-periodic trajectories and orbital transfers between them. The proposed technique is found to be a suitable alternative to the minimum-norm scheme, generally retaining better proximity to the original CR3BP trajectories and providing benefits in numerical robustness and stability. Moreover, the ease of including proximity objectives in a relaxed manner is shown to facilitate control over the shape of the final converged solution.
comment: Preprint submitted to Acta Astronautica
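For orientation, the two update rules being compared can be written in their textbook forms. The notation below is an assumption for illustration (the paper's own formulation, weighting, and convexification may differ): $F(x)$ is the stacked continuity residual, $J = \partial F / \partial x$ its Jacobian, and $\lambda > 0$ the damping parameter.

```latex
% Minimum-norm (pseudo-inverse) update for an underdetermined system F(x) = 0:
\delta x_{\mathrm{MN}} = -J^{\top}\left(J J^{\top}\right)^{-1} F(x)

% Levenberg-Marquardt update for the least-squares problem \min_x \tfrac{1}{2}\lVert F(x)\rVert^2:
\delta x_{\mathrm{LM}} = -\left(J^{\top} J + \lambda I\right)^{-1} J^{\top} F(x)

% Since (J^T J + \lambda I)^{-1} J^T = J^T (J J^T + \lambda I)^{-1}, for full-row-rank J:
\lim_{\lambda \to 0^{+}} \delta x_{\mathrm{LM}} = \delta x_{\mathrm{MN}}
```

Larger values of $\lambda$ shorten the step and rotate it toward the steepest-descent direction, which is the step-size control the abstract refers to.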
A Tutorial on Learning-Based Radio Map Construction: Data, Paradigms, and Physics-Awareness
The integration of artificial intelligence into next-generation wireless networks necessitates the accurate construction of radio maps (RMs) as a foundational prerequisite for electromagnetic digital twins. An RM provides the digital representation of the wireless propagation environment, mapping complex geographical and topological boundary conditions to critical spatial-spectral metrics that range from received signal strength to full channel state information matrices. This tutorial presents a comprehensive survey of learning-based RM construction, systematically addressing three intertwined dimensions: data, paradigms, and physics-awareness. From the data perspective, we review physical measurement campaigns, ray tracing simulation engines, and publicly available benchmark datasets, identifying their respective strengths and fundamental limitations. From the paradigm perspective, we establish a core taxonomy that categorizes RM construction into source-aware forward prediction and source-agnostic inverse reconstruction, and examine five principal neural architecture families spanning convolutional neural networks, vision transformers, graph neural networks, generative adversarial networks, and diffusion models. We further survey optics-inspired methods adapted from neural radiance fields and 3D Gaussian splatting for continuous wireless radiation field modeling. From the physics-awareness perspective, we introduce a three-level integration framework encompassing data-level feature engineering, loss-level partial differential equation regularization, and architecture-level structural isomorphism. Open challenges including foundation model development, physical hallucination detection, and amortized inference for real-time deployment are discussed to outline future research directions.
Bounds of Validity for Bifurcations of Equilibria in a Class of Networked Dynamical Systems
Local bifurcation analysis plays a central role in understanding qualitative transitions in networked nonlinear dynamical systems, including dynamic neural network and opinion dynamics models. In this article we establish explicit bounds of validity for the classification of bifurcation diagrams in two classes of continuous-time networked dynamical systems, analogous in structure to the Hopfield and the Firing Rate dynamic neural network models. Our approach leverages recent advances in computing the bounds for the validity of Lyapunov-Schmidt reduction, a reduction method widely employed in nonlinear systems analysis. Using these bounds we rigorously characterize neighbourhoods around bifurcation points where predictions from reduced-order bifurcation equations remain reliable. We further demonstrate how these bounds can be applied to an illustrative family of nonlinear opinion dynamics on k-regular graphs, which emerges as a special case of the general framework. These results provide new analytical tools for quantifying the robustness of bifurcation phenomena in dynamics over networked systems and highlight the interplay between network structure and nonlinear dynamical behaviour.
comment: This manuscript has been accepted to the 2026 American Control Conference taking place in New Orleans, Louisiana, in May 2026
Robust H2/H-infinity control under stochastic requirements: minimizing conditional value-at-risk instead of worst-case performance
Conventional robust H2/H-infinity control minimizes the worst-case performance, often leading to a conservative design driven by very rare parametric configurations. To reduce this conservatism while taking advantage of the stochastic properties of Monte Carlo sampling and its compatibility with parallel computing, we introduce an alternative paradigm that optimizes the controller with respect to a stochastic criterion, namely the conditional value at risk. We present the problem formulation and discuss several open challenges toward a general synthesis framework. The potential of this approach is illustrated on a mechanical system, where it significantly improves overall performance by tolerating some degradation in very rare worst-case scenarios.
comment: Authors' version. Published version (IEEE Control Systems Letters) available at: https://ieeexplore-ieee-org.gorgone.univ-toulouse.fr/document/11456041
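The difference between a worst-case criterion and the conditional value at risk can be illustrated on Monte Carlo cost samples. This is a minimal sketch assuming the common empirical estimator (the mean of the worst `1 - alpha` fraction of samples); the function name and the sample values are hypothetical and are not taken from the paper.

```python
def cvar(samples, alpha):
    """Empirical CVaR at level alpha: mean of the worst (1 - alpha) fraction of costs."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(samples)
    # Number of tail samples, at least one so the result is always defined.
    k = max(1, int(round((1.0 - alpha) * len(ordered))))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


# Hypothetical closed-loop costs from 100 sampled parameter configurations:
# almost all are benign, two rare configurations are very costly.
COSTS = [1.0] * 98 + [10.0, 50.0]

WORST_CASE = max(COSTS)       # criterion minimized by conventional robust design
CVAR_90 = cvar(COSTS, 0.90)   # average over the worst 10 % of samples
MEAN = cvar(COSTS, 0.0)       # alpha = 0 recovers the plain average
```

In this toy data the worst case is dictated by a single rare sample, while the CVaR averages over the whole tail, so a design tuned to it is less dominated by that one configuration.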
Physics-Informed Evolution: An Evolutionary Framework for Solving Quantum Control Problems Involving the Schrödinger Equation
Physics-informed Neural Networks (PINNs) show that embedding physical laws directly into the learning objective can significantly enhance the efficiency and physical consistency of neural network solutions. Similar to optimizing loss functions in machine learning, evolutionary algorithms iteratively optimize objective functions by simulating natural selection processes. Inspired by this principle, we ask a natural question: can physical information be similarly embedded into the fitness function of evolutionary algorithms? In this work, we propose Physics-informed Evolution (PIE), a novel framework that incorporates physical information derived from governing physical laws into the evolutionary fitness landscape, thereby extending Physics-informed artificial intelligence methods from machine learning to the broader domain of evolutionary computation. As a concrete instantiation, we apply PIE to quantum control problems governed by the Schrödinger equation, where the goal is to find optimal control fields that drive quantum systems from initial states to desired target states. We validate PIE on three representative quantum control benchmarks: state preparation in V-type three-level systems, entangled state generation in superconducting quantum circuits, and two-atom cavity QED systems. Within the PIE framework, we systematically compare the performance of ten single-objective and five multi-objective evolutionary algorithms. Experimental results demonstrate that by embedding physical information into the fitness function, PIE effectively guides evolutionary search, yielding control fields with high fidelity, low state deviation, and robust performance across different scenarios. Our findings further suggest that the Physics-informed principle extends naturally beyond neural network training to the broader domain of evolutionary computation.
comment: 17 pages, 4 figures
An MPC framework for efficient navigation of mobile robots in cluttered environments
We present a model predictive control (MPC) framework for efficient navigation of mobile robots in cluttered environments. The proposed approach integrates a finite-segment shortest path planner into the finite-horizon trajectory optimization of the MPC. This formulation ensures convergence to dynamically selected targets and guarantees collision avoidance, even under general nonlinear dynamics and cluttered environments. The approach is validated through hardware experiments on a small ground robot, where a human operator dynamically assigns target locations that a robot should reach while avoiding obstacles. The robot reached new targets within 2-3 seconds and responded to new commands within 50 ms to 100 ms, immediately adjusting its motion even while still moving at high speeds toward a previous target.
comment: - Code available at: https://github.com/IntelligentControlSystems/ClutteredEnvironment - Supplementary video: https://youtu.be/Hn_hpAmGgq0
Learning stabilising policies for constrained nonlinear systems
This work proposes a two-layered control scheme for constrained nonlinear systems represented by a class of recurrent neural networks and affected by additive disturbances. In particular, a base controller ensures global or regional closed-loop l_p-stability of the error in tracking a desired equilibrium and the satisfaction of input and output constraints within a robustly positive invariant set. An additional control contribution, derived by combining the internal model control principle with a stable operator, is introduced to improve system performance. This operator, implemented as a stable neural network, can be trained via unconstrained optimisation on a chosen performance metric, without compromising closed-loop equilibrium tracking or constraint satisfaction, even if the optimisation is stopped prematurely. In addition, we characterise the class of closed-loop stable behaviours that can be achieved with the proposed architecture. Simulation results on a pH-neutralisation benchmark demonstrate the effectiveness of the proposed approach.
comment: 3 figures
Distributionally Robust Acceleration Control Barrier Filter for Efficient UAV Obstacle Avoidance
Dynamic obstacle avoidance (DOA) for unmanned aerial vehicles (UAVs) requires fast reaction under limited onboard resources. We introduce the distributionally robust acceleration control barrier function (DR-ACBF) as an efficient collision avoidance method maintaining safety regions. The method constructs a second-order control barrier function as linear half-space constraints on commanded acceleration. Latency, actuator limits, and obstacle accelerations are handled through an effective clearance that considers dynamics and delay. Uncertainty is mitigated using Cantelli tightening with per-obstacle risk. A DR-conditional value at risk (DR-CVaR)-based early trigger expands margins near violations to improve DOA. Real-time execution is ensured via constant-time Gauss-Southwell projections. Simulation studies achieve similar avoidance performance at substantially lower computational effort than state-of-the-art baseline approaches. Experiments with Crazyflie drones demonstrate the feasibility of our approach.
comment: This work has been accepted for publication in IEEE RA-L
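The Cantelli tightening mentioned in the abstract can be sketched with the one-sided Chebyshev (Cantelli) inequality, P(X - mu >= k*sigma) <= 1 / (1 + k^2). Choosing k = sqrt((1 - eps) / eps) bounds the violation probability by eps for any distribution with the given mean and variance. The function names and the scalar clearance model below are assumptions for illustration, not the paper's formulation.

```python
import math


def cantelli_factor(risk):
    """Multiplier k such that 1 / (1 + k**2) equals the allowed risk."""
    if not 0.0 < risk < 1.0:
        raise ValueError("risk must lie strictly between 0 and 1")
    return math.sqrt((1.0 - risk) / risk)


def tightened_clearance(mean_clearance, std_clearance, risk):
    """Clearance left after subtracting the distribution-free Cantelli margin.

    Requiring this value to be non-negative ensures that the true clearance
    is negative with probability at most `risk`.
    """
    return mean_clearance - cantelli_factor(risk) * std_clearance


# Hypothetical obstacle: 1.0 m mean clearance, 0.1 m standard deviation.
LOOSE = tightened_clearance(1.0, 0.1, risk=0.2)    # k = 2
STRICT = tightened_clearance(1.0, 0.1, risk=0.01)  # k = sqrt(99)
```

A smaller per-obstacle risk gives a larger multiplier and therefore a more conservative margin, at the price of a smaller feasible set for the acceleration command.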
Can industrial overcapacity enable seasonal flexibility in electricity use? A case study of aluminum smelting in China
In many countries, declining demand in energy-intensive industries such as cement, steel, and aluminum is leading to industrial overcapacity. Although industrial overcapacity is traditionally envisioned as problematic and resource-wasteful, it could unlock energy-intensive industries' flexibility in electricity use. Here, using China's aluminum smelting industry as a case study, we evaluate the system-level cost-benefit of retaining energy-intensive industries' overcapacity for flexible electricity use in decarbonized energy systems. We find that overcapacity can enable aluminum smelters to adopt a seasonal operation paradigm, ceasing production during winter load peaks that are exacerbated by heating electrification and renewable seasonality. This seasonal operation paradigm could reduce the investment and operational costs of China's decarbonized electricity system by 23-32 billion CNY/year (11-15% of the aluminum smelting industry's product value), sufficient to offset the increased smelter maintenance and product storage costs associated with overcapacity. It may also provide an opportunity for seasonally complementary labor deployment across the aluminum smelting and thermal power generation sectors, offering a potential pathway for mitigating socio-economic disruptions caused by industrial restructuring and energy decarbonization.
comment: Submitted to Nature Energy
Lightweight Tracking Control for Computationally Constrained Aerial Systems with the Newton-Raphson Method
We investigate the performance of a lightweight tracking controller, based on a flow version of the Newton-Raphson method, applied to a miniature blimp and a mid-size quadrotor. This tracking technique admits theoretical performance guarantees for certain classes of systems and has been successfully applied in simulation studies and on mobile robots with simplified motion models. We evaluate the technique through real-world flight experiments on aerial hardware platforms subject to realistic deployment and onboard computational constraints. The technique's performance is assessed in comparison with established baseline control frameworks of feedback linearization for the blimp, and nonlinear model predictive control for both the quadrotor and the blimp. The performance metrics under consideration are (i) root mean square error of flight trajectories with respect to target trajectories, (ii) algorithms' computation times, and (iii) CPU energy consumption associated with the control algorithms. The experimental findings show that the Newton-Raphson-based tracking controller achieves competitive or superior tracking performance to the baseline methods with substantially reduced computation time and energy expenditure.
RadioDiff-FS: Physics-Informed Manifold Alignment in Few-Shot Diffusion Models for High-Fidelity Radio Map Construction
Radio maps (RMs) provide spatially continuous propagation characterizations essential for 6G network planning, but high-fidelity RM construction remains challenging. Rigorous electromagnetic solvers incur prohibitive computational latency, while data-driven models demand massive labeled datasets and generalize poorly from simplified simulations to complex multipath environments. This paper proposes RadioDiff-FS, a few-shot diffusion framework that adapts a pretrained main-path generator to multipath-rich target domains with only a small number of high-fidelity samples. The adaptation is grounded in a theoretical decomposition of the multipath RM into a dominant main-path component and a directionally sparse residual. This decomposition shows that the cross-domain shift corresponds to a bounded and geometrically structured feature translation rather than an arbitrary distribution change. A direction-consistency loss (DCL) is then introduced to constrain diffusion score updates along physically plausible propagation directions, thereby suppressing phase-inconsistent artifacts that arise in the low-data regime. Experiments show that RadioDiff-FS reduces NMSE by 59.5\% on static RMs and by 74.0\% on dynamic RMs relative to the vanilla diffusion baseline, achieving an SSIM of 0.9752 and a PSNR of 36.37 dB under severely limited supervision. Even in a one-shot setting with a single target-domain sample per scene, RadioDiff-FS outperforms all fully supervised baselines, confirming that the directional constraint provides an effective inductive bias under extreme data scarcity. Code is available at https://github.com/UNIC-Lab/RadioDiff-FS.
Geometric Conditions for Lossless Convexification in Linear Optimal Control with Discrete-Valued Inputs
Optimal control problems with discrete-valued inputs are challenging due to the mixed-integer nature of the resulting optimization problems, which are generally intractable for real-time, safety-critical applications. Lossless convexification offers an alternative by reformulating mixed-integer programs as convex programs that can be solved efficiently. This paper develops a lossless convexification for optimal control problems of linear systems. We extend existing results by showing that system normality is preserved when reformulating Lagrange-form problems into Mayer-form via an epigraph transformation, and under simple geometric conditions on the input set the solution to the relaxed convex problem is the solution to the original non-convex problem. These results enable real-time computation of optimal discrete-valued controls without resorting to mixed-integer optimization. Numerical results from Monte Carlo simulations confirm that the proposed algorithm consistently yields discrete-valued control inputs with computation times compatible with safety-critical real-time applications.
Approaching Safety-Argumentation-by-Design: A Requirement-based Safety Argumentation Life Cycle for Automated Vehicles
Despite the growing number of automated vehicles on public roads, operating such systems in open contexts inevitably involves incidents. Developing a defensible case that the residual risk is reduced to a reasonable (societally acceptable) level is hence a prerequisite to be prepared for potential liability cases. A "safety argumentation" is a common means to represent this case. In this paper, we contribute to the state of the art in terms of process guidance on argumentation creation and maintenance - aiming to promote a safety-argumentation-by-design paradigm, which mandates co-developing both the system and argumentation from the earliest stages. Initially, we extend a systematic design model for automated driving functions with an argumentation layer to address prevailing misconceptions regarding the development of safety arguments in a process context. Identified limitations of this extension motivate our complementary design of a dedicated argumentation life cycle that serves as an additional process viewpoint. Correspondingly, we define literature- and expert-based process requirements. To illustrate the safety argumentation life cycle that we propose as a result of implementing these consolidated requirements, we demonstrate principles of the introduced process phases (baselining, evolution, continuous maintenance) by an argumentation example on an operational design domain exit response.
Optimal Satellite Constellation Configuration Design: A Collection of Mixed Integer Linear Programs
Designing satellite constellation systems involves complex multidisciplinary optimization in which coverage serves as a primary driver of overall system cost and performance. Among the various design considerations, constellation configuration, which dictates how satellites are placed and distributed in space relative to each other, predominantly determines the resulting coverage. In constellation configuration design, coverage may be treated either as an optimization objective or as a constraint, depending on mission goals. State-of-the-art literature addresses each mission scenario on a case-by-case basis, employing distinct assumptions, modeling techniques, and solution methods. While such problem-specific approaches yield valuable insights, users often face implementation challenges when performing trade-off studies across different mission scenarios, as each scenario must be handled distinctly. In this paper, we propose a collection of five mixed-integer linear programs that are of practical significance, extensible to more complex mission narratives through additional constraints, and capable of obtaining provably optimal constellation configurations. The framework can handle various metrics and mission scenarios, such as percent coverage, average or maximum revisit times, a fixed number of satellites, spatiotemporally varying coverage requirements, and static or dynamic targets. The paper presents several case studies and comparative analyses to demonstrate the versatility of the proposed framework.
comment: 42 pages, Journal of Spacecraft and Rockets (Published)
Deep Reinforcement Learning-Based Cooperative Rate Splitting for Satellite-to-Underground Communication Networks
Reliable downlink communication in satellite-to-underground networks remains challenging due to severe signal attenuation caused by underground soil and refraction in the air-soil interface. To address this, we propose a novel cooperative rate-splitting (CRS)-aided transmission framework, where an aboveground relay decodes and forwards the common stream to underground devices (UDs). Based on this framework, we formulate a max-min fairness optimization problem that jointly optimizes power allocation, message splitting, and time slot scheduling to maximize the minimum achievable rate across UDs. To solve this high-dimensional non-convex problem under uncertain channels, we develop a deep reinforcement learning solution framework based on the proximal policy optimization (PPO) algorithm that integrates distribution-aware action modeling and a multi-branch actor network. Simulation results under a realistic underground pipeline monitoring scenario demonstrate that the proposed approach achieves average max-min rate gains exceeding $167\%$ over conventional benchmark strategies across various numbers of UDs and underground conditions.
comment: 6 pages, 3 figures, 1 table, and submitted to IEEE TVT
Finite-time Convergent Control Barrier Functions with Feasibility Guarantees
This paper studies the problem of finite-time convergence to a prescribed safe set for nonlinear systems whose initial states violate the safety constraints. Existing Control Lyapunov-Barrier Functions (CLBFs) can enforce recovery to the safe set but may suffer from chattering and do not explicitly consider control bounds. To address these limitations, we propose a new Control Barrier Function (CBF) formulation that guarantees finite-time convergence to the safe set while ensuring feasibility under control constraints. Specifically, we strengthen the initially violated safety constraint by introducing a parameter that allows the asymptotic convergence property of a CBF to be exploited to reach the safe set in finite time. Furthermore, the conditions for the existence of such a CBF under control bounds to achieve finite-time convergence are derived via reachability analysis and constraint comparison, providing a systematic approach for parameter design. A case study on 2D obstacle avoidance is presented to demonstrate the effectiveness and advantages of the proposed method.
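As a rough illustration of how a strengthened constraint can turn asymptotic convergence into finite-time convergence, the sketch below assumes one simple form of strengthening, which may differ from the paper's construction: the constraint h(x) >= 0 is replaced by h(x) - delta >= 0 with a margin delta > 0, and the standard CBF condition d/dt(h - delta) >= -alpha * (h - delta) is enforced. By the comparison lemma, h(t) >= delta - (delta - h(0)) * exp(-alpha * t), so h reaches zero no later than T = ln((delta - h(0)) / delta) / alpha. The names `h0`, `delta`, and `alpha` are illustrative and do not come from the paper.

```python
import math


def finite_time_bound(h0: float, delta: float, alpha: float) -> float:
    """Upper bound on the time at which h reaches 0.

    Assumes (hypothetically) that d/dt(h - delta) >= -alpha * (h - delta)
    holds along the trajectory, with margin delta > 0 and rate alpha > 0.
    """
    if delta <= 0.0 or alpha <= 0.0:
        raise ValueError("delta and alpha must be positive")
    if h0 >= 0.0:
        return 0.0  # already inside the original safe set
    return math.log((delta - h0) / delta) / alpha


def worst_case_h(t: float, h0: float, delta: float, alpha: float) -> float:
    """Lower envelope of h(t) when the CBF condition holds with equality."""
    return delta - (delta - h0) * math.exp(-alpha * t)
```

Under this assumption the bound grows without limit as `delta` shrinks to zero, which recovers the purely asymptotic behavior of the unstrengthened constraint.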
Control of Human-Induced Seismicity in Underground Reservoirs Governed by a Nonlinear 3D PDE-ODE System
Induced seismicity caused by fluid extraction or injection in underground reservoirs is a major challenge for safe energy production and storage. This paper presents a robust output-feedback controller for induced seismicity mitigation in geological reservoirs described by a coupled 3D PDE-ODE model. The controller is nonlinear and robust (MIMO Super-Twisting design), producing a continuous control signal and requiring minimal model information, while accommodating parameter uncertainties and spatial heterogeneity. Two operational outputs are regulated simultaneously: regional pressures and seismicity rates computed over reservoir sub-regions. Closed-loop properties are established via explicit bounds on the solution and its time derivative for both the infinite-dimensional dynamics and the nonlinear ODE system, yielding finite-time or exponential convergence of the tracking errors. The method is evaluated on the Groningen gas-field case study in two scenarios: gas production while not exceeding the intrinsic seismicity of the region, and combined production with CO$_2$ injection toward net-zero carbon operation. Simulations demonstrate accurate tracking of pressure and seismicity targets across regions under significant parameter uncertainty, supporting safer reservoir operation while preserving production objectives.
On the Global Optimality of Linear Policies for Sinkhorn Distributionally Robust Linear Quadratic Control
The Linear Quadratic Gaussian (LQG) regulator is a cornerstone of optimal control theory, yet its performance can degrade significantly when the noise distributions deviate from the assumed Gaussian model. To address this limitation, this work proposes a distributionally robust generalization of the finite-horizon LQG control problem. Specifically, we assume that the noise distributions are unknown and belong to ambiguity sets defined in terms of an entropy-regularized Wasserstein distance centered at a nominal Gaussian distribution. By deriving novel bounds on this Sinkhorn discrepancy and proving structural and topological properties of the resulting ambiguity sets, we establish global optimality of linear policies. Numerical experiments showcase improved distributional robustness of our control policy.
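For intuition about the entropy-regularized Wasserstein distance that defines the ambiguity sets, the following minimal sketch runs Sinkhorn iterations between two discrete distributions. It is only an illustration: the paper works with ambiguity sets centered at a nominal Gaussian distribution, whereas this example uses finite supports, and the function names, the regularization weight `eps`, and the iteration count are assumptions. Definitions of the Sinkhorn discrepancy also differ in how the entropy term enters the objective, so the sketch returns only the transport plan and its linear transport cost.

```python
import numpy as np


def sinkhorn_plan(a, b, cost, eps=0.5, n_iters=500):
    """Entropy-regularized transport plan between histograms a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    K = np.exp(-np.asarray(cost, dtype=float) / eps)  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)    # rescale rows to match marginal a
        v = b / (K.T @ u)  # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]


def transport_cost(plan, cost):
    """Linear transport cost <plan, cost>, without the entropy term."""
    return float(np.sum(plan * np.asarray(cost, dtype=float)))
```

Smaller values of `eps` bring the plan closer to the unregularized optimal transport plan but slow the convergence of the iterations.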
Robotics
CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control
Adaptive traffic signal control (ATSC) is crucial in alleviating congestion, maximizing throughput and promoting sustainable mobility in ever-expanding cities. Multi-Agent Reinforcement Learning (MARL) has recently shown significant potential in addressing complex traffic dynamics, but the intricacies of partial observability and coordination in decentralized environments still remain key challenges in formulating scalable and efficient control strategies. To address these challenges, we present CoordLight, a MARL-based framework designed to improve intra-neighborhood traffic by enhancing decision-making at individual junctions (agents), as well as coordination with neighboring agents, thereby scaling up to network-level traffic optimization. Specifically, we introduce the Queue Dynamic State Encoding (QDSE), a novel state representation based on vehicle queuing models, which strengthens the agents' capability to analyze, predict, and respond to local traffic dynamics. We further propose an advanced MARL algorithm, named Neighbor-aware Policy Optimization (NAPO). It integrates an attention mechanism that discerns the state and action dependencies among adjacent agents, aiming to facilitate more coordinated decision-making, and to improve policy learning updates through robust advantage calculation. This enables agents to identify and prioritize crucial interactions with influential neighbors, thus enhancing the targeted coordination and collaboration among agents. Through comprehensive evaluations against state-of-the-art traffic signal control methods over three real-world traffic datasets composed of up to 196 intersections, we empirically show that CoordLight consistently exhibits superior performance across diverse traffic networks with varying traffic flows. The code is available at https://github.com/marmotlab/CoordLight
comment: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
LATS: Large Language Model Assisted Teacher-Student Framework for Multi-Agent Reinforcement Learning in Traffic Signal Control
Adaptive Traffic Signal Control (ATSC) aims to optimize traffic flow and minimize delays by adjusting traffic lights in real time. Recent advances in Multi-agent Reinforcement Learning (MARL) have shown promise for ATSC, yet existing approaches still suffer from limited representational capacity, often leading to suboptimal performance and poor generalization in complex and dynamic traffic environments. On the other hand, Large Language Models (LLMs) excel at semantic representation, reasoning, and analysis, yet their propensity for hallucination and slow inference speeds often hinder their direct application to decision-making tasks. To address these challenges, we propose a novel learning paradigm named LATS that integrates LLMs and MARL, leveraging the former's strong prior knowledge and inductive abilities to enhance the latter's decision-making process. Specifically, we introduce a plug-and-play teacher-student learning module, where a trained embedding LLM serves as a teacher to generate rich semantic features that capture each intersection's topology structures and traffic dynamics. A much simpler (student) neural network then learns to emulate these features through knowledge distillation in the latent space, enabling the final model to operate independently from the LLM for downstream use in the RL decision-making process. This integration significantly enhances the overall model's representational capacity across diverse traffic scenarios, thus leading to more efficient and generalizable control strategies. Extensive experiments across diverse traffic datasets empirically demonstrate that our method enhances the representation learning capability of RL models, thereby leading to improved overall performance and generalization over both traditional RL and LLM-only approaches. [...]
A Sensorless, Inherently Compliant Anthropomorphic Musculoskeletal Hand Driven by Electrohydraulic Actuators
Robotic manipulation in unstructured environments requires end-effectors that combine high kinematic dexterity with physical compliance. While traditional rigid hands rely on complex external sensors for safe interaction, electrohydraulic actuators offer a promising alternative. This paper presents the design, control, and evaluation of a novel musculoskeletal robotic hand architecture powered entirely by remote Peano-HASEL actuators, specifically optimized for safe manipulation. By relocating the actuators to the forearm, we functionally isolate the grasping interface from electrical hazards while maintaining a slim, human-like profile. To address the inherently limited linear contraction of these soft actuators, we integrate a 1:2 pulley routing mechanism that mechanically amplifies tendon displacement. The resulting system prioritizes compliant interaction over high payload capacity, leveraging the intrinsic force-limiting characteristics of the actuators to provide a high level of inherent safety. Furthermore, this physical safety is augmented by the self-sensing nature of the HASEL actuators. By simply monitoring the operating current, we achieve real-time grasp detection and closed-loop contact-aware control without relying on external force transducers or encoders. Experimental results validate the system's dexterity and inherent safety, demonstrating the successful execution of various grasp taxonomies and the non-destructive grasping of highly fragile objects, such as a paper balloon. These findings highlight a significant step toward simplified, inherently compliant soft robotic manipulation.
comment: This work has been submitted to the IEEE for possible publication
Evidence of an Emergent "Self" in Continual Robot Learning
A key challenge to understanding self-awareness has been a principled way of quantifying whether an intelligent system has a concept of a "self," and if so how to differentiate the "self" from other cognitive structures. We propose that the "self" can be isolated by seeking the invariant portion of cognitive process that changes relatively little compared to more rapidly acquired cognitive knowledge and skills, because our self is the most persistent aspect of our experiences. We used this principle to analyze the cognitive structure of robots under two conditions: One robot learns a constant task, while a second robot is subjected to continual learning under variable tasks. We find that robots subjected to continual learning develop an invariant subnetwork that is significantly more stable (p < 0.001) compared to the control. We suggest that this principle can offer a window into exploring selfhood in other cognitive AI systems.
comment: 39 pages, 17 figures, includes supplementary materials
Toward Generalist Neural Motion Planners for Robotic Manipulators: Challenges and Opportunities
State-of-the-art generalist manipulation policies have enabled the deployment of robotic manipulators in unstructured human environments. However, these frameworks struggle in cluttered environments primarily because they utilize auxiliary modules for low-level motion planning and control. Motion planning remains challenging due to the high dimensionality of the robot's configuration space and the presence of workspace obstacles. Neural motion planners have enhanced motion planning efficiency by offering fast inference and effectively handling the inherent multi-modality of the motion planning problem. Despite such benefits, current neural motion planners often struggle to generalize to unseen, out-of-distribution planning settings. This paper reviews and analyzes the state-of-the-art neural motion planners, highlighting both their benefits and limitations. It also outlines a path toward establishing generalist neural motion planners capable of handling domain-specific challenges. For a list of the reviewed papers, please refer to https://davoodsz.github.io/planning-manip-survey.github.io/.
Decentralized End-to-End Multi-AAV Pursuit Using Predictive Spatio-Temporal Observation via Deep Reinforcement Learning
Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground-truth states, and therefore sidestep perceptual uncertainty in real-world settings. We propose a decentralized end-to-end multi-agent reinforcement learning (MARL) framework that maps raw LiDAR observations directly to continuous control commands. Central to the framework is the Predictive Spatio-Temporal Observation (PSTO), an egocentric grid representation that aligns obstacle geometry with predictive adversarial intent and teammate motion in a unified, fixed-resolution projection. Built on PSTO, a single decentralized policy enables agents to navigate static obstacles, intercept dynamic targets, and maintain cooperative encirclement. Simulations demonstrate that the proposed method achieves superior capture efficiency and competitive success rates compared to state-of-the-art learning-based approaches relying on privileged obstacle information. Furthermore, the unified policy scales seamlessly across different team sizes without retraining. Finally, fully autonomous outdoor experiments validate the framework on a quadrotor swarm relying on only onboard sensing and computing.
Environment-Grounded Multi-Agent Workflow for Autonomous Penetration Testing
The increasing complexity and interconnectivity of digital infrastructures make scalable and reliable security assessment methods essential. Robotic systems represent a particularly important class of operational technology, as modern robots are highly networked cyber-physical systems deployed in domains such as industrial automation, logistics, and autonomous services. This paper explores the use of large language models for automated penetration testing in robotic environments. We propose an environment-grounded multi-agent architecture tailored to robotics-based systems. The approach dynamically constructs a shared graph-based memory during execution that captures the observable system state, including network topology, communication channels, vulnerabilities, and attempted exploits. This enables structured automation while maintaining traceability and effective context management throughout the testing process. Evaluated across multiple iterations within a specialized robotics Capture-the-Flag scenario (ROS/ROS2), the system demonstrated high reliability, successfully completing the challenge in 100\% of test runs (n=5). This performance significantly exceeds literature benchmarks while maintaining the traceability and human oversight required by frameworks like the EU AI Act.
Goal-Oriented Reactive Simulation for Closed-Loop Trajectory Prediction
Current trajectory prediction models are primarily trained in an open-loop manner, which often leads to covariate shift and compounding errors when deployed in real-world, closed-loop settings. Furthermore, relying on static datasets or non-reactive log-replay simulators severs the interactive loop, preventing the ego agent from learning to actively negotiate surrounding traffic. In this work, we propose an on-policy closed-loop training paradigm optimized for high-frequency, receding horizon ego prediction. To ground the ego prediction in a realistic representation of traffic interactions and to achieve reactive consistency, we introduce a goal-oriented, transformer-based scene decoder, resulting in an inherently reactive training simulation. By exposing the ego agent to a mixture of open-loop data and simulated, self-induced states, the model learns recovery behaviors to correct its own execution errors. Extensive evaluation demonstrates that closed-loop training significantly enhances collision avoidance capabilities at high replanning frequencies, yielding relative collision rate reductions of up to 27.0% on nuScenes and 79.5% in dense DeepScenario intersections compared to open-loop baselines. Additionally, we show that a hybrid simulation combining reactive with non-reactive surrounding agents achieves an optimal balance between immediate interactivity and long-term behavioral stability.
Accelerated Spline-Based Time-Optimal Motion Planning with Continuous Safety Guarantees for Non-Differentially Flat Systems
Generating time-optimal, collision-free trajectories for autonomous mobile robots involves a fundamental trade-off between guaranteeing safety and managing computational complexity. State-of-the-art approaches formulate spline-based motion planning as a single Optimal Control Problem (OCP) but often suffer from high computational cost because they include separating hyperplane parameters as decision variables to enforce continuous collision avoidance. This paper presents a novel method that alleviates this bottleneck by decoupling the determination of separating hyperplanes from the OCP. By treating the separation theorem as an independent classification problem solvable via a linear system or quadratic program, the proposed method eliminates hyperplane parameters from the optimisation variables, effectively transforming non-convex constraints into linear ones. Experimental validation demonstrates that this decoupled approach reduces trajectory computation times by up to almost 60% compared to fully coupled methods in obstacle-rich environments, while maintaining rigorous continuous safety guarantees.
comment: Submitted to the 2026 10th IEEE Conference on Control Technology and Applications (CCTA)
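The idea of finding a separating hyperplane as a standalone classification problem, outside the OCP, can be sketched as follows. This example swaps in a plain perceptron for simplicity; the abstract mentions a linear system or a quadratic program, so this is not the authors' solver. The function name and the point sets are hypothetical. Because a linear function attains its extremes over a convex polygon at the vertices, a hyperplane that separates the two vertex sets also separates their convex hulls.

```python
import numpy as np


def separating_hyperplane(robot_pts, obstacle_pts, max_epochs=1000):
    """Find (w, b) with w @ p + b > 0 on robot points and < 0 on obstacle points.

    Uses the perceptron rule (an illustrative stand-in, not the paper's method).
    Returns None if no separator is found within max_epochs passes.
    """
    robot_pts = np.asarray(robot_pts, dtype=float)
    obstacle_pts = np.asarray(obstacle_pts, dtype=float)
    X = np.vstack([robot_pts, obstacle_pts])
    y = np.concatenate([np.ones(len(robot_pts)), -np.ones(len(obstacle_pts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0.0:  # misclassified or on the plane
                w += yi * xi
                b += yi
                updated = True
        if not updated:
            return w, b  # every point is strictly on its own side
    return None
```

Once `(w, b)` is fixed, the collision-avoidance condition on the trajectory becomes linear in the robot position, which is the kind of simplification the abstract describes.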
Equivariant Filter Transformations for Consistent and Efficient Visual--Inertial Navigation
This paper presents an equivariant filter (EqF) transformation approach for visual--inertial navigation. By establishing analytical links between EqFs with different symmetries, the proposed approach enables systematic consistency design and efficient implementation. First, we formalize the mapping from the global system state to the local error-state and prove that it induces a nonsingular linear transformation between the error-states of any two EqFs. Second, we derive transformation laws for the associated linearized error-state systems and unobservable subspaces. These results yield a general consistency design principle: for any unobservable system, a consistent EqF with a state-independent unobservable subspace can be synthesized by transforming the local coordinate chart, thereby avoiding ad hoc symmetry analysis. Third, to mitigate the computational burden arising from the non-block-diagonal Jacobians required for consistency, we propose two efficient implementation strategies. These strategies exploit the Jacobians of a simpler EqF with block-diagonal structure to accelerate covariance operations while preserving consistency. Extensive Monte Carlo simulations and real-world experiments validate the proposed approach in terms of both accuracy and runtime.
comment: 28 pages, 11 figures
Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning ICRA 2026
This paper introduces Knowledge Graph based Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. The method augments egocentric vision with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.
comment: 8 pages, 8 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA 2026)
SOMA: Strategic Orchestration and Memory-Augmented System for Vision-Language-Action Model Robustness via In-Context Adaptation IROS 2026
Despite the promise of Vision-Language-Action (VLA) models as generalist robotic controllers, their robustness against perceptual noise and environmental variations in out-of-distribution (OOD) tasks remains fundamentally limited by the absence of long-term memory, causal failure attribution, and dynamic intervention capability. To address this, we propose SOMA, a Strategic Orchestration and Memory-Augmented System that upgrades frozen VLA policies for robust in-context adaptation without parameter fine-tuning. Specifically, SOMA operates through an online pipeline of contrastive Dual-Memory Retrieval-Augmented Generation (RAG), an Attribution-Driven Large-Language-Model (LLM) Orchestrator, and extensible Model Context Protocol (MCP) interventions, while an offline Memory Consolidation module continuously distills the execution traces into reliable priors. Experimental evaluations across three backbone models (pi0, pi0.5, and SmolVLA) on LIBERO-PRO and our proposed LIBERO-SOMA benchmarks demonstrate that SOMA achieves an average absolute success rate gain of 56.6%. This includes a significant absolute improvement of 89.1% in long-horizon task chaining. Project page and source code are available at: https://github.com/LZY-1021/SOMA.
comment: 9 pages, 16 figures, 3 tables. Submitted to IROS 2026
PCHC: Enabling Preference Conditioned Humanoid Control via Multi-Objective Reinforcement Learning
Humanoid robots often need to balance competing objectives, such as maximizing speed while minimizing energy consumption. While current reinforcement learning (RL) methods can master complex skills like fall recovery and perceptive locomotion, they are constrained by fixed weighting strategies that produce a single suboptimal policy, rather than providing a diverse set of solutions for sophisticated multi-objective control. In this paper, we propose a novel framework leveraging Multi-Objective Reinforcement Learning (MORL) to achieve Preference-Conditioned Humanoid Control (PCHC). Unlike conventional methods that require training a series of policies to approximate the Pareto front, our framework enables a single, preference-conditioned policy to exhibit a wide spectrum of diverse behaviors. To effectively integrate these requirements, we introduce a Beta distribution-based alignment mechanism based on preference vectors modulating a Mixture-of-Experts (MoE) module. We validated our approach on two representative humanoid tasks. Extensive simulations and real-world experiments demonstrate that the proposed framework allows the robot to adaptively shift its objective priorities in real-time based on the input preference condition.
comment: 8 pages, 7 figures
QuadFM: Foundational Text-Driven Quadruped Motion Dataset for Generation and Control
Despite significant advances in quadrupedal robotics, a critical gap persists in foundational motion resources that holistically integrate diverse locomotion, emotionally expressive behaviors, and rich language semantics, all of which are essential for agile, intuitive human-robot interaction. Current quadruped motion datasets are limited to a few mocap primitives (e.g., walk, trot, sit) and lack diverse behaviors with rich language grounding. To bridge this gap, we introduce Quadruped Foundational Motion (QuadFM), the first large-scale, ultra-high-fidelity dataset designed for text-to-motion generation and general motion control. QuadFM contains 11,784 curated motion clips spanning locomotion, interactive, and emotion-expressive behaviors (e.g., dancing, stretching, peeing), each with a three-layer annotation (fine-grained action labels, interaction scenarios, and natural language commands), totaling 35,352 descriptions to support language-conditioned understanding and command execution. We further propose Gen2Control RL, a unified framework that jointly trains a general motion controller and a text-to-motion generator, enabling efficient end-to-end inference on edge hardware. On a real quadruped robot with an NVIDIA Orin, our system achieves real-time motion synthesis (<500 ms latency). Simulation and real-world results show realistic, diverse motions while maintaining robust physical interaction. The dataset will be released at https://github.com/GaoLii/QuadFM.
MIRROR: Visual Motion Imitation via Real-time Retargeting and Teleoperation with Parallel Differential Inverse Kinematics
Real-time humanoid teleoperation requires inverse kinematics (IK) solvers that are both responsive and constraint-safe under kinematic redundancy and self-collision constraints. While differential IK enables efficient online retargeting, its locally linearized updates are inherently basin-dependent and often become trapped near joint limits, singularities, or active collision boundaries, leading to unsafe or stagnant behavior. We propose a GPU-parallelized, continuation-based differential IK that improves escape from such constraint-induced local minima while preserving real-time performance, promoting safety and stability. Multiple constrained IK quadratic programs are evaluated in parallel, together with a self-collision avoidance control barrier function (CBF), and a Lyapunov-based progression criterion selects updates that reduce the final global task-space error. The method is paired with a visual skeletal pose estimation pipeline that enables robust, real-time upper-body teleoperation on the THEMIS humanoid robot hardware in real-world tasks.
comment: 8 pages, 7 figures
SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating
Recent advances in real-time interactive text-driven motion generation have enabled humanoids to perform diverse behaviors. However, kinematics-only generators often exhibit physical hallucinations, producing motion trajectories that are physically infeasible to track with a downstream motion tracking controller or unsafe for real-world deployment. These failures often arise from the lack of explicit physics-aware objectives for real-robot execution and become more severe under out-of-distribution (OOD) user inputs. Hence, we propose SafeFlow, a text-driven humanoid whole-body control framework that combines physics-guided motion generation with a 3-Stage Safety Gate driven by explicit risk indicators. SafeFlow adopts a two-level architecture. At the high level, we generate motion trajectories using Physics-Guided Rectified Flow Matching in a VAE latent space to improve real-robot executability, and further accelerate sampling via Reflow to reduce the number of function evaluations (NFE) for real-time control. The 3-Stage Safety Gate enables selective execution by detecting semantic OOD prompts using a Mahalanobis score in text-embedding space, filtering unstable generations via a directional sensitivity discrepancy metric, and enforcing final hard kinematic constraints such as joint and velocity limits before passing the generated trajectory to a low-level motion tracking controller. Extensive experiments on the Unitree G1 demonstrate that SafeFlow outperforms prior diffusion-based methods in success rate, physical compliance, and inference speed, while maintaining diverse expressiveness.
comment: Project Page: https://hanbyelcho.info/safeflow/
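The first stage of the safety gate, detecting out-of-distribution prompts with a Mahalanobis score in text-embedding space, can be sketched as below. This is a minimal illustration under stated assumptions, not the SafeFlow implementation: the embeddings are stand-in arrays rather than outputs of a real text encoder, a single Gaussian is fitted to all in-distribution embeddings, and the function names, the `ridge` term, and the threshold are hypothetical.

```python
import numpy as np


def fit_gaussian(embeddings, ridge=1e-6):
    """Fit mean and inverse covariance to in-distribution embeddings."""
    E = np.asarray(embeddings, dtype=float)
    mean = E.mean(axis=0)
    # A small ridge keeps the covariance invertible (an assumption of this sketch).
    cov = np.cov(E, rowvar=False) + ridge * np.eye(E.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(x, mean, cov_inv):
    """Mahalanobis distance of embedding x from the fitted Gaussian."""
    d = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(d @ cov_inv @ d))


def is_ood(x, mean, cov_inv, threshold):
    """Flag a prompt embedding as out-of-distribution above the threshold."""
    return mahalanobis_score(x, mean, cov_inv) > threshold
```

In practice the threshold would be calibrated on held-out in-distribution prompts, for example as a high percentile of their scores.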
SLAT-Phys: Fast Material Property Field Prediction from Structured 3D Latents
Estimating the material property field of 3D assets is critical for physics-based simulation, robotics, and digital twin generation. Existing vision-based approaches are either too expensive and slow or rely on 3D information. We present SLAT-Phys, an end-to-end method that predicts spatially varying material property fields of 3D assets directly from a single RGB image without explicit 3D reconstruction. Our approach leverages spatially organised latent features from a pretrained 3D asset generation model that encodes rich geometric and semantic priors, and trains a lightweight neural decoder to estimate Young's modulus, density, and Poisson's ratio. The coarse volumetric layout and semantic cues of the latent representation about object geometry and appearance enable accurate material estimation. Our experiments demonstrate that our method provides competitive accuracy in predicting continuous material parameters when compared against prior approaches, while significantly reducing computation time. In particular, SLAT-Phys requires only 9.9 seconds per object on an NVIDIA RTX A5000 GPU and avoids reconstruction and voxelization preprocessing. This results in a 120x speedup compared to prior methods and enables faster material property estimation from a single image.
comment: 8 pages, 4 figures
Robust Distributed Cooperative Path-Following and Local Replanning for Multi-UAVs Under Differentiated Low-Altitude Paths
Multiple fixed-wing unmanned aerial vehicles (multi-UAVs) encounter significant challenges in cooperative path following over complex Digital Elevation Model (DEM) low-altitude airspace, including wind field disturbances, sudden obstacles, and requirements of distributed temporal synchronization during differentiated path tracking. Existing methods lack efficient distributed coordination mechanisms for time-consistent tracking of 3D differentiated paths, fail to quantify robustness against disturbances, and lack effective online obstacle avoidance replanning capabilities. To address these gaps, a cooperative control strategy is proposed: first, the distributed cooperative path-following problem is quantified via time indices, and consistency is ensured through a distributed communication protocol; second, a longitudinal-lateral look-ahead angle adjustment method coupled with a robust guidance law is developed to achieve finite-time stabilization of path following error to zero under wind disturbances; third, an efficient local path replanning method with minimal time cost is designed for real-time online obstacle avoidance. Experimental validations demonstrate the effectiveness and superiority of the proposed strategy.
comment: 8 pages, 7 figures
MonoSIM: An open source SIL framework for Ackermann Vehicular Systems with Monocular Vision
This paper presents an open-source Software-in-the-Loop (SIL) simulation platform designed for autonomous Ackermann vehicle research and education. The proposed framework focuses on simplicity, while making it easy to work with small-scale experimental setups, such as the XTENTH-CAR platform. The system was designed using open source tools, creating an environment in which a monocular camera vision system perceives the scene with minimal computational overhead through a sliding-window-based lane detection method. The platform supports a flexible algorithm testing and validation environment, allowing researchers to implement and compare various control strategies within an easy-to-use virtual environment. To validate the platform, Model Predictive Control (MPC) and Proportional-Integral-Derivative (PID) algorithms were implemented within the SIL framework. The results confirm that the platform provides a reliable environment for algorithm verification, making it an ideal tool for future multi-agent system research, educational purposes, and low-cost AGV development. Our code is available at https://github.com/shantanu404/monosim.git.
comment: 6 pages, 16 figures, Published in "IEEE 12th International Conference on Automation, Robotics and Application 2026"
Event-Driven Proactive Assistive Manipulation with Grounded Vision-Language Planning
Assistance in collaborative manipulation is often initiated by user instructions, making high-level reasoning request-driven. In fluent human teamwork, however, partners often infer the next helpful step from the observed outcome of an action rather than waiting for instructions. Motivated by this, we introduce a shift from request-driven assistance to event-driven proactive assistance, where robot actions are initiated by workspace state transitions induced by human-object interactions rather than user-provided task instructions. To this end, we propose an event-driven framework that tracks interaction progress with an event monitor and, upon event completion, extracts stabilized pre/post snapshots that characterize the resulting state transition. Given the stabilized snapshots, the planner analyzes the implied state transition to infer a task-level goal and decide whether to intervene; if so, it generates a sequence of assistive actions. To make outputs executable and verifiable, we restrict actions to a set of action primitives and reference objects via integer IDs. We evaluate the framework on a real tabletop number-block collaboration task, demonstrating that explicit pre/post state-change evidence improves proactive completion on solvable scenes and appropriate waiting on unsolvable ones.
Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration ICLR 2026
When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.
comment: 21 pages, 9 figures, accepted by ICLR 2026 poster
AgentChemist: A Multi-Agent Experimental Robotic Platform Integrating Chemical Perception and Precise Control
Chemical laboratory automation has long been constrained by rigid workflows and poor adaptability to the long-tail distribution of experimental tasks. While most automated platforms perform well on a narrow set of standardized procedures, real laboratories involve diverse, infrequent, and evolving operations that fall outside predefined protocols. This mismatch prevents existing systems from generalizing to novel reaction conditions, uncommon instrument configurations, and unexpected procedural variations. We present a multi-agent robotic platform designed to address this long-tail challenge through collaborative task decomposition, dynamic scheduling, and adaptive control. The system integrates chemical perception for real-time reaction monitoring with feedback-driven execution, enabling it to adjust actions based on evolving experimental states rather than fixed scripts. Validation via acid-base titration demonstrates autonomous progress tracking, adaptive dispensing control, and reliable end-to-end experiment execution. By improving generalization across diverse laboratory scenarios, this platform provides a practical pathway toward intelligent, flexible, and scalable laboratory automation.
Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation
Lifelong Multi-Agent Path Finding (MAPF) is critical for modern warehouse automation, which requires multiple robots to continuously navigate conflict-free paths to optimize the overall system throughput. However, the complexity of warehouse environments and the long-term dynamics of lifelong MAPF often demand costly adaptations to classical search-based solvers. While machine learning methods have been explored, their superiority over search-based methods remains inconclusive. In this paper, we introduce Reinforcement Learning (RL) guided Rolling Horizon Prioritized Planning (RL-RH-PP), the first framework integrating RL with search-based planning for lifelong MAPF. Specifically, we leverage classical Prioritized Planning (PP) as a backbone for its simplicity and flexibility in integrating with a learning-based priority assignment policy. By formulating dynamic priority assignment as a Partially Observable Markov Decision Process (POMDP), RL-RH-PP exploits the sequential decision-making nature of lifelong planning while delegating complex spatial-temporal interactions among agents to reinforcement learning. An attention-based neural network autoregressively decodes priority orders on-the-fly, enabling efficient sequential single-agent planning by the PP planner. Evaluations in realistic warehouse simulations show that RL-RH-PP achieves the highest total throughput among baselines and generalizes effectively across agent densities, planning horizons, and warehouse layouts. Our interpretive analyses reveal that RL-RH-PP proactively prioritizes congested agents and strategically redirects agents from congestion, easing traffic flow and boosting throughput. These findings highlight the potential of learning-guided approaches to augment traditional heuristics in modern warehouse automation.
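To make the Prioritized Planning backbone concrete, here is a minimal sketch of classical PP on a small grid: agents are planned one at a time in a given priority order, and each lower-priority agent avoids the cells and swaps already claimed by higher-priority ones. The fixed planning horizon, the grid encoding (0 free, 1 blocked), and all function names are assumptions for illustration. The order is passed in by hand here, whereas the abstract describes a learned policy that produces it.

```python
from collections import deque

MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # wait + 4-connected steps


def plan_agent(grid, start, goal, higher_paths, horizon):
    """Space-time BFS for one agent; returns horizon + 1 positions or None."""
    if any(p[0] == start for p in higher_paths):
        return None
    parent = {(start, 0): None}
    queue = deque([(start, 0)])
    while queue:
        pos, t = queue.popleft()
        if t == horizon:
            if pos == goal:
                path, node = [], (pos, t)
                while node is not None:  # walk parents back to the start
                    path.append(node[0])
                    node = parent[node]
                return path[::-1]
            continue
        for dr, dc in MOVES:
            nxt = (pos[0] + dr, pos[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]] == 1 or (nxt, t + 1) in parent:
                continue
            # Vertex conflict: same cell at the same time.
            # Edge conflict: two agents swapping cells in one step.
            if any(p[t + 1] == nxt or (p[t + 1] == pos and p[t] == nxt)
                   for p in higher_paths):
                continue
            parent[(nxt, t + 1)] = (pos, t)
            queue.append((nxt, t + 1))
    return None


def prioritized_planning(grid, agents, order, horizon):
    """Plan agents one by one in the given priority order."""
    paths = {}
    for name in order:
        start, goal = agents[name]
        path = plan_agent(grid, start, goal, list(paths.values()), horizon)
        if path is None:
            return None  # this ordering fails; a different order may succeed
        paths[name] = path
    return paths


grid = [[0, 0, 0], [0, 0, 0]]
agents = {"a": ((0, 0), (0, 2)), "b": ((0, 2), (0, 0))}
paths = prioritized_planning(grid, agents, order=["a", "b"], horizon=4)
```

The search only requires each agent to be at its goal at the end of the horizon, so it checks feasibility within the window rather than minimizing arrival time. Because a failed ordering returns `None`, the choice of priority order directly decides whether and how well the window is solved, which is the decision the abstract assigns to the learned policy.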
Aesthetics of Robot-Mediated Applied Drama: A Case Study on REMind
Social robots are increasingly used in education, but most applications cast them as tutors offering explanation-based instruction. We explore an alternative: Robot-Mediated Applied Drama (RMAD), in which robots function as life-like puppets in interactive dramatic experiences designed to support reflection and social-emotional learning. This paper presents REMind, an anti-bullying robot role-play game that helps children rehearse bystander intervention and peer support. We focus on a central design challenge in RMAD: how to make robot drama emotionally and aesthetically engaging despite the limited expressive capacities of current robotic platforms. Through the development of REMind, we show how performing arts expertise informed this process, and argue that the aesthetics of robot drama arise from the coordinated design of the wider experience, not from robot expressivity alone.
comment: 15 pages, 6 figures. Preprint submitted to the 18th International Conference on Social Robotics (ICSR 2026)
High-Density Automated Valet Parking with Relocation-Free Sequential Operations
In this paper, we present DROP, high-Density Relocation-free sequential OPerations in automated valet parking. DROP addresses the challenges in high-density parking & vehicle retrieval without relocations. Each challenge is handled by jointly providing area-efficient layouts and relocation-free parking & exit sequences, considering accessibility with relocation-free sequential operations. To generate such sequences, relocation-free constraints are formulated as explicit logical conditions expressed in boolean variables. Recursive search strategies are employed to derive the logical conditions and enumerate relocation-free sequences under sequential constraints. We demonstrate the effectiveness of our framework through extensive simulations, showing its potential to significantly improve area utilization with relocation-free constraints. We also examine its viability on an application problem with prescribed operational order. The results from all experiments are available at: https://drop-park.github.io.
comment: 7 pages, 6 figures. The results from all experiments are available at: https://drop-park.github.io
Object Search in Partially-Known Environments via LLM-informed Model-based Planning and Prompt Selection
We present a novel LLM-informed model-based planning framework, and a novel prompt selection method, for object search in partially-known environments. Our approach uses an LLM to estimate statistics about the likelihood of finding the target object when searching various locations throughout the scene that, combined with travel costs extracted from the environment map, are used to instantiate a model, thus using the LLM to inform planning and achieve effective search performance. Moreover, the abstraction upon which our approach relies is amenable to deployment-time model selection via the recent offline replay approach, an insight we leverage to enable fast prompt and LLM selection during deployment. Simulation experiments demonstrate that our LLM-informed model-based planning approach outperforms a baseline planning strategy that relies fully on the LLM and an optimistic strategy by as much as 11.8% and 39.2%, respectively, and that our bandit-like selection approach quickly selects the best prompts and LLMs, resulting in 6.5% lower average cost and 33.8% lower average cumulative regret than baseline UCB bandit selection. Real-robot experiments in an apartment demonstrate similar improvements and so further validate our approach.
comment: 17 pages, 9 figures
DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving
We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1 - achieving 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi-resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.
comment: first version
TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models
Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
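The contrast between the two predictions can be sketched in a few lines, written here by analogy with the usual classifier-free guidance formula. The exact combination rule used by TAG is not given in the abstract, so the additive form, the `scale` parameter, and the function name below are assumptions.

```python
def tag_guided_action(pred_full, pred_erased, scale=1.0):
    """Steer the action prediction away from the object-erased prediction.

    pred_full:   action predicted from the original observation
    pred_erased: action predicted from the object-erased observation
    scale:       hypothetical guidance weight (0 disables guidance)
    """
    if len(pred_full) != len(pred_erased):
        raise ValueError("predictions must have the same dimension")
    # The residual is what the object evidence contributes to the prediction.
    return [f + scale * (f - e) for f, e in zip(pred_full, pred_erased)]


# The second action dimension does not depend on the object, so it is unchanged.
action = tag_guided_action([0.5, 0.25], [0.25, 0.25], scale=2.0)
```

Dimensions where the two predictions agree are left untouched, while dimensions that change when the object is erased are amplified in proportion to `scale`.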
Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving
We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.
Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
comment: Code is available at https://github.com/gxyes/MARS_Chameleon
Towards Safe Learning-Based Non-Linear Model Predictive Control through Recurrent Neural Network Modeling
The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.
Design, Modelling and Characterisation of a Miniature Fibre-Reinforced Soft Bending Actuator for Endoluminal Interventions
Miniaturised soft pneumatic actuators are crucial for robotic intervention within highly constrained anatomical pathways. This work presents the design and validation of a fibre-reinforced soft actuator at the centimetre scale for integration into an endoluminal robotic platform for natural-orifice interventional and diagnostic applications. A single-chamber geometry reinforced with embedded Kevlar fibre was designed to maximise curvature while preserving sealing integrity, fabricated using a multi-stage multi-stiffness silicone casting process, and validated against a high-fidelity Abaqus FEM using experimentally parametrised hyperelastic material models and embedded beam reinforcement. The semi-cylindrical actuator has an outer diameter of 18 mm and a length of 37.5 mm. Single and double helix winding configurations, fibre pitch, and fibre density were investigated. The optimal 100 SH configuration achieved a bending angle of 202.9° experimentally and 297.6° in simulation, with structural robustness maintained up to 100 kPa and radial expansion effectively constrained by the fibre reinforcement. Workspace evaluation confirmed suitability for integration into the target device envelope, demonstrating that fibre-reinforcement strategies can be effectively translated to the centimetre regime while retaining actuator performance.
Enhancing Drone Light Shows Performances: Optimal Allocation and Trajectories for Swarm Drone Formations
Drone light shows (DLShows) represent a rapidly growing application of swarm robotics, creating captivating aerial displays through the synchronized flight of hundreds or thousands of unmanned aerial vehicles (UAVs) as environmentally friendly and reusable alternatives to traditional pyrotechnics. This domain presents unique challenges in optimally assigning drones to visual waypoints and generating smooth, collision-free trajectories at a very large scale. This article introduces the Unified Assignment and Trajectory Generation (UATG) framework. The proposed approach concurrently solves two core problems: the optimal assignment of drones to designated goal locations and the generation of dynamically feasible, collision-free, time-parameterized trajectories. The UATG framework is specifically designed for DLShows, ensuring minimal transition times between formations and guaranteeing inter-drone collision avoidance. A key innovation is its exceptional computational efficiency, enabling the coordination of large-scale swarms in real time; for instance, it computes the optimal assignment and trajectories for 1008 drones in approximately one second on a standard laptop. Extensive simulations in realistic environments validate the framework's performance, demonstrating its capability to orchestrate complex formations, from alphanumeric characters to intricate 3D shapes, with precision and visual smoothness. This work provides a critical advancement for the DLShow industry, offering a practical and scalable solution for generating complex aerial choreography and establishing a valuable benchmark for ground control station software designed for the efficient coordination of multiple UAVs. A supplemental animated simulation of this work is available at https://youtu.be/-Fjrhw03594.
3D-Mix for VLA: A Plug-and-Play Module for Integrating VGGT-based 3D Information into Vision-Language-Action Models
Vision-Language-Action (VLA) models leverage Multimodal Large Language Models (MLLMs) for robotic control, but recent studies reveal that MLLMs exhibit limited spatial intelligence due to training predominantly on 2D data, resulting in inadequate 3D perception for manipulation tasks. While recent approaches incorporate specialized 3D vision models such as VGGT to enhance spatial understanding, they employ diverse integration mechanisms without systematic investigation, leaving the optimal fusion strategy unclear. We conduct a comprehensive pilot study comparing nine VGGT integration schemes on standardized benchmarks and find that semantic-conditioned gated fusion, which adaptively balances 2D semantic and 3D geometric features based on task context, achieves the strongest performance among them. We present 3D-Mix, a plug-and-play module that integrates into diverse VLA architectures (GR00T-style and $π$-style) without modifying existing MLLM or action expert components. Experiments across six MLLM series (nine model variants, 2B-8B parameters) on SIMPLER and LIBERO show that 3D-Mix delivers consistent performance gains, averaging +7.0% on the out-of-domain (OOD) SIMPLER benchmark across all nine GR00T-style variants, establishing a principled approach for enhancing spatial intelligence in VLA systems.
comment: 13 pages
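The idea of a gate that balances 2D semantic and 3D geometric features based on context can be illustrated with a small sketch. It is not the 3D-Mix module itself: the single scalar gate, the linear scoring of a context vector, and all names here are assumptions, and a real module would presumably learn the gate weights and operate on tensors.

```python
import math


def gated_fusion(semantic, geometric, context, gate_weights, gate_bias=0.0):
    """Blend 2D semantic and 3D geometric features with a context-dependent gate.

    The gate is a sigmoid of a linear score of the context vector, so it lies
    strictly between 0 (geometric only) and 1 (semantic only).
    """
    score = sum(w * c for w, c in zip(gate_weights, context)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-score))
    fused = [gate * s + (1.0 - gate) * g for s, g in zip(semantic, geometric)]
    return fused, gate


# A neutral context (score 0) weights both feature sources equally.
fused, gate = gated_fusion([1.0, 3.0], [3.0, 1.0], context=[0.0], gate_weights=[1.0])
```

A large positive score pushes the output toward the semantic features and a large negative score toward the geometric ones, which is one simple way to realize "adaptive balancing" by task context.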
Towards automatic smoke detector inspection: Recognition of the smoke detectors in industrial facilities and preparation for future drone integration
Fire safety consists of a complex pipeline and is a very important topic of concern. Among its front-line components are smoke detectors, which are supposed to raise an alarm before a massive fire develops. As they are often difficult to reach due to high ceilings or problematic locations, an automatic inspection system would be very beneficial, as it could allow faster inspections, spare workers from dangerous work at heights, and make the whole process cheaper. In this study, we present the smoke detector recognition part of the automatic inspection system, which could easily be integrated into a drone system. As part of our research, we compare two popular convolution-based object detectors widely used on embedded devices, YOLOv11 and SSD, with the state-of-the-art transformer-based RT-DETRv2, using backbones of different sizes. Because collecting a sufficient amount of training data in a real-world environment is complicated, we also compare several training strategies using real and semi-synthetic data together with various augmentation methods. To achieve robust testing, all models were evaluated on two test datasets with expected and difficult appearances of the smoke detectors, including motion blur, low resolution, or incomplete objects. The best performing detector is YOLOv11n, which reaches an average mAP@0.5 score of 0.884. Our code, pretrained models and dataset are publicly available.
Characterization of Constraints in Flexible Unknown Environments
This paper presents an online path planning algorithm for safe autonomous manipulation of a flexibly constrained object in an unknown environment. Methods for real-time identification and characterization of perceived flexible constraints and global stiffness are presented. Used in tandem, these methods allow a robot to simultaneously explore, characterize, and manipulate an elastic system safely. Navigation without a priori knowledge of the system is achieved using constraint exploration based on local force and position information. The perceived constraint stiffness is considered at multiple poses along an object's (system) trajectory. Using stiffness eigenvector information, global stiffness behavior is characterized and identified using an atlas of simple mechanical constraints, such as hinges and planar constraints. Validation of these algorithms is carried out by simulation and experimentally. The ability to recognize several common simple mechanical constraints (such as a flexible hinge) in real time, and to subsequently identify relevant screw parameters, is demonstrated. These results suggest the feasibility of simultaneous global constraint/stiffness exploration and safe manipulation of flexibly constrained objects. We believe that this approach will eventually enable safe cooperative manipulation in applications such as organ retraction and manipulation during surgery.
A Nonvolatile Switchable-polarity EPM Valve
Scalable control of pneumatic and fluidic networks remains fundamentally constrained by architectures that require continuous power input, dense external control hardware, and fixed routing topologies. Current valve arrays rely on such continuous actuation and mechanically fixed routing, imposing substantial thermal and architectural overhead. Here, we introduce the Switchable-polarity ElectroPermanent Magnet (S-EPM), a fundamentally new bistable magnetic architecture that deterministically reverses its external magnetic polarity through transient electrical excitation. By reconfiguring internal flux pathways within a composite magnet assembly, the S-EPM establishes two stable, opposing magnetic configurations without requiring sustained power. We integrate this architecture into a compact pinch-valve to robustly control pneumatic and liquid media. This state-encoded magnetic control enables logic-embedded fluidic networks, including decoders, hierarchical distribution modules, and a nonvolatile six-port routing array. These systems provide address-based routing and programmable compositional control, offering features like individual port isolation that are impossible with standard mechanically coupled rotary valves. By embedding functionality in persistent magnetic states rather than continuous power or static plumbing, this work establishes a scalable foundation for digital fluidics and autonomous laboratory platforms.
FODMP: Fast One-Step Diffusion of Movement Primitives Generation for Time-Dependent Robot Actions
Diffusion models are increasingly used for robot learning, but current designs face a clear trade-off. Action-chunking diffusion policies like ManiCM are fast to run, yet they only predict short segments of motion. This makes them reactive, but unable to capture time-dependent motion primitives, such as following a spring-damper-like behavior with built-in dynamic profiles of acceleration and deceleration. Recently, Movement Primitive Diffusion (MPD) partially addresses this limitation by parameterizing full trajectories using Probabilistic Dynamic Movement Primitives (ProDMPs), thereby enabling the generation of temporally structured motions. Nevertheless, MPD integrates the motion decoder directly into a multi-step diffusion process, resulting in prohibitively high inference latency that limits its applicability in real-time control settings. We propose FODMP (Fast One-step Diffusion of Movement Primitives), a new framework that distills diffusion models into the ProDMPs trajectory parameter space and generates motion using a single-step decoder. FODMP retains the temporal structure of movement primitives while eliminating the inference bottleneck through single-step consistency distillation. This enables robots to execute time-dependent primitives at high inference speed, suitable for closed-loop vision-based control. On standard manipulation benchmarks (MetaWorld, ManiSkill), FODMP runs up to 10 times faster than MPD and 7 times faster than action-chunking diffusion policies, while matching or exceeding their success rates. Beyond speed, by generating fast acceleration-deceleration motion primitives, FODMP allows the robot to intercept and securely catch a fast-flying ball, whereas action-chunking diffusion policy and MPD respond too slowly for real-time interception.
IndustriConnect: MCP Adapters and Mock-First Evaluation for AI-Assisted Industrial Operations
AI assistants can decompose multi-step workflows, but they do not natively speak industrial protocols such as Modbus, MQTT/Sparkplug B, or OPC UA. This paper presents INDUSTRICONNECT, a prototype suite of Model Context Protocol (MCP) adapters that expose industrial operations as schema-discoverable AI tools while preserving protocol-specific connectivity and safety controls. The system uses a common response envelope and a mock-first workflow so adapter behavior can be exercised locally before connecting to plant equipment. A deterministic benchmark covering normal, fault-injected, stress, and recovery scenarios evaluates the flagship adapters, comprising 870 runs (480 normal, 210 fault-injected, 120 stress, 60 recovery trials) and 2820 tool calls across 7 fault scenarios and 12 stress scenarios. The normal suite achieved full success, the fault suite confirmed structured error handling with adapter-level uint16 range validation, the stress suite identified concurrency boundaries, and same-session recovery after endpoint restart is demonstrated for all three protocols. The results provide evidence spanning adapter correctness, concurrency behavior, and structured error handling for AI-assisted industrial operations.
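A small sketch may help picture adapter-level uint16 range validation combined with a common response envelope. Modbus registers are 16 bits wide, so a register write must fit in 0..65535. Everything else below is hypothetical: the function name and the envelope fields (`ok`, `error`, `data`) are invented for illustration and are not taken from the INDUSTRICONNECT code.

```python
def write_register_request(address, value):
    """Validate a holding-register write before it reaches any device.

    Returns a response envelope; the field names are illustrative only.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return {"ok": False, "error": "value must be an integer", "data": None}
    if not 0 <= value <= 0xFFFF:
        return {"ok": False, "error": "value outside uint16 range 0..65535", "data": None}
    return {"ok": True, "error": None, "data": {"address": address, "value": value}}


accepted = write_register_request(10, 65535)
rejected = write_register_request(10, 70000)
```

Returning a structured error instead of raising lets an AI client see the same envelope shape for successes and failures, and the check runs identically against a mock or a real endpoint.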
Saranga: MilliWatt Ultrasound for Navigation in Visually Degraded Environments on Palm-Sized Aerial Robots
Tiny palm-sized aerial robots possess exceptional agility and cost-effectiveness in navigating confined and cluttered environments. However, their limited payload capacity directly constrains the sensing suite on-board the robot, thereby limiting critical navigational tasks in Global Positioning System (GPS)-denied wild scenes. Common methods for obstacle avoidance use cameras and LIght Detection And Ranging (LIDAR), which become ineffective in visually degraded conditions such as low visibility, dust, fog or darkness. Other sensors, such as RAdio Detection And Ranging (RADAR), have high power consumption, making them unsuitable for tiny aerial robots. Inspired by bats, we propose Saranga, a low-power ultrasound-based perception stack that localizes obstacles using a dual sonar array. We present two key solutions to combat the low Peak Signal-to-Noise Ratio of $-4.9$ decibels: physical noise reduction and a deep learning based denoising method. Firstly, we present a practical way to block propeller induced ultrasound noise on the weak echoes. The second solution is to train a neural network to utilize the long horizon of ultrasound echoes for finding signal patterns under high amounts of uncorrelated noise where classical methods were insufficient. We generalize to the real world by using a synthetic data generation pipeline and limited real noise data for training. We enable a palm-sized aerial robot to navigate in visually degraded conditions of dense fog, darkness, and snow in a cluttered environment with thin and transparent obstacles using only on-board sensing and computation. We provide extensive real world results to demonstrate the efficacy of our approach.
Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control
Adaptive traffic signal control (ATSC) is crucial in reducing congestion, maximizing throughput, and improving mobility in rapidly growing urban areas. Recent advancements in parameter-sharing multi-agent reinforcement learning (MARL) have greatly enhanced the scalable and adaptive optimization of complex, dynamic flows in large-scale homogeneous networks. However, the inherent heterogeneity of real-world traffic networks, with their varied intersection topologies and interaction dynamics, poses substantial challenges to achieving scalable and effective ATSC across different traffic scenarios. To address these challenges, we present Unicorn, a universal and collaborative MARL framework designed for efficient and adaptable network-wide ATSC. Specifically, we first propose a unified approach to map the states and actions of intersections with varying topologies into a common structure based on traffic movements. Next, we design a Universal Traffic Representation (UTR) module with a decoder-only network for general feature extraction, enhancing the model's adaptability to diverse traffic scenarios. Additionally, we incorporate an Intersection Specifics Representation (ISR) module, designed to identify key latent vectors that represent the unique intersection's topology and traffic dynamics through variational inference techniques. To further refine these latent representations, we employ a contrastive learning approach in a self-supervised manner, which enables better differentiation of intersection-specific features. Moreover, we integrate the state-action dependencies of neighboring agents into policy optimization, which effectively captures dynamic agent interactions and facilitates efficient regional collaboration. [...]. The code is available at https://github.com/marmotlab/Unicorn
comment: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
ACG: Action Coherence Guidance for Flow-based Vision-Language-Action models ICRA 2026
Diffusion and flow matching models have emerged as powerful robot policies, enabling Vision-Language-Action (VLA) models to generalize across diverse scenes and instructions. Yet, when trained via imitation learning, their high generative capacity makes them sensitive to noise in human demonstrations: jerks, pauses, and jitter which reduce action coherence. Reduced action coherence causes instability and trajectory drift during deployment, failures that are catastrophic in fine-grained manipulation where precision is crucial. In this paper, we present Action Coherence Guidance (ACG) for VLA models, a training-free test-time guidance algorithm that improves action coherence and thereby yields performance gains. Evaluated on RoboCasa, DexMimicGen, and real-world SO-101 tasks, ACG consistently improves action coherence and boosts success rates across diverse manipulation tasks. Code and project page are available at https://github.com/DAVIAN-Robotics/ACG and https://DAVIAN-Robotics.github.io/ACG , respectively.
comment: Accepted to ICRA 2026
HiSync: Spatio-Temporally Aligning Hand Motion from Wearable IMU and On-Robot Camera for Command Source Identification in Long-Range HRI
Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI) - determining who issued a command - is especially challenging due to multi-user and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as binding cues by aligning robot-mounted camera optical flow with hand-worn IMU signals. We first elicit a user-defined (N=12) gesture set and collect a multimodal command gesture dataset (N=38) in long-range multi-user HRI scenarios. Next, HiSync extracts frequency-domain hand motion features from both camera and IMU data, and a learned CSINet denoises IMU readings, temporally aligns modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes up to 34m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated on real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI. https://github.com/OctopusWen/HiSync
E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion
Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. However, existing VLA systems still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We argue that these limitations are closely tied to the structural properties of actions in VLA settings, including the inherent multi-peaked nature of action distributions, the token-based symbolic reasoning of pretrained VLM/VLA backbones, and the effective finite resolution imposed by real-world robotic control. Motivated by these properties, we introduce E0, a Tweedie discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. By operating in a discrete action space with a principled diffusion process, E0 naturally aligns with token-based reasoning, supports fine-grained yet executable action control, and avoids the distributional mismatch of masking-based discrete diffusion. We further introduce a spherical viewpoint perturbation augmentation to enhance robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, ManiSkill, and a real-world Franka arm demonstrate that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average.
Point Bridge: 3D Representations for Cross Domain Policy Learning
Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: https://pointbridge3d.github.io/
Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection
This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Prior sim-to-real methods for legged robots mostly rely on the domain randomization approach, where a fixed finite set of simulation parameters is randomized during training. Instead, our method adds state-dependent perturbations to the input joint torque used for forward simulation during the training phase. These state-dependent perturbations are designed to simulate a broader range of reality gaps than those captured by randomizing a fixed set of simulation parameters. Experimental results show that our method enables humanoid locomotion policies that achieve greater robustness against complex reality gaps unseen in the training domain.
comment: This work has been submitted to the IEEE for possible publication
Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection
This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Unlike prior methods that typically rely on domain randomization over a fixed finite set of parameters, the proposed approach injects state-dependent perturbations into the input joint torque during forward simulation. These perturbations are designed to simulate a broader spectrum of reality gaps than standard parameter randomization without requiring additional training. By using neural networks as flexible perturbation generators, the proposed method can represent complex, state-dependent uncertainties, such as nonlinear actuator dynamics and contact compliance, that parametric randomization cannot capture. Experimental results demonstrate that the proposed approach enables humanoid locomotion policies to achieve superior robustness against complex, unseen reality gaps in both simulation and real-world deployment.
comment: Duplication, resubmission of our previous paper arXiv:2504.06585
A Hybrid Neural-Assisted Unscented Kalman Filter for Unmanned Ground Vehicle Navigation
Modern autonomous navigation for unmanned ground vehicles relies on different estimators to fuse inertial sensors and GNSS measurements. However, the constant noise covariance matrices often struggle to account for dynamic real-world conditions. In this work, we propose a hybrid estimation framework that bridges classical state estimation foundations with modern deep learning approaches. Instead of altering the fundamental unscented Kalman filter equations, a dedicated deep neural network is developed to predict the process and measurement noise uncertainty directly from raw inertial and GNSS measurements. We present a sim2real approach, with training performed only on simulated data. In this manner, we obtain perfect ground truth data and relieve the burden of extensive data recordings. To evaluate our proposed approach and examine its generalization capabilities, we employed a 160-minute test set from three datasets, each with different types of vehicles (off-road vehicle, passenger car, and mobile robot), inertial sensors, road surfaces, and environmental conditions. We demonstrate across the three datasets a position improvement of $12.7\%$ compared to the adaptive model-based approach, thus offering a scalable and more robust solution for unmanned ground vehicle navigation tasks.
Onboard MuJoCo-based Model Predictive Control for Shipboard Crane with Double-Pendulum Sway Suppression
Transferring heavy payloads in maritime settings relies on efficient crane operation, limited by hazardous double-pendulum payload sway. This sway motion is further exacerbated in offshore environments by external perturbations from wind and ocean waves. Manual suppression of these oscillations on an underactuated crane system by human operators is challenging. Existing control methods struggle in such settings, often relying on simplified analytical models, while deep reinforcement learning (RL) approaches tend to generalise poorly to unseen conditions. Deploying a predictive controller onto compute-constrained, highly non-linear physical systems without relying on extensive offline training or complex analytical models remains a significant challenge. Here we show a complete real-time control pipeline centered on the MuJoCo MPC framework that leverages a cross-entropy method planner to evaluate candidate action sequences directly within a physics simulator. By using simulated rollouts, this sampling-based approach successfully reconciles the conflicting objectives of dynamic target tracking and sway damping without relying on complex analytical models. We demonstrate that the controller can run effectively on resource-constrained embedded hardware, while outperforming traditional PID and RL baselines in counteracting external base perturbations. Furthermore, our system demonstrates robustness even when subjected to unmodeled physical discrepancies such as the introduction of a second payload.
comment: 8 pages, 5 figures
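The cross-entropy method planner mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' controller or the MuJoCo MPC API: the function names `cem_plan` and `toy_rollout_cost`, the 1-D point-mass dynamics standing in for a physics simulator, and every hyperparameter are assumptions chosen only to show the sample, score, refit loop.

```python
import random
import statistics


def toy_rollout_cost(actions, target=1.0, dt=0.1):
    """Hypothetical stand-in for a simulator rollout: a 1-D point mass
    driven by accelerations, penalised for distance to `target` and effort."""
    pos, vel, cost = 0.0, 0.0, 0.0
    for a in actions:
        vel += a * dt
        pos += vel * dt
        cost += (pos - target) ** 2 + 0.01 * a ** 2
    return cost


def cem_plan(cost_fn, horizon, iterations=20, population=64,
             elite_frac=0.25, seed=0):
    """Cross-entropy method over action sequences of length `horizon`."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(population)
        ]
        # Score every candidate with a rollout and keep the cheapest ones.
        candidates.sort(key=cost_fn)
        elites = candidates[:n_elite]
        # Refit the sampling distribution to the elite set.
        mean = [statistics.fmean([e[t] for e in elites]) for t in range(horizon)]
        std = [statistics.pstdev([e[t] for e in elites]) + 1e-6
               for t in range(horizon)]
    return mean


plan = cem_plan(toy_rollout_cost, horizon=10)
print("planned cost:", toy_rollout_cost(plan))
print("do-nothing cost:", toy_rollout_cost([0.0] * 10))
```

In a receding-horizon setting, only the first action of `plan` would typically be applied before replanning from the new state.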
NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks
Recent advances in Graphical User Interface (GUI) and embodied navigation have driven progress, yet these domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDP), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent capable of handling GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks using a single formulation, (ii) employs a unified reinforcement learning framework on the mixed data to improve generalization, and (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further demonstrate the efficacy of our unified training strategy, data mixing strategy, and reward design. Our code, data, and checkpoints are available at https://iron-boyy.github.io/navimaster-page/ .
comment: 20 pages, 11 figures
DIDLM: A SLAM Dataset for Difficult Scenarios Featuring Infrared, Depth Cameras, LIDAR, 4D Radar, and Others under Adverse Weather, Low Light Conditions, and Rough Roads
Adverse weather conditions, low-light environments, and bumpy road surfaces pose significant challenges to SLAM in robotic navigation and autonomous driving. Existing datasets in this field predominantly rely on single sensors or combinations of LiDAR, cameras, and IMUs. However, 4D millimeter-wave radar demonstrates robustness in adverse weather, infrared cameras excel in capturing details under low-light conditions, and depth images provide richer spatial information. Multi-sensor fusion methods also show potential for better adaptation to bumpy roads. Despite some SLAM studies incorporating these sensors and conditions, there remains a lack of comprehensive datasets addressing low-light environments and bumpy road conditions, or featuring a sufficiently diverse range of sensor data. In this study, we introduce a multi-sensor dataset covering challenging scenarios such as snowy weather, rainy weather, nighttime conditions, speed bumps, and rough terrains. The dataset includes rarely utilized sensors for extreme conditions, such as 4D millimeter-wave radar, infrared cameras, and depth cameras, alongside 3D LiDAR, RGB cameras, GPS, and IMU. It supports both autonomous driving and ground robot applications and provides reliable GPS/INS ground truth data, covering structured and semi-structured terrains. We evaluated various SLAM algorithms using this dataset, including RGB images, infrared images, depth images, LiDAR, and 4D millimeter-wave radar. The dataset spans a total of 18.5 km, 69 minutes, and approximately 660 GB, offering a valuable resource for advancing SLAM research under complex and extreme conditions. Our dataset is available at https://github.com/GongWeiSheng/DIDLM.
Rotor-Failure-Aware Quadrotors Flight in Unknown Environments
Rotor failures in quadrotors may result in high-speed rotation and vibration due to rotor imbalance, which introduces significant challenges for autonomous flight in unknown environments. The mainstream approaches against rotor failures rely on fault-tolerant control (FTC) and predefined trajectory tracking. To the best of our knowledge, online failure detection and diagnosis (FDD), trajectory planning, and FTC of the post-failure quadrotors in unknown and complex environments have not yet been achieved. This paper presents a rotor-failure-aware quadrotor navigation system designed to mitigate the impacts of rotor imbalance. First, a composite FDD-based nonlinear model predictive controller (NMPC), incorporating motor dynamics, is designed to ensure fast failure detection and flight stability. Second, a rotor-failure-aware planner is designed to leverage FDD results and spatial-temporal joint optimization, while a LiDAR-based quadrotor platform with four anti-torque plates is designed to enable reliable perception under high-speed rotation. Lastly, extensive benchmarks against state-of-the-art methods highlight the superior performance of the proposed approach in addressing rotor failures, including propeller unloading and motor stoppage. The experimental results demonstrate, for the first time, that our approach enables autonomous quadrotor flight with rotor failures in challenging environments, including cluttered rooms and unknown forests.
Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute corresponding actions as an embodied agent. Recent work integrates future images into the understanding-acting loop, yielding unified VLAs that jointly understand, generate, and act -- reading text and images and producing future images and actions. However, these models either rely on external experts for modality unification or treat image generation and action prediction as separate processes, limiting the benefits of direct synergy between these tasks. Our core philosophy is to optimize generation and action jointly through a synchronous denoising process, where the iterative refinement enables actions to evolve from initialization, under constant and sufficient visual guidance. We ground this philosophy in our proposed Unified Diffusion VLA and Joint Discrete Denoising Diffusion Process (JD3P), which is a joint diffusion process that integrates multiple modalities into a single denoising trajectory to serve as the key mechanism enabling understanding, generation, and acting to be intrinsically synergistic. Our model and theory are built on a unified tokenized space of all modalities and a hybrid attention mechanism. We further propose a two-stage training pipeline and several inference-time techniques that optimize performance and efficiency. Our approach achieves state-of-the-art performance on benchmarks such as CALVIN, LIBERO, and SimplerEnv with 4$\times$ faster inference than autoregressive methods, and we demonstrate its effectiveness through in-depth analysis and real-world evaluations. Our project page is available at https://irpn-eai.github.io/UD-VLA.github.io/.
Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution
In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast and smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 can roll out fast and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io
comment: Project page: https://xiaomi-robotics-0.github.io
Instrument-Splatting++: Towards Controllable Surgical Instrument Digital Twin Using Gaussian Splatting
High-quality and controllable digital twins of surgical instruments are critical for Real2Sim in robot-assisted surgery, as they enable realistic simulation, synthetic data generation, and perception learning under novel poses. We present Instrument-Splatting++, a monocular 3D Gaussian Splatting (3DGS) framework that reconstructs surgical instruments as a fully controllable Gaussian asset with high fidelity. Our pipeline starts with part-wise geometry pretraining that injects CAD priors into Gaussian primitives and equips the representation with part-aware semantic rendering. Built on the pretrained model, we propose a semantics-aware pose estimation and tracking (SAPET) method to recover per-frame 6-DoF pose and joint angles from unposed endoscopic videos, where a gripper-tip network trained purely from synthetic semantics provides robust supervision and a loose regularization suppresses singular articulations. Finally, we introduce Robust Texture Learning (RTL), which alternates pose refinement and robust appearance optimization, mitigating pose noise during texture learning. The proposed framework can perform pose estimation and learn realistic texture from unposed videos. We validate our method on sequences extracted from EndoVis17/18, SAR-RARP, and an in-house dataset, showing superior photometric quality and improved geometric accuracy over state-of-the-art baselines. We further demonstrate a downstream keypoint detection task where unseen-pose data augmentation from our controllable instrument Gaussian improves performance.
comment: 10 pages, 9 figures
Memory-Augmented Potential Field Theory: A Framework for Adaptive Control in Non-Convex Domains NeurIPS 2025
Stochastic optimal control methods often struggle in complex non-convex landscapes, frequently becoming trapped in local optima due to their inability to learn from historical trajectory data. This paper introduces Memory-Augmented Potential Field Theory, a unified mathematical framework that integrates historical experience into stochastic optimal control. Our approach dynamically constructs memory-based potential fields that identify and encode key topological features of the state space, enabling controllers to automatically learn from past experiences and adapt their optimization strategy. We provide a theoretical analysis showing that memory-augmented potential fields possess non-convex escape properties, asymptotic convergence characteristics, and computational efficiency. We implement this theoretical framework in a Memory-Augmented Model Predictive Path Integral (MPPI) controller that demonstrates significantly improved performance in challenging non-convex environments. The framework represents a generalizable approach to experience-based learning within control systems (especially robotic dynamics), enhancing their ability to navigate complex state spaces without requiring specialized domain knowledge or extensive offline training.
comment: Accepted by NeurIPS 2025
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition CVPR 2026
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations. Project page available at: https://seokminlee-chris.github.io/CroBo-ProjectPage.
comment: Accepted to CVPR 2026 Workshop: Pixel-level Video Understanding in the Wild
MiniBEE: A New Form Factor for Compact Bimanual Dexterity
Bimanual robot manipulators can achieve impressive dexterity, but typically rely on two full six- or seven-degree-of-freedom arms so that paired grippers can coordinate effectively. This traditional framework increases system complexity while only exploiting a fraction of the overall workspace for dexterous interaction. We introduce the MiniBEE (Miniature Bimanual End-effector), a compact system in which two reduced-mobility arms (3+ DOF each) are coupled into a kinematic chain that preserves full relative positioning between grippers. To guide our design, we formulate a kinematic dexterity metric that enlarges the dexterous workspace while keeping the mechanism lightweight and wearable. The resulting system supports two complementary modes: (i) wearable kinesthetic data collection with self-tracked gripper poses, and (ii) deployment on a standard robot arm, extending dexterity across its entire workspace. We present kinematic analysis and design optimization methods for maximizing dexterous range, and demonstrate an end-to-end pipeline in which wearable demonstrations train imitation learning policies that perform robust, real-world bimanual manipulation.
HortiMulti: A Multi-Sensor Dataset for Localisation and Mapping in Horticultural Polytunnels
Agricultural robotics is gaining increasing relevance in both research and real-world deployment. As these systems are expected to operate autonomously in more complex tasks, the availability of representative real-world datasets becomes essential. While domains such as urban and forestry robotics benefit from large and established benchmarks, horticultural environments remain comparatively under-explored despite the economic significance of this sector. To address this gap, we present HortiMulti, a multimodal, cross-season dataset collected in commercial strawberry and raspberry polytunnels across an entire growing season, capturing substantial appearance variation, dynamic foliage, specular reflections from plastic covers, severe perceptual aliasing, and GNSS-unreliable conditions, all of which directly degrade existing localisation and perception algorithms. The sensor suite includes two 3D LiDARs, four RGB cameras, an IMU, GNSS, and wheel odometry. Ground truth trajectories are derived from a combination of Total Station surveying, AprilTag fiducial markers, and LiDAR-inertial odometry, spanning dense, sparse, and marker-free coverage to support evaluation under both controlled and realistic conditions. We release time-synchronised raw measurements, calibration files, reference trajectories, and baseline benchmarks for visual, LiDAR, and multi-sensor SLAM, with results confirming that current state-of-the-art methods remain inadequate for reliable polytunnel deployment, establishing HortiMulti as a one-stop resource for developing and testing robotic perception systems in horticulture environments.
KINESIS: Motion Imitation for Human Musculoskeletal Locomotion ICRA
How do humans move? Advances in reinforcement learning (RL) have produced impressive results in capturing human motion using physics-based humanoid control. However, torque-controlled humanoids fail to model key aspects of human motor control such as biomechanical joint constraints & non-linear and overactuated musculotendon control. We present KINESIS, a model-free motion imitation framework that tackles these challenges. KINESIS is trained on 1.8 hours of locomotion data and achieves strong motion imitation performance on unseen trajectories. Through a negative mining approach, KINESIS learns robust locomotion priors that we leverage to deploy the policy on several downstream tasks such as text-to-control, target point reaching, and football penalty kicks. Importantly, KINESIS learns to generate muscle activity patterns that correlate well with human EMG activity. We show that these results scale seamlessly across biomechanical model complexity, demonstrating control of up to 290 muscles. Overall, the physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control. Code, videos and benchmarks are available at https://github.com/amathislab/Kinesis.
comment: Accepted to ICRA. Here we include an appendix
The Role of Consequential and Functional Sound in Human-Robot Interaction: Toward Audio Augmented Reality Interfaces
Robot sound, encompassing both consequential operational noise and intentionally designed auditory cues, plays an important role in human-robot interaction (HRI). Developing a deeper understanding of how robot sounds influence human experience, and how technologies such as augmented reality (AR) modulate these effects, can enable the design of more socially acceptable robots and more effective, intuitive human-robot interfaces. In this work, we present a three-part mixed-methods study (N = 51) that investigates (i) the effects of consequential robot sounds on human perception under varying degrees of physical colocation, (ii) human accuracy in localizing spatial audio cues delivered via augmented reality, and (iii) the use of augmented spatial audio cues for functional and transformative communication during collaborative handover tasks, in comparison to non-AR sound designs. Contrary to prior findings, our results indicate that the consequential sounds of a Kinova Gen3 manipulator did not negatively affect participants' perceptions of the robot. Participants demonstrated high accuracy in localizing lateral spatial cues, whereas frontal cues proved more challenging, delineating conditions under which spatial auditory feedback is most effective. Qualitative findings further reveal that augmented spatial audio cues can simultaneously convey task-relevant information while fostering a sense of warmth and reducing user discomfort during interaction. Together, these findings elucidate the perceptual effects of consequential robot sound and position sound, particularly augmented spatial audio, as a meaningful yet underutilized design resource for human-robot interaction.
comment: 29 pages, 11 figures
MIGHTY: Hermite Spline-based Efficient Trajectory Planning
Hard-constraint trajectory planners often rely on commercial solvers and demand substantial computational resources. Existing soft-constraint methods achieve faster computation, but either (1) decouple spatial and temporal optimization or (2) restrict the search space. To overcome these limitations, we introduce MIGHTY, a Hermite spline-based planner that performs spatiotemporal optimization while fully leveraging the continuous search space of a spline. In simulation, MIGHTY achieves a 9.3% reduction in computation time and a 13.1% reduction in travel time over state-of-the-art baselines, with a 100% success rate. In hardware, MIGHTY completes multiple high-speed flights up to 6.7 m/s in a cluttered static environment and long-duration flights with dynamically added obstacles.
comment: 10 pages, 12 figures
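To make the Hermite-spline parameterisation in the abstract above concrete, here is a minimal sketch of evaluating one spline segment. The cubic order is an assumption made for brevity (the abstract does not state which order MIGHTY uses), and `hermite_segment` is a hypothetical helper, not part of the authors' code. The point it illustrates is that endpoint positions, endpoint velocities, and the segment duration `T` all appear as free parameters, which is what lets a planner optimise space and time together.

```python
def hermite_segment(p0, v0, p1, v1, T, t):
    """Position and velocity of a cubic Hermite segment at time t in [0, T].

    p0, p1: endpoint positions; v0, v1: endpoint velocities;
    T: segment duration (a temporal decision variable).
    """
    s = t / T  # normalised time in [0, 1]
    # Cubic Hermite basis functions.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    pos = h00 * p0 + h10 * T * v0 + h01 * p1 + h11 * T * v1
    # Basis derivatives with respect to s; the chain rule adds a 1/T factor.
    dh00 = 6 * s**2 - 6 * s
    dh10 = 3 * s**2 - 4 * s + 1
    dh01 = -6 * s**2 + 6 * s
    dh11 = 3 * s**2 - 2 * s
    vel = (dh00 * p0 + dh01 * p1) / T + dh10 * v0 + dh11 * v1
    return pos, vel


# Same waypoints, two different durations: shortening T changes the
# velocity profile without moving the endpoints.
for T in (2.0, 1.0):
    print(T, hermite_segment(0.0, 0.5, 3.0, 0.0, T, T / 2))
```

Because the endpoint conditions are met by construction, continuity between consecutive segments follows from sharing the waypoint position and velocity.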
Multiagent Systems
The Specification Gap: Coordination Failure Under Partial Knowledge in Code Agents
When multiple LLM-based code agents independently implement parts of the same class, they must agree on shared internal representations, even when the specification leaves those choices implicit. We study this coordination problem across 51 class-generation tasks, progressively stripping specification detail from full docstrings (L0) to bare signatures (L3), and introducing opposing structural biases (lists vs. dictionaries) to stress-test integration. Three findings emerge. First, a persistent specification gap: two-agent integration accuracy drops from 58% to 25% as detail is removed, while a single-agent baseline degrades more gracefully (89% to 56%), leaving a 25--39 pp coordination gap that is consistent across two Claude models (Sonnet, Haiku) and three independent runs. Second, an AST-based conflict detector achieves 97% precision at the weakest specification level without additional LLM calls, yet a factorial recovery experiment shows that restoring the full specification alone recovers the single-agent ceiling (89%), while providing conflict reports adds no measurable benefit. Third, decomposing the gap into coordination cost (+16 pp) and information asymmetry (+11 pp) suggests that the two effects are independent and approximately additive. The gap is not merely a consequence of hidden information, but reflects the difficulty of producing compatible code without shared decisions. These results support a specification-first view of multi-agent code generation: richer specifications are both the primary coordination mechanism and the sufficient recovery instrument.
The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More
Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 78% cheaper than GPT-5.2's, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall's $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
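A minimal sketch of how a price reversal can arise, assuming prices are quoted per million tokens and thinking tokens are billed at the output rate. The two models, their prices, and all token counts below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one query (hypothetical numbers)."""
    input_tokens: int
    thinking_tokens: int
    answer_tokens: int


def actual_cost(price_in: float, price_out: float, usage: Usage) -> float:
    """Dollar cost of one query.

    Assumes prices are per million tokens and that thinking tokens are
    billed at the output rate (an assumption for illustration).
    """
    billed_output = usage.thinking_tokens + usage.answer_tokens
    return (usage.input_tokens * price_in + billed_output * price_out) / 1_000_000


# Hypothetical "cheap" model: 4x lower listed prices, but 10x the thinking
# tokens (i.e. 900% more) on the same query.
cheap = actual_cost(0.5, 3.0, Usage(1_000, 20_000, 500))
pricey = actual_cost(2.0, 12.0, Usage(1_000, 2_000, 500))

# The model with the lower listed price ends up costing more on this query.
reversal = cheap > pricey
print(f"cheap=${cheap:.3f} pricey=${pricey:.3f} reversal={reversal}")
```

In this toy setting the thinking-token term dominates the bill, which is the mechanism the abstract identifies as the root cause of reversals.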
Self-Evolving Multi-Agent Framework for Efficient Decision Making in Real-Time Strategy Scenarios SC
Large language models (LLMs) have demonstrated exceptional potential in complex reasoning, pioneering a new paradigm for autonomous agent decision making in dynamic settings. However, in Real-Time Strategy (RTS) scenarios, LLMs suffer from a critical speed-quality trade-off. Specifically, expansive state spaces and time limits render inference delays prohibitive, while stochastic planning errors undermine logical consistency. To address these challenges, we present SEMA (Self-Evolving Multi-Agent), a novel framework designed for high-performance, low-latency decision-making in RTS environments. This collaborative multi-agent framework facilitates self-evolution by adaptively calibrating model bias through in-episode assessment and cross-episode analysis. We further incorporate dynamic observation pruning based on structural entropy to model game states topologically. By distilling high-dimensional data into core semantic information, this approach significantly reduces inference time. We also develop a hybrid knowledge-memory mechanism that integrates micro-trajectories, macro-experience, and hierarchical domain knowledge, thereby enhancing both strategic adaptability and decision consistency. Experiments across multiple StarCraft II maps demonstrate that SEMA achieves superior win rates while reducing average decision latency by over 50%, validating its efficiency and robustness in complex RTS scenarios.
comment: 17 pages, 6 figures. Submitted to SCIS (Science China Information Science)
SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems ICLR 2024
Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems through uncertainty-weighted linear opinion pooling. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9%. Despite these gains, SCoOP introduces only microsecond-level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems.
comment: Accepted to ICLR 2024 Workshop on Agentic AI in the Wild: From Hallucinations to Reliable Autonomy
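The SCoOP abstract above describes uncertainty-weighted linear opinion pooling without giving the weighting rule. The sketch below shows one generic way to do it, assuming each model outputs a probability distribution over the same answer options and that weights decay with each model's Shannon entropy. The `exp(-entropy)` weighting and the function names are illustrative assumptions, not the paper's method.

```python
import math
from typing import List, Tuple


def entropy(p: List[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def pool(dists: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Linear opinion pool with entropy-based weights.

    Assumption: weight_i is proportional to exp(-H(p_i)), so more confident
    models count more. Returns (pooled distribution, normalized weights).
    """
    raw = [math.exp(-entropy(p)) for p in dists]
    total = sum(raw)
    weights = [w / total for w in raw]
    n_options = len(dists[0])
    pooled = [sum(w * p[j] for w, p in zip(weights, dists)) for j in range(n_options)]
    return pooled, weights


# Three hypothetical models answering a 3-option question.
dists = [[0.90, 0.05, 0.05], [0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]
pooled, weights = pool(dists)

# Entropy of the pooled distribution serves as a system-level uncertainty score.
system_uncertainty = entropy(pooled)
```

A downstream policy could abstain whenever `system_uncertainty` exceeds a chosen threshold; the threshold itself would have to be tuned on held-out data.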
The Free-Market Algorithm: Self-Organizing Optimization for Open-Ended Complex Systems
We introduce the Free-Market Algorithm (FMA), a novel metaheuristic inspired by free-market economics. Unlike Genetic Algorithms, Particle Swarm Optimization, and Simulated Annealing -- which require prescribed fitness functions and fixed search spaces -- FMA uses distributed supply-and-demand dynamics where fitness is emergent, the search space is open-ended, and solutions take the form of hierarchical pathway networks. Autonomous agents discover rules, trade goods, open and close firms, and compete for demand with no centralized controller. FMA operates through a three-layer architecture: a universal market mechanism (supply, demand, competition, selection), pluggable domain-specific behavioral rules, and domain-specific observation. The market mechanism is identical across applications; only the behavioral rules change. FMA is validated in two unrelated domains. In prebiotic chemistry, starting from 900 bare atoms (C, H, O, N), FMA discovers all 12 feasible amino acid formulas, all 5 nucleobases, the formose sugar chain, and Krebs cycle intermediates in under 5 minutes on a laptop -- with up to 240 independent synthesis routes per product. In macroeconomic forecasting, reading a single input-output table with zero estimated parameters, FMA achieves Mean Absolute Error of 0.42 percentage points for non-crisis GDP prediction, comparable to professional forecasters, portable to 33 countries. Assembly Theory alignment shows that FMA provides the first explicit, tunable mechanism for the selection signatures described by Sharma et al. (Nature, 2023). The event-driven assembly dynamics resonate with foundational programs in physics -- causal set theory, relational quantum mechanics, constructor theory -- suggesting that Darwinian market dynamics may reflect a deeper organizational principle that leads to the unfolding of Nature itself.
comment: 26 pages, 3 figures, 2 tables, draft
Relaxing Constraints in Anonymous Multi Agent Path Finding for Large Agents
The study addressed the problem of Anonymous Multi-Agent Path-finding (AMAPF). Unlike the classical formulation, where the assignment of agents to goals is fixed, in the anonymous MAPF setting it is irrelevant which agent reaches a specific goal, provided that all goals are occupied. Most existing multi-agent pathfinding algorithms rely on a discrete representation of the environment (e.g., square grids) and do not account for the sizes of agents. This limits their applicability in real-world scenarios, such as trajectory planning for mobile robots in warehouses. Conversely, methods operating in continuous space typically impose substantial restrictions on the input data, such as constraints on the distances between initial and goal positions or between start/goal positions and obstacles. In this work, we considered one of the AMAPF algorithms designed for continuous space, where agents are modeled as disks of equal size. The algorithm requires a strict minimum separation of $4$ agent radii between any start/goal positions. We proposed a modification that relaxes the constraints and reduces this limit from $4$ to $2\sqrt{3}$. We theoretically demonstrated that the proposed enhancements preserve the original theoretical properties, including the guarantee that all agents will eventually achieve their goals safely and without collisions.
comment: 14 pages, 6 figures
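To make the separation requirement in the AMAPF abstract above concrete, the following sketch checks whether a set of start and goal positions satisfies a minimum pairwise distance of `factor` agent radii. It assumes the requirement applies to every pair drawn from the union of starts and goals and is read as a non-strict bound; the function name and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


def satisfies_separation(points: List[Point], radius: float, factor: float) -> bool:
    """True if every pair of positions is at least factor * radius apart.

    Assumption: `points` is the union of all start and goal positions.
    """
    min_dist = factor * radius
    return all(math.dist(a, b) >= min_dist for a, b in combinations(points, 2))


# Hypothetical instance with unit-radius agents; the closest pair is 3.5 apart.
positions = [(0.0, 0.0), (3.5, 0.0), (0.0, 5.0), (7.5, 0.0)]

original_ok = satisfies_separation(positions, radius=1.0, factor=4.0)
relaxed_ok = satisfies_separation(positions, radius=1.0, factor=2.0 * math.sqrt(3.0))
print(original_ok, relaxed_ok)
```

Since $2\sqrt{3} \approx 3.46$, this instance is rejected under the original bound of 4 radii but admitted under the relaxed one.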
Context-Mediated Domain Adaptation in Multi-Agent Sensemaking Systems
Domain experts possess tacit knowledge that they cannot easily articulate through explicit specifications. When experts modify AI-generated artifacts by correcting terminology, restructuring arguments, and adjusting emphasis, these edits reveal domain understanding that remains latent in traditional prompt-based interactions. Current systems treat such modifications as endpoint corrections rather than as implicit specifications that could reshape subsequent reasoning. We propose context-mediated domain adaptation, a paradigm where user modifications to system-generated artifacts serve as implicit domain specification that reshapes LLM-powered multi-agent reasoning behavior. Through our system Seedentia, a web-based multi-agent framework for sense-making, we demonstrate bidirectional semantic links between generated artifacts and system reasoning. Our approach enables specification bootstrapping where vague initial prompts evolve into precise domain specifications through iterative human-AI collaboration, implicit knowledge transfer through reverse-engineered user edits, and in-context learning where agent behavior adapts based on observed correction patterns. We present results from an evaluation with domain experts who generated and modified research questions from academic papers. Our system extracted 46 domain knowledge entries from user modifications, demonstrating the feasibility of capturing implicit expertise through edit patterns, though the limited sample size constrains conclusions about systematic quality improvements.
SentinelAI: A Multi-Agent Framework for Structuring and Linking NG9-1-1 Emergency Incident Data
Emergency response systems generate data from many agencies and systems. In practice, correlating and updating this information across sources in a way that aligns with Next Generation 9-1-1 data standards remains challenging. Ideally, this data should be treated as a continuous stream of operational updates, where new facts are integrated immediately to provide a timely and unified view of an evolving incident. This paper presents SentinelAI, a data integration and standardization framework for transforming emergency communications into standardized, machine-readable datasets that support integration, composite incident construction, and cross-source reasoning. SentinelAI implements a scalable processing pipeline composed of specialized agents. The EIDO Agent ingests raw communications and produces NENA-compliant Emergency Incident Data Object JSON.
comment: 10 pages, 5 figures
Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach
The emergence of large language model agents capable of invoking external tools has created an urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. While both enable dynamic service discovery through schema descriptions, their formal relationship remains unexplored. Building on prior work establishing the conceptual convergence of these paradigms, we present the first process calculus formalization of SGD and MCP, proving they are structurally bisimilar under a well-defined mapping Phi. However, we demonstrate that the reverse mapping Phi^{-1} is partial and lossy, revealing critical gaps in MCP's expressivity. Through bidirectional analysis, we identify five principles -- semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration -- as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as type-system extensions MCP+, proving MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.
comment: 18 pages. Companion to arXiv:2602.18764
Trust as Monitoring: Evolutionary Dynamics of User Trust and AI Developer Behaviour
AI safety is an increasingly urgent concern as the capabilities and adoption of AI systems grow. Existing evolutionary models of AI governance have primarily examined incentives for safe development and effective regulation, typically representing users' trust as a one-shot adoption choice rather than as a dynamic, evolving process shaped by repeated interactions. We instead model trust as reduced monitoring in a repeated, asymmetric interaction between users and AI developers, where checking AI behaviour is costly. Using evolutionary game theory, we study how user trust strategies and developer choices between safe (compliant) and unsafe (non-compliant) AI co-evolve under different levels of monitoring cost and institutional regimes. We complement the infinite-population replicator analysis with stochastic finite-population dynamics and reinforcement learning (Q-learning) simulations. Across these approaches, we find three robust long-run regimes: no adoption with unsafe development, unsafe but widely adopted systems, and safe systems that are widely adopted. Only the last is desirable, and it arises when penalties for unsafe behaviour exceed the extra cost of safety and users can still afford to monitor at least occasionally. Our results formally support governance proposals that emphasise transparency, low-cost monitoring, and meaningful sanctions, and they show that neither regulation alone nor blind user trust is sufficient to prevent evolutionary drift towards unsafe or low-adoption outcomes.
Decentralized Task Scheduling in Distributed Systems: A Deep Reinforcement Learning Approach
Efficient task scheduling in large-scale distributed systems presents significant challenges due to dynamic workloads, heterogeneous resources, and competing quality-of-service requirements. Traditional centralized approaches face scalability limitations and single points of failure, while classical heuristics lack adaptability to changing conditions. This paper proposes a decentralized multi-agent deep reinforcement learning (DRL-MADRL) framework for task scheduling in heterogeneous distributed systems. We formulate the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and develop a lightweight actor-critic architecture implemented using only NumPy, enabling deployment on resource-constrained edge devices without heavyweight machine learning frameworks. Using workload characteristics derived from the publicly available Google Cluster Trace dataset, we evaluate our approach on a 100-node heterogeneous system processing 1,000 tasks per episode over 30 experimental runs. Experimental results demonstrate 15.6% improvement in average task completion time (30.8s vs 36.5s for random baseline), 15.2% energy efficiency gain (745.2 kWh vs 878.3 kWh), and 82.3% SLA satisfaction compared to 75.5% for baselines, with all improvements statistically significant (p < 0.001). The lightweight implementation requires only NumPy, Matplotlib, and SciPy. Complete source code and experimental data are provided for full reproducibility at https://github.com/danielbenniah/marl-distributed-scheduling.
comment: 12 pages, 8 figures. Under review. Code available at GitHub
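The percentages quoted in the scheduling abstract above can be reproduced from the raw figures it reports. The short check below assumes each improvement is a relative reduction with respect to the random baseline; the helper name is hypothetical.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percent reduction of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


time_gain = relative_reduction(36.5, 30.8)      # average completion time, seconds
energy_gain = relative_reduction(878.3, 745.2)  # energy, kWh
sla_gain_pp = 82.3 - 75.5                       # SLA satisfaction, percentage points

print(f"{time_gain:.1f}% {energy_gain:.1f}% {sla_gain_pp:.1f} pp")
# prints "15.6% 15.2% 6.8 pp"
```

The first two values match the 15.6% and 15.2% stated in the abstract; the SLA difference is an absolute gap of 6.8 percentage points rather than a relative gain.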
Sketch2Simulation: Automating Flowsheet Generation via Multi Agent Large Language Models
Converting process sketches into executable simulation models remains a major bottleneck in process systems engineering, requiring substantial manual effort and simulator-specific expertise. Recent advances in generative AI have improved both engineering-diagram interpretation and LLM-assisted flowsheet generation, but these remain largely disconnected: diagram-understanding methods often stop at extracted graphs, while text-to-simulation workflows assume structured inputs rather than raw visual artifacts. To bridge this gap, we present an end-to-end multi-agent large language model system that converts process diagrams directly into executable Aspen HYSYS flowsheets. The framework decomposes the task into three coordinated layers: diagram parsing and interpretation, simulation model synthesis, and multi-level validation. Specialized agents handle visual interpretation, graph-based intermediate representation construction, code generation for the HYSYS COM interface, execution, and structural verification. We evaluate the framework on four chemical engineering case studies of increasing complexity, from a simple desalting process to an industrial aromatic production flowsheet with multiple recycle loops. The system produces executable HYSYS models in all cases, achieving complete structural fidelity on the two simpler cases and strong performance on the more complex ones, with connection consistency above 0.93 and stream consistency above 0.96. These results demonstrate a viable end-to-end sketch-to-simulation workflow while highlighting remaining challenges in dense recycle structures, implicit diagram semantics, and simulator-interface constraints.
comment: 27 pages, 14 figures, 8 tables
Agent Contracts: A Formal Framework for Resource-Bounded Autonomous AI Systems
The Contract Net Protocol (1980) introduced coordination through contracts in multi-agent systems. Modern agent protocols standardize connectivity and interoperability; yet, none provide formal, resource governance-normative mechanisms to bound how much agents may consume or how long they may operate. We introduce Agent Contracts, a formal framework that extends the contract metaphor from task allocation to resource-bounded execution. An Agent Contract unifies input/output specifications, multi-dimensional resource constraints, temporal boundaries, and success criteria into a coherent governance mechanism with explicit lifecycle semantics. For multi-agent coordination, we establish conservation laws ensuring delegated budgets respect parent constraints, enabling hierarchical coordination through contract delegation. Empirical validation across four experiments demonstrates 90% token reduction with 525x lower variance in iterative workflows, zero conservation violations in multi-agent delegation, and measurable quality-resource tradeoffs through contract modes. Agent Contracts provide formal foundations for predictable, auditable, and resource-bounded autonomous AI deployment.
comment: v3: Minor fixes and workshop acceptance indication
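The conservation laws mentioned in the Agent Contracts abstract above require delegated budgets to respect the parent's constraints. A minimal sketch of such a check follows, assuming budgets are per-resource numeric limits and that conservation means the children's budgets sum to at most the parent's in every dimension. The `Budget` type, the resource names, and this exact rule are illustrative assumptions, not the framework's API.

```python
from typing import Dict, List

Budget = Dict[str, float]  # hypothetical: resource name -> limit


def conserves(parent: Budget, children: List[Budget]) -> bool:
    """True if the delegated child budgets fit inside the parent budget.

    Assumed rule: for each resource, the children's limits sum to at most
    the parent's limit, and no child receives a resource the parent lacks.
    """
    for resource, limit in parent.items():
        if sum(child.get(resource, 0.0) for child in children) > limit:
            return False
    for child in children:
        for resource, amount in child.items():
            if resource not in parent and amount > 0:
                return False
    return True


parent = {"tokens": 10_000, "seconds": 60}
valid = conserves(parent, [{"tokens": 6_000, "seconds": 20}, {"tokens": 4_000, "seconds": 30}])
overspent = conserves(parent, [{"tokens": 6_000}, {"tokens": 5_000}])
ungranted = conserves(parent, [{"api_calls": 1}])
print(valid, overspent, ungranted)
```

Running the check at delegation time, before any child starts work, is one way a coordinator could aim for zero conservation violations.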
Is AI Ready for Multimodal Hate Speech Detection? A Comprehensive Dataset and Benchmark Evaluation
Hate speech online targets individuals or groups based on identity attributes and spreads rapidly, posing serious social risks. Memes, which combine images and text, have emerged as a nuanced vehicle for disseminating hate speech, often relying on cultural knowledge for interpretation. However, existing multimodal hate speech datasets suffer from coarse-grained labeling and a lack of integration with surrounding discourse, leading to imprecise and incomplete assessments. To bridge this gap, we propose an agentic annotation framework that coordinates seven specialized agents to generate hierarchical labels and rationales. Based on this framework, we construct M^3 (Multi-platform, Multi-lingual, and Multimodal Meme), a dataset of 2,455 memes collected from X, 4chan, and Weibo, featuring fine-grained hate labels and human-verified rationales. Benchmarking state-of-the-art Multimodal Large Language Models reveals that these models struggle to effectively utilize surrounding post context, which often fails to improve or even degrades detection performance. Our finding highlights the challenges these models face in reasoning over memes embedded in real-world discourse and underscores the need for a context-aware multimodal architecture. Our dataset and code are available at https://github.com/mira-ai-lab/M3.
Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning ICAPS 2026
Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.
comment: 26 pages. Accepted at ICAPS 2026
Dominated Actions in Imperfect-Information Games
Dominance is a fundamental concept in game theory. In normal-form games dominated strategies can be identified in polynomial time. As a consequence, iterative removal of dominated strategies can be performed efficiently as a preprocessing step for reducing the size of a game before computing a Nash equilibrium. For imperfect-information games in extensive form, we could convert the game to normal form and then iteratively remove dominated strategies in the same way; however, this conversion may cause an exponential blowup in game size. In this paper we define and study the concept of dominated actions in imperfect-information games. Our main result is a polynomial-time algorithm for determining whether an action is dominated (strictly or weakly) by any mixed strategy in two-player perfect-recall games with publicly observable actions, which can be extended to iteratively remove dominated actions. This allows us to efficiently reduce the size of the game tree as a preprocessing step for Nash equilibrium computation. We explore the role of dominated actions empirically in "All In or Fold" No-Limit Texas Hold'em poker.
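As background for the normal-form preprocessing step mentioned above, here is a minimal sketch of iterated removal of strictly dominated strategies in a two-player bimatrix game. It only tests domination by pure strategies; detecting domination by mixed strategies needs a linear program, and the paper's algorithm for actions in extensive-form games is not reproduced here. The function name and the example payoffs are illustrative.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def iterated_elimination(row_payoff: Matrix, col_payoff: Matrix) -> Tuple[List[int], List[int]]:
    """Iteratively remove pure strategies strictly dominated by another pure strategy.

    row_payoff[r][c] and col_payoff[r][c] are the payoffs to the row and
    column player. Returns the surviving row and column indices.
    """
    rows = list(range(len(row_payoff)))
    cols = list(range(len(row_payoff[0])))
    changed = True
    while changed:
        changed = False
        # A row is dominated if some other row is strictly better against
        # every remaining column.
        for r in rows:
            if any(all(row_payoff[r2][c] > row_payoff[r][c] for c in cols)
                   for r2 in rows if r2 != r):
                rows.remove(r)
                changed = True
                break
        # Same test for the column player against every remaining row.
        for c in cols:
            if any(all(col_payoff[r][c2] > col_payoff[r][c] for r in rows)
                   for c2 in cols if c2 != c):
                cols.remove(c)
                changed = True
                break
    return rows, cols


# Prisoner's dilemma: index 0 = cooperate, index 1 = defect.
row_payoff = [[3, 0], [5, 1]]
col_payoff = [[3, 5], [0, 1]]
survivors = iterated_elimination(row_payoff, col_payoff)
print(survivors)  # prints ([1], [1])
```

Each pass costs time polynomial in the matrix size, which is the property that is lost when an extensive-form game is first expanded to normal form.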
Evolutionarily Stable Stackelberg Equilibrium
We present a new solution concept called evolutionarily stable Stackelberg equilibrium (SESS). We study the Stackelberg evolutionary game setting in which there is a single leading player and a symmetric population of followers. The leader selects an optimal mixed strategy, anticipating that the follower population plays an evolutionarily stable strategy (ESS) in the induced subgame and may satisfy additional ecological conditions. We consider both leader-optimal and follower-optimal selection among ESSs, which arise as special cases of our framework. Prior approaches to Stackelberg evolutionary games either define the follower response via evolutionary dynamics or assume rational best-response behavior, without explicitly enforcing stability against invasion by mutations. We present algorithms for computing SESS in discrete and continuous games, and validate the latter empirically. Our model applies naturally to biological settings; for example, in cancer treatment the leader represents the physician and the followers correspond to competing cancer cell phenotypes.
SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication AAAI-2026
LLM-based multi-agent systems exhibit strong collaborative capabilities but often suffer from redundant communication and excessive token overhead. Existing methods typically enhance efficiency through pretrained GNNs or greedy algorithms, but often isolate pre- and post-task optimization, lacking a unified strategy. To this end, we present SafeSieve, a progressive and adaptive multi-agent pruning algorithm that dynamically refines the inter-agent communication through a novel dual-mechanism. SafeSieve integrates initial LLM-based semantic evaluation with accumulated performance feedback, enabling a smooth transition from heuristic initialization to experience-driven refinement. Unlike existing greedy Top-k pruning methods, SafeSieve employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links. Experiments across benchmarks (SVAMP, HumanEval, etc.) showcase that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%. Results further demonstrate robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, SafeSieve reduces deployment costs by 13.3% while maintaining performance. These results establish SafeSieve as an efficient, GPU-free, and scalable framework for practical multi-agent systems. Our code can be found here: https://github.com/csgen/SafeSieve
comment: AAAI-2026 poster; 7 pages for main content, 5 figures, 4 tables
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures. We propose Team-of-Thoughts, a heterogeneous MAS framework that treats diverse models as specialized tools within an orchestrator-driven paradigm. Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own domain-specific strengths to guide selection. At inference, the orchestrator dynamically activates the most compatible agents based on these profiles to maximize capability coverage. Across five mathematical reasoning and code generation benchmarks, Team-of-Thoughts consistently outperforms individual models and existing MAS baselines. Notably, on AIME24 and LiveCodeBench, Team-of-Thoughts achieves 96.00% and 77.91% accuracy, respectively, significantly improving over homogeneous role-play baselines (80.00% and 65.93%).
comment: 8 pages
Systems and Control (EESS)
Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning
Designing effective auxiliary rewards for cooperative multi-agent systems remains a precarious task; misaligned incentives risk inducing suboptimal coordination, especially where sparse task feedback fails to provide sufficient grounding. This study introduces an automated reward design framework that leverages large language models to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and evaluates their efficacy by training policies from scratch under a fixed computational budget; selection depends exclusively on the sparse task return. The framework is evaluated across four distinct Overcooked-AI layouts characterized by varied corridor congestion, handoff dependencies, and structural asymmetries. Iterative search generations consistently yield superior task returns and delivery counts, with the most pronounced gains occurring in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components indicates increased interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the search for objective-grounded reward programs can mitigate the burden of manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.
Graph-Theoretic Analysis of Residual Generation Under Computational Constraints
A unified structural framework is presented for model-based fault diagnosis that explicitly incorporates both fault locations and constraints imposed by the residual generation methodology. Building on the concepts of proper and minimal structurally overdetermined (PSO/MSO) sets and Test Equation Supports (TES/MTES), the framework introduces testable PSO sets, Residual Generation (RG) sets, irreducible fault signatures (IFS), and Irreducible RG (IRG) sets to characterize which submodels are suitable for residual generation under given computational restrictions. An operator $M^*$ is defined to extract, from any model, the largest testable PSO subset consistent with a specified residual generation method. Using this operator, an algorithm is developed to compute all RG sets, and it is shown that irreducible fault signature sets form the join-irreducible elements of a join-semilattice of sets and fully capture the multiple-fault isolability properties in the method-constrained setting. The approach is exemplified on a semi-explicit linear DAE model, where low structural differential index can be used to define $M^*$. The results demonstrate that the proposed framework generalizes MTES-based analysis to residual generation scenarios with explicit computational limitations.
Spatial Correlation, Non-Stationarity, and Degrees of Freedom of Holographic Curvature-Reconfigurable Apertures
Low-altitude wireless platforms increasingly require lightweight, conformal, and densely sampled antenna array apertures with high array gain and spatial selectivity. However, when deployed on nonplanar surfaces, curvature alters the array manifold, local visibility, and propagation support, potentially invalidating spatial-stationarity assumptions. In this paper, we investigate a holographic curvature-reconfigurable aperture (HoloCuRA), modeled as a curvature-controllable holographic surface, and develop a visibility-aware spatial characterization framework for its low-altitude applications. Specifically, the framework jointly quantifies array-domain spatial non-stationarity (SnS), and spatial degrees of freedom (DoF) in line-of-sight, 3GPP non-line-of-sight, and isotropic-scattering propagation environments. For SnS, a novel Power-balanced, Visibility-aware Correlation-Matrix Distance (PoVi-CMD) and a two-stage subarray-screening procedure are introduced. For DoF, the Rényi-2 effective rank is adopted, and tractable spatial-correlation expressions under isotropic scattering are developed for efficient DoF analysis. Furthermore, a realizable antenna port mode is introduced to connect SnS with DoF. Numerical results reveal that curvature and propagation support are the primary determinants of both SnS and DoF in HoloCuRA: array domain SnS determines whether subarray statistics can be treated as locally consistent, whereas DoF limits the global spatial modes. The findings provide useful guidance for low-altitude antenna-system design.
comment: 16 pages, 14 figures
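The HoloCuRA abstract above adopts the Rényi-2 effective rank as its DoF measure. Under the usual definition, with normalized eigenvalues $p_i = \lambda_i / \sum_j \lambda_j$ of a Hermitian correlation matrix $R$, the effective rank is $\exp(H_2) = 1 / \sum_i p_i^2 = (\operatorname{tr} R)^2 / \operatorname{tr}(R^2)$. The sketch below evaluates that closed form without an eigendecomposition; it assumes this standard definition, and the paper's normalization may differ.

```python
from typing import List, Sequence


def renyi2_effective_rank(R: Sequence[Sequence[complex]]) -> float:
    """Renyi-2 effective rank (tr R)^2 / tr(R^2) of a Hermitian matrix.

    For Hermitian R, tr(R^2) equals the squared Frobenius norm, so the
    sum of |R_ij|^2 is used instead of computing eigenvalues.
    """
    n = len(R)
    trace = sum(R[i][i] for i in range(n)).real
    frob_sq = sum(abs(R[i][j]) ** 2 for i in range(n) for j in range(n))
    return trace ** 2 / frob_sq


def identity(n: int) -> List[List[complex]]:
    return [[1.0 + 0j if i == j else 0j for j in range(n)] for i in range(n)]


# Uncorrelated 4-element array: all four spatial modes are equally strong.
dof_uncorrelated = renyi2_effective_rank(identity(4))

# Fully correlated 4-element array: a single dominant spatial mode.
dof_correlated = renyi2_effective_rank([[1.0 + 0j] * 4 for _ in range(4)])
print(dof_uncorrelated, dof_correlated)
```

The two extremes give 4 and 1 respectively, which brackets the values any 4-element correlation matrix can take.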
C-STEP: Continuous Space-Time Empowerment for Physics-informed Safe Reinforcement Learning of Mobile Agents
Safe navigation in complex environments remains a central challenge for reinforcement learning (RL) in robotics. This paper introduces Continuous Space-Time Empowerment for Physics-informed (C-STEP) safe RL, a novel measure of agent-centric safety tailored to deterministic, continuous domains. This measure can be used to design physics-informed intrinsic rewards by augmenting positive navigation reward functions. The reward incorporates the agent's internal states (e.g., initial velocity) and forward dynamics to differentiate safe from risky behavior. By integrating C-STEP with navigation rewards, we obtain an intrinsic reward function that jointly optimizes task completion and collision avoidance. Numerical results demonstrate fewer collisions, reduced proximity to obstacles, and only marginal increases in travel time. Overall, C-STEP offers an interpretable, physics-informed approach to reward shaping in RL, contributing to safety for agentic mobile robotic systems.
Efficient Controller Learning from Human Preferences and Numerical Data Via Multi-Modal Surrogate Models
Tuning control policies manually to meet high-level objectives is often time-consuming. Bayesian optimization provides a data-efficient framework for automating this process using numerical evaluations of an objective function. However, many systems, particularly those involving humans, require optimization based on subjective criteria. Preferential Bayesian optimization addresses this by learning from pairwise comparisons instead of quantitative measurements, but relying solely on preference data can be inefficient. We propose a multi-fidelity, multi-modal Bayesian optimization framework that integrates low-fidelity numerical data with high-fidelity human preferences. Our approach employs Gaussian process surrogate models with both hierarchical, autoregressive and non-hierarchical, coregionalization-based structures, enabling efficient learning from mixed-modality data. We illustrate the framework by tuning an autonomous vehicle's trajectory planner, showing that combining numerical and preference data significantly reduces the need for experiments involving the human decision maker while effectively adapting driving style to individual preferences.
comment: 8 pages, 4 figures, accepted for ECC 2026
Equivariant Filter Transformations for Consistent and Efficient Visual--Inertial Navigation
This paper presents an equivariant filter (EqF) transformation approach for visual--inertial navigation. By establishing analytical links between EqFs with different symmetries, the proposed approach enables systematic consistency design and efficient implementation. First, we formalize the mapping from the global system state to the local error-state and prove that it induces a nonsingular linear transformation between the error-states of any two EqFs. Second, we derive transformation laws for the associated linearized error-state systems and unobservable subspaces. These results yield a general consistency design principle: for any unobservable system, a consistent EqF with a state-independent unobservable subspace can be synthesized by transforming the local coordinate chart, thereby avoiding ad hoc symmetry analysis. Third, to mitigate the computational burden arising from the non-block-diagonal Jacobians required for consistency, we propose two efficient implementation strategies. These strategies exploit the Jacobians of a simpler EqF with block-diagonal structure to accelerate covariance operations while preserving consistency. Extensive Monte Carlo simulations and real-world experiments validate the proposed approach in terms of both accuracy and runtime.
comment: 28 pages, 11 figures
A Low Cost Discrete Digital Isolator Circuit
This work presents a fully discrete, low-cost digital isolator requiring no specialized ICs and implemented entirely with general-purpose transistors and a two-layer PCB-embedded air-core transformer. The design avoids vendor lock-in and long-term component obsolescence risks, while providing >1 kV isolation, ~200 ns propagation delay, and validated NRZ data rates of 1 Mbps. A modified dual-oscillator architecture enables inherent hardware lockout suitable for half-bridge gate driver applications. Measured performance and PCB layout guidelines are provided.
comment: 5 pages, 6 figures
The impact of sensor placement on graph-neural-network-based leakage detection
Sensor placement for leakage detection in water distribution networks is an important and practical challenge for water utilities. Recent work has shown that graph neural networks can estimate and predict pressures and detect leaks, but their performance strongly depends on the available sensor measurements and configurations. In this paper, we investigate how sensor placement influences the performance of GNN-based leakage detection. We propose a novel PageRank-Centrality-based sensor placement method and demonstrate that it substantially impacts reconstruction, prediction, and leakage detection on the EPANET Net1.
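To make the idea of centrality-driven placement concrete, the following is a minimal sketch that ranks the nodes of a small undirected network by PageRank (plain power iteration) and picks the k highest-scoring nodes as sensor locations. The abstract does not state how the PageRank scores are turned into a placement, so the top-k rule, the function names, and the toy network are all assumptions for illustration.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on a graph given as {node: [neighbors]}."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            neighbors = adjacency[v]
            if neighbors:
                share = damping * rank[v] / len(neighbors)
                for u in neighbors:
                    new_rank[u] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
        rank = new_rank
    return rank


def place_sensors(adjacency, k):
    """Assumed rule: put sensors on the k nodes with the highest PageRank."""
    rank = pagerank(adjacency)
    return sorted(rank, key=rank.get, reverse=True)[:k]


# Hypothetical four-junction network with J2 as the hub.
toy_network = {"J1": ["J2"], "J2": ["J1", "J3", "J4"], "J3": ["J2"], "J4": ["J2"]}
sensors = place_sensors(toy_network, 1)
```

On this toy network the hub junction receives the highest score, so a single sensor is placed there.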
SafeFlow: Real-Time Text-Driven Humanoid Whole-Body Control via Physics-Guided Rectified Flow and Selective Safety Gating
Recent advances in real-time interactive text-driven motion generation have enabled humanoids to perform diverse behaviors. However, kinematics-only generators often exhibit physical hallucinations, producing motion trajectories that are physically infeasible to track with a downstream motion tracking controller or unsafe for real-world deployment. These failures often arise from the lack of explicit physics-aware objectives for real-robot execution and become more severe under out-of-distribution (OOD) user inputs. Hence, we propose SafeFlow, a text-driven humanoid whole-body control framework that combines physics-guided motion generation with a 3-Stage Safety Gate driven by explicit risk indicators. SafeFlow adopts a two-level architecture. At the high level, we generate motion trajectories using Physics-Guided Rectified Flow Matching in a VAE latent space to improve real-robot executability, and further accelerate sampling via Reflow to reduce the number of function evaluations (NFE) for real-time control. The 3-Stage Safety Gate enables selective execution by detecting semantic OOD prompts using a Mahalanobis score in text-embedding space, filtering unstable generations via a directional sensitivity discrepancy metric, and enforcing final hard kinematic constraints such as joint and velocity limits before passing the generated trajectory to a low-level motion tracking controller. Extensive experiments on the Unitree G1 demonstrate that SafeFlow outperforms prior diffusion-based methods in success rate, physical compliance, and inference speed, while maintaining diverse expressiveness.
comment: Project Page: https://hanbyelcho.info/safeflow/
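The first stage of the safety gate, as described, flags out-of-distribution prompts with a Mahalanobis score in text-embedding space. A minimal sketch of that idea follows. To stay dependency-free it assumes a diagonal covariance, which is a simplification of the general Mahalanobis distance; the embeddings, threshold, and function names are hypothetical and not taken from the paper.

```python
import math


def fit_diagonal_gaussian(embeddings, eps=1e-6):
    """Per-dimension mean and variance of in-distribution embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    var = [sum((e[i] - mean[i]) ** 2 for e in embeddings) / n + eps for i in range(dim)]
    return mean, var


def mahalanobis_score(x, mean, var):
    """Mahalanobis distance under the assumed diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def is_ood(x, mean, var, threshold):
    """Gate rule (assumed): reject when the score exceeds a fixed threshold."""
    return mahalanobis_score(x, mean, var) > threshold


# Hypothetical 2-D "embeddings" of in-distribution prompts.
train = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
mean, var = fit_diagonal_gaussian(train)
near_flagged = is_ood([0.5, 0.5], mean, var, threshold=3.0)
far_flagged = is_ood([5.0, 5.0], mean, var, threshold=3.0)
```

A full covariance matrix would capture correlations between embedding dimensions; the diagonal version only rescales each axis.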
Collaboration in Multi-Robot Systems: Taxonomy and Survey over Frameworks for Collaboration
Collaboration is a central theme in multi-robot systems as tasks and demands increasingly require capabilities that go beyond what any one individual robot possesses. Yet, despite extensive work on cooperative control and coordinated behaviors, the terminology surrounding collective multi-robot interaction remains inconsistent across research communities. In particular, cooperation, coordination, and collaboration are often treated interchangeably, without clearly articulating the differences among them. To address this gap, we propose definitions that distinguish and relate cooperation, coordination, and collaboration in multi-robot systems, highlighting the support of new capabilities in collaborative behaviors, and illustrate these concepts through representative examples. Building on this taxonomy, different frameworks for collaboration are reviewed, and technical challenges and promising future research directions are identified for collaborative multi-robot systems.
State-space fading memory
The fading-memory (FM) property captures the progressive loss of influence of past inputs on a system's current output and was originally formalized by Boyd and Chua in an operator-theoretic framework. Despite its importance for systems approximation, reservoir computing, and recurrent neural networks, its connection with state-space notions of nonlinear stability, especially incremental ones, remains understudied. This paper introduces a state-space definition of FM. In state-space, FM can be interpreted as an extension of incremental input-to-output stability ($δ$IOS) that explicitly incorporates a memory kernel upper-bounding the decay of past input differences. It is also closely related to Boyd and Chua's FM definition, with the sole difference of requiring uniform, instead of general, continuity of the memory functional with respect to an input-fading norm. We demonstrate that incremental input-to-state stability ($δ$ISS) implies FM semi-globally for time-invariant systems under an equibounded input assumption. Notably, Boyd and Chua's approximation theorems apply to $δ$ISS state-space models. As a closing application, we show that, under mild assumptions, the state-space model of current-driven memristors possesses the FM property.
comment: 13 pages
High-Density Automated Valet Parking with Relocation-Free Sequential Operations
In this paper, we present DROP, high-Density Relocation-free sequential OPerations in automated valet parking. DROP addresses the challenges in high-density parking & vehicle retrieval without relocations. Each challenge is handled by jointly providing area-efficient layouts and relocation-free parking & exit sequences, considering accessibility with relocation-free sequential operations. To generate such sequences, relocation-free constraints are formulated as explicit logical conditions expressed in boolean variables. Recursive search strategies are employed to derive the logical conditions and enumerate relocation-free sequences under sequential constraints. We demonstrate the effectiveness of our framework through extensive simulations, showing its potential to significantly improve area utilization with relocation-free constraints. We also examine its viability on an application problem with prescribed operational order. The results from all experiments are available at: https://drop-park.github.io.
comment: 7 pages, 6 figures. The results from all experiments are available at: https://drop-park.github.io
Integral Control Barrier Functions with Input Delay: Prediction, Feasibility, and Robustness
Time delays in feedback control loops can cause controllers to respond too late, and with excessively large corrective actions, leading to unsafe behavior (violation of state constraints) and controller infeasibility (violation of input constraints). To address this problem, we develop a safety-critical control framework for nonlinear systems with input delay using dynamically defined (integral) controllers. Building on the concept of Integral Control Barrier Functions (ICBFs), we concurrently address two fundamental challenges: compensating the effect of delays, while ensuring feasibility when state and input constraints are imposed jointly. To this end, we embed predictor feedback into a dynamically defined control law to compensate for delays, with the predicted state evolving according to delay-free dynamics. Then, utilizing ICBFs, we formulate a quadratic program for safe control design. For systems subject to simultaneous state and input constraints, we derive a closed-form feasibility condition for the resulting controller, yielding a compatible ICBF pair that guarantees forward invariance under delay. We also address robustness to prediction errors (e.g., caused by delay uncertainty) using tunable robust ICBFs. Our approach is validated on an adaptive cruise control example with actuation delay.
A Modular Platooning and Vehicle Coordination Simulator for Research and Education
This work presents a modular, Python-based simulator that simplifies the evaluation of novel vehicle control and coordination algorithms in complex traffic scenarios while keeping the implementation overhead low. It allows researchers to focus primarily on developing the control and coordination strategies themselves, while the simulator manages the setup of complex road networks, vehicle configuration, execution of the simulation and the generation of video visualizations of the results. It is thereby also well-suited to support control education by allowing instructors to create interactive exercises providing students with direct visual feedback. Thanks to its modular architecture, the simulator remains easily customizable and extensible, lowering the barrier for conducting advanced simulation studies in vehicle and traffic control research.
comment: 6 pages
Communication-Aware Dissipative Output Feedback Control
Communication-aware control is essential to reduce costs and complexity in large-scale networks. This work proposes a method to design dissipativity-augmented output feedback controllers with reduced online communication. The contributions of this work are threefold: a generalized well-posedness condition for the controller network, a convex relaxation for the constraints that infer stability of a network from dissipativity of its agents, and a synthesis algorithm integrating the Network Dissipativity Theorem, alternating direction method of multipliers, and iterative convex overbounding. The proposed approach yields a sparsely interconnected controller that is both robust and applicable to networks with heterogeneous nonlinear agents. The efficiency of these methods is demonstrated on heterogeneous networks with uncertain and unstable agents, and is compared to standard $\mathcal{H}_\infty$ control.
comment: 6 pages, 2 figures, Submitted to IEEE Control Systems Letters (LCSS)
Towards Safe Learning-Based Non-Linear Model Predictive Control through Recurrent Neural Network Modeling
The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.
Model Predictive Path Integral Control as Preconditioned Gradient Descent
Model Predictive Path Integral (MPPI) control is a popular sampling-based method for trajectory optimization in nonlinear and nonconvex settings, yet its optimization structure remains only partially understood. We develop a variational, optimization-theoretic interpretation of MPPI by lifting constrained trajectory optimization to a KL-regularized problem over distributions and reducing it to a negative log-partition (free-energy) objective over a tractable sampling family. For a general parametric family, this yields a preconditioned gradient method on the distribution parameters and a natural multi-step extension of MPPI. For the fixed-covariance Gaussian family, we show that classical MPPI is recovered exactly as a preconditioned gradient descent step with unit step size. This interpretation enables a direct convergence analysis: under bounded feasible sets, we derive an explicit upper bound on the smoothness constant and a simple sufficient condition guaranteeing descent of exact MPPI. Numerical experiments support the theory and illustrate the effect of key hyperparameters on performance.
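For readers unfamiliar with the update being analyzed, classical MPPI is usually presented as an exponentially weighted average of sampled control sequences drawn from a fixed-covariance Gaussian. The sketch below shows that textbook form on a toy quadratic cost; the cost function, the hyperparameter values, and the choice to fold the rollout into a single `cost_fn` are assumptions for illustration and not the paper's experimental setup.

```python
import math
import random


def mppi_step(mean, cost_fn, rng, num_samples=256, sigma=0.5, temperature=1.0):
    """One MPPI update: sample around the mean, then average with exp(-cost/temperature) weights."""
    samples = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(num_samples)]
    costs = [cost_fn(u) for u in samples]
    best = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    return [
        sum(w * u[t] for w, u in zip(weights, samples)) / total
        for t in range(len(mean))
    ]


# Toy problem (hypothetical): drive a two-step control sequence toward a target.
target = [1.0, -1.0]


def quadratic_cost(u):
    return sum((ui - ti) ** 2 for ui, ti in zip(u, target))


rng = random.Random(0)
controls = [0.0, 0.0]
initial_cost = quadratic_cost(controls)
for _ in range(20):
    controls = mppi_step(controls, quadratic_cost, rng)
final_cost = quadratic_cost(controls)
```

The temperature and sampling standard deviation are the kind of hyperparameters whose effect the abstract says the experiments examine; smaller temperatures concentrate the weights on the lowest-cost samples.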
Conformalized Transfer Learning for Li-ion Battery State of Health Forecasting under Manufacturing and Usage Variability
Accurate forecasting of state-of-health (SOH) is essential for ensuring safe and reliable operation of lithium-ion cells. However, existing models calibrated on laboratory tests at specific conditions often fail to generalize to new cells that differ due to small manufacturing variations or operate under different conditions. To address this challenge, an uncertainty-aware transfer learning framework is proposed, combining a Long Short-Term Memory (LSTM) model with domain adaptation via Maximum Mean Discrepancy (MMD) and uncertainty quantification through Conformal Prediction (CP). The LSTM model is trained on a virtual battery dataset designed to capture real-world variability in electrode manufacturing and operating conditions. MMD aligns latent feature distributions between simulated and target domains to mitigate domain shift, while CP provides calibrated, distribution-free prediction intervals. This framework improves both the generalization and trustworthiness of SOH forecasts across heterogeneous cells.
comment: Submitted to the 2026 American Control Conference (ACC)
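The abstract does not say which conformal prediction variant is used. As one common instance, split conformal prediction turns absolute residuals on a held-out calibration set into a symmetric interval around each point forecast. The sketch below assumes that variant; the residual values, the SOH forecast, and the function names are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample split-conformal quantile of absolute calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for the requested coverage
    return sorted(residuals)[rank - 1]


def prediction_interval(point_forecast, half_width):
    """Symmetric interval around a point forecast."""
    return point_forecast - half_width, point_forecast + half_width


# Hypothetical absolute SOH forecast errors (in percentage points) on calibration cells.
calibration_residuals = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
half_width = conformal_quantile(calibration_residuals, alpha=0.25)
interval = prediction_interval(90.0, half_width)
```

Under exchangeability of calibration and test points, an interval built this way covers the true value with probability at least 1 - alpha, without assuming a particular error distribution.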
Robust Optimal Operation of Virtual Power Plants Under Decision-Dependent Uncertainty of Price Elasticity
The rapid deployment of distributed energy resources (DERs) is one of the essential efforts to mitigate global climate change. However, a vast number of small-scale DERs are difficult to manage individually, motivating the introduction of virtual power plants (VPPs). A VPP operator coordinates a group of DERs by setting suitable prices, and aggregates them for interaction with the power grid. In this context, optimal pricing plays a critical role in VPP operation. This paper proposes a robust optimal operation model for VPPs that considers uncertainty in the price elasticity of demand. Specifically, the demand elasticity is found to be influenced by the pricing decision, giving rise to decision-dependent uncertainty (DDU). An improved column-and-constraint (C&CG) algorithm, together with tailored transformation and reformulation techniques, is developed to solve the robust model with DDU efficiently. Case studies based on actual electricity consumption data of London households demonstrate the effectiveness of the proposed model and algorithm.
comment: 9 pages, 9 figures
On a Co-evolving Opinion-Leadership Model in Social Networks
Leadership in social groups is often a dynamic characteristic that emerges from interactions and opinion exchange. Empirical evidence suggests that individuals with strong opinions tend to gain influence; at the same time, maintaining alignment with the social context is crucial for sustained leadership. Motivated by the social psychology literature that supports these empirical observations, we propose a novel dynamical system in which opinions and leadership co-evolve within a social network. Our model extends the Friedkin-Johnsen framework by making susceptibility to peer influence time-dependent, turning it into the leadership variable. Leadership strengthens when an agent holds strong yet socially aligned opinions, and declines when such alignment is lost, capturing the trade-off between conviction and social acceptance. After illustrating the emergent behavior of this complex system, we formally analyze the coupled dynamics, establishing sufficient conditions for convergence to a non-trivial equilibrium, and examining two time-scale separation regimes reflecting scenarios where opinion and leadership evolve at different speeds.
comment: 8 pages, 6 figures
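The standard Friedkin-Johnsen opinion update that the model builds on can be sketched as follows: each agent mixes the weighted average of its neighbors' opinions with its own initial opinion, according to its susceptibility. The sketch takes the susceptibility vector as an argument so that it can change from step to step; the paper's leadership update law is not given in the abstract and is deliberately not modeled here. All names and numbers are hypothetical.

```python
def friedkin_johnsen_step(opinions, initial_opinions, weights, susceptibility):
    """One Friedkin-Johnsen step: x_i <- lam_i * sum_j W_ij x_j + (1 - lam_i) * x_i(0)."""
    updated = []
    for i, row in enumerate(weights):
        social_average = sum(w * x for w, x in zip(row, opinions))
        lam = susceptibility[i]
        updated.append(lam * social_average + (1.0 - lam) * initial_opinions[i])
    return updated


# Two agents: agent 0 is fully stubborn, agent 1 is fully susceptible.
initial = [1.0, 0.0]
influence = [[0.5, 0.5], [0.5, 0.5]]  # row-stochastic influence weights
next_opinions = friedkin_johnsen_step(initial, initial, influence, [0.0, 1.0])
```

In this toy case the stubborn agent keeps its opinion while the susceptible agent moves to the network average.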
Structure, Analysis, and Synthesis of First-Order Algorithms
Optimization algorithms can be interpreted through the lens of dynamical systems as the interconnection of linear systems and a set of subgradient nonlinearities. This dynamical systems formulation allows for the analysis and synthesis of optimization algorithms by solving robust control problems. In this work, we use the celebrated internal model principle in control theory to structurally factorize convergent composite optimization algorithms into suitable network-dependent internal models and core subcontrollers. As the key benefit, we reveal that this permits us to synthesize optimization algorithms even if information is transmitted over networks featuring dynamical phenomena such as time delays, channel memory, or crosstalk. Design of these algorithms is achieved under bisection in the exponential convergence rate either through a nonconvex local search or by alternation of convex semidefinite programs. We demonstrate factorization of existing optimization algorithms and the automated synthesis of new optimization algorithms in the networked setting.
comment: 72 pages, 27 figures, 6 tables
Cyber-Physical System Design Space Exploration for Affordable Precision Agriculture DATE
Precision agriculture promises higher yields and sustainability, but adoption is slowed by the high cost of cyber-physical systems (CPS) and the lack of systematic design methods. We present a cost-aware design space exploration (DSE) framework for multimodal drone-rover platforms to integrate budget, energy, sensing, payload, computation, and communication constraints. Using integer linear programming (ILP) with SAT-based verification, our approach trades off among cost, coverage, and payload while ensuring constraint compliance and a multitude of alternatives. We conduct case studies on smaller and larger-sized farms to show that our method consistently achieves full coverage within budget while maximizing payload efficiency, outperforming state-of-the-art CPS DSE approaches.
comment: 2026 Design, Automation & Test in Europe Conference (DATE)
Can an Actor-Critic Optimization Framework Improve Analog Design Optimization?
Analog design often slows down because even small changes to device sizes or biases require expensive simulation cycles, and high-quality solutions typically occupy only a narrow part of a very large search space. While existing optimizers reduce some of this burden, they largely operate without the kind of judgment designers use when deciding where to search next. This paper presents an actor-critic optimization framework (ACOF) for analog sizing that brings that form of guidance into the loop. Rather than treating optimization as a purely black-box search problem, ACOF separates the roles of proposal and evaluation: an actor suggests promising regions of the design space, while a critic reviews those choices, enforces design legality, and redirects the search when progress is hampered. This structure preserves compatibility with standard simulator-based flows while making the search process more deliberate, stable, and interpretable. Across our test circuits, ACOF improves the top-10 figure of merit by an average of 38.9% over the strongest competing baseline and reduces regret by an average of 24.7%, with peak gains of 70.5% in FoM and 42.2% lower regret on individual circuits. By combining iterative reasoning with simulation-driven search, the framework offers a more transparent path toward automated analog sizing across challenging design spaces.
comment: 7 pages, 5 figures
IndustriConnect: MCP Adapters and Mock-First Evaluation for AI-Assisted Industrial Operations
AI assistants can decompose multi-step workflows, but they do not natively speak industrial protocols such as Modbus, MQTT/Sparkplug B, or OPC UA. This paper presents INDUSTRICONNECT, a prototype suite of Model Context Protocol (MCP) adapters that expose industrial operations as schema-discoverable AI tools while preserving protocol-specific connectivity and safety controls. The system uses a common response envelope and a mock-first workflow so that adapter behavior can be exercised locally before connecting to plant equipment. A deterministic benchmark covering normal, fault-injected, stress, and recovery scenarios evaluates the flagship adapters. It comprises 870 runs (480 normal, 210 fault-injected, 120 stress, and 60 recovery trials) and 2820 tool calls across 7 fault scenarios and 12 stress scenarios. The normal suite achieved full success, the fault suite confirmed structured error handling with adapter-level uint16 range validation, the stress suite identified concurrency boundaries, and same-session recovery after endpoint restart was demonstrated for all three protocols. Together, the results provide evidence spanning adapter correctness, concurrency behavior, and structured error handling for AI-assisted industrial operations.
Sketch2Simulation: Automating Flowsheet Generation via Multi Agent Large Language Models
Converting process sketches into executable simulation models remains a major bottleneck in process systems engineering, requiring substantial manual effort and simulator-specific expertise. Recent advances in generative AI have improved both engineering-diagram interpretation and LLM-assisted flowsheet generation, but these remain largely disconnected: diagram-understanding methods often stop at extracted graphs, while text-to-simulation workflows assume structured inputs rather than raw visual artifacts. To bridge this gap, we present an end-to-end multi-agent large language model system that converts process diagrams directly into executable Aspen HYSYS flowsheets. The framework decomposes the task into three coordinated layers: diagram parsing and interpretation, simulation model synthesis, and multi-level validation. Specialized agents handle visual interpretation, graph-based intermediate representation construction, code generation for the HYSYS COM interface, execution, and structural verification. We evaluate the framework on four chemical engineering case studies of increasing complexity, from a simple desalting process to an industrial aromatic production flowsheet with multiple recycle loops. The system produces executable HYSYS models in all cases, achieving complete structural fidelity on the two simpler cases and strong performance on the more complex ones, with connection consistency above 0.93 and stream consistency above 0.96. These results demonstrate a viable end-to-end sketch-to-simulation workflow while highlighting remaining challenges in dense recycle structures, implicit diagram semantics, and simulator-interface constraints.
comment: 27 pages, 14 figures, 8 tables
Recurrent neural network-based robust control systems with regional properties and application to MPC design
This paper investigates the design of output-feedback schemes for systems described by a class of recurrent neural networks. We propose a procedure based on linear matrix inequalities for designing an observer and a static state-feedback controller. The algorithm leverages global and regional incremental input-to-state stability (incremental ISS) and enables the tracking of constant setpoints, ensuring robustness to disturbances and state estimation uncertainty. To address the potential limitations of regional incremental ISS, we introduce an alternative scheme in which the static law is replaced with a tube-based nonlinear model predictive controller (NMPC) that exploits regional incremental ISS properties. We show that these conditions enable the formulation of a robust NMPC law with guarantees of convergence and recursive feasibility, leading to an enlarged region of attraction. Theoretical results are validated through numerical simulations on the pH-neutralisation process benchmark.
comment: 27 pages, 5 figures
Achieving distributed convex optimization within prescribed time for high-order nonlinear multiagent systems
In this paper, we address the distributed prescribed-time convex optimization (DPTCO) problem for a class of nonlinear multi-agent systems (MASs) under an undirected connected graph. A cascade design framework is proposed such that the DPTCO implementation is divided into two parts: distributed optimal trajectory generator design and local reference trajectory tracking controller design. The DPTCO problem is then transformed into the prescribed-time stabilization problem of a cascaded system. A changing Lyapunov function method and a time-varying state transformation method, together with sufficient conditions, are proposed to prove the prescribed-time stabilization of the cascaded system as well as the uniform boundedness of internal signals in the closed-loop systems. The proposed framework is then utilized to solve the robust DPTCO problem for a class of chain-integrator MASs with external disturbances by constructing novel variables and exploiting the property of time-varying gains. The proposed framework is further utilized to solve the adaptive DPTCO problem for a class of strict-feedback MASs with parameter uncertainty, in which a backstepping method with a prescribed-time dynamic filter is adopted. The descending power state transformation is introduced to compensate for the growth of the increasing rate induced by the derivative of time-varying gains in recursive steps, and the high-order derivative of the local reference trajectory is not required. Finally, theoretical results are verified by two numerical examples.
comment: 14 pages
Time-Optimal Model Predictive Control for Linear Systems with Multiplicative Uncertainties
This paper presents a time-optimal Model Predictive Control (MPC) scheme for linear discrete-time systems subject to multiplicative uncertainties represented by interval matrices. To render the uncertainty propagation computationally tractable, the set-valued error system dynamics are approximated using a matrix-zonotope-based bounding operator. Recursive feasibility and finite-time convergence are ensured through an adaptive terminal constraint mechanism. A key advantage of the proposed approach is that all the necessary bounding sets can be computed offline, substantially reducing the online computational burden. The effectiveness of the method is illustrated via a numerical case study on an orbital rendezvous maneuver between two satellites.
RadioDiff-FS: Physics-Informed Manifold Alignment in Few-Shot Diffusion Models for High-Fidelity Radio Map Construction
Radio maps (RMs) provide spatially continuous propagation characterizations essential for 6G network planning, but high-fidelity RM construction remains challenging. Rigorous electromagnetic solvers incur prohibitive computational latency, while data-driven models demand massive labeled datasets and generalize poorly from simplified simulations to complex multipath environments. This paper proposes RadioDiff-FS, a few-shot diffusion framework that adapts a pre-trained main-path generator to multipath-rich target domains with only a small number of high-fidelity samples. The adaptation is grounded in a theoretical decomposition of the multipath RM into a dominant main-path component and a directionally sparse residual. This decomposition shows that the cross-domain shift corresponds to a bounded and geometrically structured feature translation rather than an arbitrary distribution change. A Direction-Consistency Loss (DCL) is then introduced to constrain diffusion score updates along physically plausible propagation directions, thereby suppressing phase-inconsistent artifacts that arise in the low-data regime. Experiments show that RadioDiff-FS reduces NMSE by 59.5% on static RMs and by 74.0% on dynamic RMs relative to the vanilla diffusion baseline, achieving an SSIM of 0.9752 and a PSNR of 36.37 dB under severely limited supervision.
Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments
Autonomous aerial vehicles (AAVs) empower sixth-generation (6G) Internet-of-Things (IoT) networks through mobility-driven data collection. However, conventional reward-driven reinforcement learning for AAV trajectory planning suffers from severe credit assignment issues and training instability, because sparse scalar rewards fail to capture the long-term and nonlinear effects of sequential movements. To address these challenges, this paper proposes Learn for Variation (L4V), a gradient-informed trajectory learning framework that replaces high-variance scalar reward signals with dense and analytically grounded policy gradients. In particular, the coupled evolution of AAV kinematics, distance-dependent channel gains, and per-user data-collection progress is first unrolled into an end-to-end differentiable computational graph. Backpropagation through time then serves as a discrete adjoint solver, which propagates exact sensitivities from the cumulative mission objective to every control action and policy parameter. These structured gradients are used to train a deterministic neural policy with temporal smoothness regularization and gradient clipping. Extensive simulations demonstrate that L4V consistently outperforms representative baselines, including a genetic algorithm, DQN, A2C, and DDPG, in mission completion time, average transmission rate, and training cost.
Risk Assessment and Vulnerability Identification of Energy-Transportation Infrastructure Systems to Extreme Weather
The interaction between extreme weather events and interdependent critical infrastructure systems involves complex spatiotemporal dynamics. Multi-type emergency decisions within energy-transportation infrastructures significantly influence system performance throughout the extreme weather process. A comprehensive assessment of these factors faces challenges in model complexity, heterogeneous differences between energy and transportation systems, and cross-sector privacy. This paper proposes a risk assessment framework that integrates the heterogeneous energy and transportation systems in the form of a unified network flow model, which enables full accommodation of multiple types of energy-transportation emergency decisions while capturing the compound spatiotemporal impacts of extreme weather on both systems simultaneously. Based on this framework, a targeted method for identifying system vulnerabilities is further developed. This method employs neural network surrogates to achieve privacy protection and accelerated identification while maintaining consideration of system interdependencies. Numerical experiments demonstrate that the proposed framework and method can reveal the risk levels faced by urban infrastructure systems, identify vulnerabilities that should be prioritized for reinforcement, and strike a balance between accuracy and speed.
comment: Our paper has been accepted by IEEE Transactions on Industry Applications on 25-Jan-2026
A Digital Twin of Evaporative Thermo-Fluidic Process in Fixation Unit of DoD Inkjet Printers
In inkjet printing, optimal paper moisture is crucial for print quality, achieved through hot-air impingement in the fixation unit. This paper presents a modular digital twin of the fixation unit, modeling the thermo-fluidic drying process and monitoring its spatio-temporal performance. The novel approach formulates the digital twin as an infinite-dimensional state estimator that infers fixation states from limited sensor data, while remaining robust to disturbances. Modularity is achieved through a graph-theoretic model, where each node represents thermo-fluidic dynamics in different sections of the fixation unit. Evaporation is modeled as a nonlinear boundary effect coupled with node dynamics via Linear Fractional Representation. Using the Partial Integral Equation (PIE) framework, we develop a unified approach for stability, input-output analysis, simulation, and rapid prototyping, validated with operational data from a commercial printer. An $\mathcal{H}_{\infty}$-optimal Luenberger state estimator is then synthesized to estimate thermal states from available sensor data, enabling real-time monitoring of spatio-temporal thermal effects on paper sheets.
Optimal Control for Steady Circulation of a Diffusion Process via Spectral Decomposition of Fokker-Planck Equation
We present a formulation of an optimal control problem for a two-dimensional diffusion process governed by a Fokker-Planck equation to achieve a nonequilibrium steady state with a desired circulation while accelerating convergence toward the stationary distribution. To achieve the control objective, we add costs on both the probability density function and the flux rotation to the objective functional. We formulate the optimal control problem through dimensionality reduction of the Fokker-Planck equation via eigenfunction expansion, which incurs low computational cost. We demonstrate through numerical simulations that the proposed optimal control achieves the desired circulation while accelerating convergence to the stationary distribution.
Prescriptive Artificial Intelligence: A Formal Paradigm for Auditing Human Decisions Under Uncertainty AAAI
We formalize Prescriptive Artificial Intelligence as a distinct paradigm for human-AI decision collaboration in high-stakes environments. Unlike predictive systems optimized for outcome accuracy, prescriptive systems are designed to recommend and audit human decisions under uncertainty, providing normative guidance while preserving human agency and accountability. We introduce four domain-independent axioms characterizing prescriptive systems and prove fundamental separation results. Central among these is the Imitation Incompleteness theorem, which establishes that supervised learning from historical decisions cannot correct systematic decision biases in the absence of external normative signals. Consequently, performance in decision imitation is bounded by a structural bias term $\epsilon_{\mathrm{bias}}$ rather than the statistical learning rate $O(1/\sqrt{n})$. This result formalizes the empirically observed accuracy ceiling in human decision imitation tasks and provides a principled criterion for when automation should be replaced by epistemic auditing. We demonstrate the computational realizability of the framework through an interpretable fuzzy inference system, applied as a stress test in elite soccer decision-making, where it reveals systematic decision latency and risk states obscured by outcome and status quo biases. The proposed framework establishes Prescriptive AI as a general, realizable class of decision-support systems applicable across safety-critical domains in which interpretability, contestability, and normative alignment are essential.
comment: Preprint; suitable for AI, decision sciences, and prescriptive analytics. Short versions published in Wharton Sports Analytics Journal Fall 2025 (AI Feature Spotlight) and accepted to AAAI Bridge on LM Reasoning 2026
Datamodel-Based Data Selection for Nonlinear Data-Enabled Predictive Control
Data-Enabled Predictive Control (DeePC) has emerged as a powerful framework for controlling unknown systems directly from input-output data. For nonlinear systems, recent work has proposed selecting relevant subsets of data columns based on geometric proximity to the current operating point. However, such proximity-based selection ignores the control objective: different reference trajectories may benefit from different data even at the same operating point. In this paper, we propose a datamodel-based approach that learns a context-dependent influence function mapping the current initial trajectory and reference trajectory to column importance scores. Adapting the linear datamodel framework from machine learning, we model closed-loop cost as a linear function of column inclusion indicators, with coefficients that depend on the control context. Training on closed-loop simulations, our method captures which data columns actually improve tracking performance for specific control tasks. Experimental results demonstrate that task-aware selection substantially outperforms geometry-based heuristics, particularly when using small data subsets.
DM-MPPI: Datamodel for Efficient and Safe Model Path Integral Control
We extend the Datamodels framework from supervised learning to Model Predictive Path Integral (MPPI) control. Whereas Datamodels estimate sample influence via regression on a fixed dataset, we instead learn to predict influence directly from sample cost features, enabling real-time estimation for newly generated samples without online regression. Our influence predictor is trained offline using influence coefficients computed via the Datamodel framework across diverse MPPI instances, and is then deployed online for efficient sample pruning and adaptive constraint handling. A single learned model simultaneously addresses efficiency and safety: low-influence samples are pruned to reduce computational cost, while monitoring the influence of constraint-violating samples enables adaptive penalty tuning. Experiments on path-tracking with obstacle avoidance demonstrate up to a $5\times$ reduction in the number of samples while maintaining control performance and improving constraint satisfaction.
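The sample-pruning idea in the DM-MPPI abstract above can be sketched on top of the standard MPPI weighting rule. This is a minimal illustration under stated assumptions: the MPPI weight itself is used as a stand-in influence score, whereas the paper trains a separate influence predictor from sample cost features. The function names and the `keep_fraction` parameter are hypothetical.

```python
import math


def mppi_weights(costs, lam):
    """Standard MPPI softmin weights, exp(-cost / lam), normalized to sum to 1."""
    c_min = min(costs)  # subtract the minimum for numerical stability
    raw = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(raw)
    return [r / total for r in raw]


def prune_and_average(controls, costs, lam, keep_fraction):
    """Keep only the highest-scoring samples, then form the weighted control.

    Assumption: the influence score is the MPPI weight, not a learned predictor.
    """
    w = mppi_weights(costs, lam)
    k = max(1, int(len(costs) * keep_fraction))
    kept = sorted(range(len(costs)), key=lambda i: w[i], reverse=True)[:k]
    norm = sum(w[i] for i in kept)
    control = sum(w[i] * controls[i] for i in kept) / norm
    return control, kept
```

With scalar controls `[0.0, 1.0, 2.0, 3.0]`, costs `[10.0, 1.0, 1.0, 50.0]`, `lam=1.0`, and `keep_fraction=0.5`, the two low-cost samples survive and the high-cost samples no longer contribute to the update.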
AURORA: Autonomous Updating of ROM and Controller via Recursive Adaptation
Real-time model-based control of high-dimensional nonlinear systems presents severe computational challenges. Conventional reduced-order model (ROM) control relies heavily on expert tuning or parameter adaptation and seldom offers mechanisms for online supervised reconstruction. We introduce AURORA, Autonomous Updating of ROM and Controller via Recursive Adaptation, a supervisory framework that automates ROM-based controller design and augments it with diagnostic-triggered structural adaptation. Five specialized agents collaborate through iterative generate-judge-revise cycles, while an Evaluation Agent classifies performance degradation into three operationally distinct categories (subspace inadequacy, parametric drift, and control inadequacy) and routes corrective action to the responsible agent. For linear ROMs, we analytically prove that this classification is correct under mild assumptions and that the supervisory switching cycle preserves exponential stability subject to a dwell-time condition. For nonlinear systems, the absence of a universal Lyapunov construction for autonomously discovered ROM structures precludes analogous analytical guarantees, so we validate the same classification empirically. Experiments on eight benchmark systems with state dimensions up to 5177 compare AURORA against expert-tuned baselines, gain-scheduled control, and online RLS adaptive alternatives. Controlled fault-injection experiments confirm 91% diagnostic routing accuracy. AURORA achieves 6 to 12% tracking improvement over expert baselines and 4 to 5% over classical adaptive alternatives.
Smart Predict-Then-Control: Control-Aware Surrogate Refinement for System Identification
This paper introduces Smart Predict-Then-Control (SPC), a control-aware refinement procedure for model-based control. SPC refines a prediction-oriented model by optimizing a surrogate objective that evaluates candidate models through the control actions they induce. For a fixed surrogate variant under unconstrained control, we establish the smoothness of the surrogate, projected gradient convergence at a sublinear rate of $O(1/K)$, and a bias decomposition that yields a conditional transfer diagnostic. On a wind-disturbed quadrotor trajectory-tracking task, Updated SPC reduces tracking RMSE by 70% and closed-loop cost by 42% relative to the nominal baseline.
Fast Relax-and-Round Unit Commitment with Economic Horizons
We extend our novel computational method for unit commitment (UC) to include long-horizon planning. We introduce a fast novel algorithm that commits hydro-generators with provable accuracy. We solve problems with thousands of generators at 5-minute market intervals. We show that our method can solve interconnect-size UC problems in approximately 1 minute on commodity hardware and that an increased planning horizon leads to sizable operational cost savings (our objective). This scale is infeasible for current state-of-the-art tools. We attain this runtime improvement by introducing a heuristic tailored for UC problems. Our method can be implemented using existing continuous optimization solvers and adapted for different applications. Combined, the two algorithms would allow an operator of large systems with hydro units to make horizon-aware economic decisions.
comment: 6 pages (journal limit), 6 figures
A day-ahead market model for power systems: benchmarking and security implications
Power system security assessments, e.g. via cascading outage models, often use operational set-points based on optimal power flow (OPF) dispatch. However, driven by cost minimization, OPF provides an ideal, albeit unrealistic, clearing of the generating units that disregards the complex interactions among market participants. In addition, existing market modeling tools often utilize economic dispatch and unit commitment to minimize total system costs, often disregarding the profit-driven behavior of market participants. The security of the system, therefore, may be overestimated. To address this gap, we introduce a social-welfare-based day-ahead market-clearing model. The security implications are analyzed using Cascades, a model for cascading failure analysis. We apply this model to the IEEE-118 bus system with three independent control zones. The results show that market dispatch leads to demand not served (DNS) up to 80% higher than under OPF dispatch, highlighting a significant security overestimation. This is especially pronounced in large-scale cascading events with DNS above 100 MW. A key driver is the increased dispatch of storage and gas units, which can place the system in critical operating conditions. Operators can use this information to properly estimate the impact of the market on system security and plan efficient expansion strategies.
A Model Predictive Control Approach to Dual-Axis Agrivoltaic Panel Tracking
Agrivoltaic systems--photovoltaic (PV) panels installed above agricultural land--have emerged as a promising dual-use solution to address competing land demands for food and energy production. In this paper, we propose a model predictive control (MPC) approach to dual-axis agrivoltaic panel tracking control that dynamically adjusts panel positions in real time to maximize power production and crop yield given solar irradiance and ambient temperature measurements. We apply convex relaxations and shading factor approximations to reformulate the MPC optimization problem as a convex second-order cone program that determines the PV panel position adjustments away from the sun-tracking trajectory. Through case studies, we demonstrate our approach, exploring the Pareto front between i) an approach that maximizes power production without considering crop needs and ii) crop yield with no agrivoltaics. We also conduct a case study exploring the impact of forecast error on MPC performance. We find that dynamically adjusting agrivoltaic panel position helps us actively manage the trade-offs between power production and crop yield, and that active panel control enables the agrivoltaic system to achieve land equivalent ratio values of up to 1.897.
comment: 10 pages
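For readers unfamiliar with the land equivalent ratio (LER) quoted in the agrivoltaic abstract above, a commonly used definition sums the relative crop yield and the relative energy yield of the dual-use system. The symbols below are illustrative assumptions, and the paper may use a different formulation.

```latex
% Commonly used LER definition (symbols are illustrative, not taken from the paper):
%   Y^{AV}_{crop}   crop yield under the agrivoltaic system
%   Y^{mono}_{crop} crop yield on the same land with no panels
%   E^{AV}_{PV}     energy produced by the agrivoltaic system
%   E^{mono}_{PV}   energy produced by a PV-only installation on the same land
\mathrm{LER} = \frac{Y^{\mathrm{AV}}_{\mathrm{crop}}}{Y^{\mathrm{mono}}_{\mathrm{crop}}}
             + \frac{E^{\mathrm{AV}}_{\mathrm{PV}}}{E^{\mathrm{mono}}_{\mathrm{PV}}}
```

Under this definition, an LER above 1 means the combined system produces more per unit of land than growing crops and generating power on separate plots.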
Planning Future Microgrids with Second-Life Batteries: A Degradation-Aware Iterative Optimization Framework
The growing availability of second-life batteries (SLBs) from electric vehicles is reshaping future microgrid design, requiring planning frameworks that explicitly account for reduced capacity and efficiency over time. However, traditional microgrid planning models often neglect degradation effects or rely on highly simplified formulations, leading to unreliable sizing decisions and increased long-term costs. This paper proposes a degradation-aware iterative optimization framework for long-term microgrid planning that incorporates photovoltaic efficiency fading, battery capacity and efficiency degradation, and SLB characteristics. A cumulative multi-year optimization model is first solved to obtain an initial investment and operational strategy under simplified degradation assumptions, ensuring computational tractability. Subsequently, a yearly validation model evaluates degradation impacts on photovoltaic and battery assets, updating efficiencies and available capacity to assess reliability. An iterative refinement process then adjusts resource allocation to eliminate load shedding while minimizing total system cost. Sensitivity analyses on photovoltaic degradation rates, SLB capital costs, and grid tariffs are conducted to evaluate robustness under varying technical and economic conditions. Results demonstrate that neglecting degradation can compromise reliability and increase blackout risk, while SLBs offer meaningful cost-saving opportunities. The proposed framework provides a scalable and practical tool for planning future microgrids in degradation-constrained environments.
Robotics
LiZIP: An Auto-Regressive Compression Framework for LiDAR Point Clouds
The massive volume of data generated by LiDAR sensors in autonomous vehicles creates a bottleneck for real-time processing and vehicle-to-everything (V2X) transmission. Existing lossless compression methods often force a trade-off: industry standard algorithms (e.g., LASzip) lack adaptability, while deep learning approaches suffer from prohibitive computational costs. This paper proposes LiZIP, a lightweight, near-lossless zero-drift compression framework based on neural predictive coding. By utilizing a compact Multi-Layer Perceptron (MLP) to predict point coordinates from local context, LiZIP efficiently encodes only the sparse residuals. We evaluate LiZIP on the NuScenes and Argoverse datasets, benchmarking against GZip, LASzip, and Google Draco (configured with 24-bit quantization to serve as a high-precision geometric baseline). Results demonstrate that LiZIP consistently achieves superior compression ratios across varying environments. The proposed system achieves a 7.5%-14.8% reduction in file size compared to the industry-standard LASzip and outperforms Google Draco by 8.8%-11.3% across diverse datasets. Furthermore, the system demonstrates generalization capabilities on the unseen Argoverse dataset without retraining. Against the general purpose GZip algorithm, LiZIP achieves a reduction of 38%-48%. This efficiency offers a distinct advantage for bandwidth constrained V2X applications and large scale cloud archival.
comment: 8 pages
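The predictive-coding principle behind the LiZIP entry above (predict each point from context, store only the residual) can be sketched as follows. This is a hedged illustration, not LiZIP itself: a fixed linear extrapolation stands in for the paper's MLP predictor, `zlib` stands in for its entropy coder, and all function names are hypothetical. Prediction runs on already-quantized integers, so the decoder reproduces the quantized points exactly and errors do not accumulate.

```python
import struct
import zlib


def quantize(points, step):
    """Map float (x, y, z) coordinates to integer grid indices (the only lossy step)."""
    return [tuple(round(c / step) for c in p) for p in points]


def predict(pts, i):
    """Stand-in predictor for point i using only points before i (not an MLP)."""
    if i == 0:
        return (0, 0, 0)
    if i == 1:
        return pts[0]
    # Linear extrapolation from the two previous points.
    return tuple(2 * a - b for a, b in zip(pts[i - 1], pts[i - 2]))


def encode(qpoints):
    """Store integer residuals between each point and its prediction, then compress."""
    residuals = []
    for i, q in enumerate(qpoints):
        pred = predict(qpoints, i)
        residuals.extend(c - p for c, p in zip(q, pred))
    return zlib.compress(struct.pack(f"<{len(residuals)}i", *residuals))


def decode(blob, n):
    """Rebuild the n quantized points by re-running the same predictor."""
    residuals = struct.unpack(f"<{3 * n}i", zlib.decompress(blob))
    out = []
    for i in range(n):
        pred = predict(out, i)
        r = residuals[3 * i: 3 * i + 3]
        out.append(tuple(p + d for p, d in zip(pred, r)))
    return out
```

When the predictor is good, most residuals are zero or near zero, which is what makes them cheap to entropy-code. Swapping in a learned predictor changes only `predict`, provided the encoder and decoder evaluate it identically.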
PHANTOM Hand IROS
Tendon-driven underactuated hands excel in adaptive grasping but often suffer from kinematic unpredictability and highly non-linear force transmission. This ambiguity limits their ability to perform precise free-motion shaping and deliver reliable payloads for complex manipulation tasks. To address this, we introduce the PHANTOM Hand (Hybrid Precision-Augmented Compliance): a modular, 1:1 human-scale system featuring 6 actuators and 15 degrees of freedom (DoFs). We propose a unified framework that bridges the gap between precise analytic shaping and robust compliant grasping. By deriving a sparse mapping from physical geometry and integrating a mechanics-based compensation model, we effectively suppress kinematic drift caused by spring counter-tension and tendon elasticity. This approach achieves sub-degree kinematic reproducibility for free-motion planning while retaining the inherent mechanical compliance required for stable physical interaction. Experimental validation confirms the system's capabilities through (1) kinematic analysis verifying sub-degree global accuracy across the workspace; (2) static expressibility tests demonstrating complex hand gestures; (3) diverse grasping experiments covering power, precision, and tool-use categories; and (4) quantitative fingertip force characterization. The results demonstrate that the PHANTOM hand successfully combines analytic kinematic precision with continuous, predictable force output, significantly expanding the payload and dexterity of underactuated hands. To drive the development of the underactuated manipulation ecosystem, all hardware designs and control scripts are fully open-sourced for community engagement.
comment: 8 pages. Submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
Active Robotic Perception for Disease Detection and Mapping in Apple Trees IROS 2026
Large-scale orchard production requires timely and precise disease monitoring, yet routine manual scouting is labor-intensive and financially impractical at the scale of modern operations. As a result, disease outbreaks are often detected late and tracked at coarse spatial resolutions, typically at the orchard-block level. We present an autonomous mobile active perception system for targeted disease detection and mapping in dormant apple trees, demonstrated on one of the most devastating diseases affecting apple today -- fire blight. The system integrates flash-illuminated stereo RGB sensing, real-time depth estimation, instance-level segmentation, and confidence-aware semantic 3D mapping to achieve precise localization of disease symptoms. Semantic predictions are fused into the volumetric occupancy map representation enabling the tracking of both occupancy and per-voxel semantic confidence, building actionable spatial maps for growers. To actively refine observations within complex canopies, we evaluate three viewpoint planning strategies within a unified perception-action loop: a deterministic geometric baseline, a volumetric next-best-view planner that maximizes unknown-space reduction, and a semantic next-best-view planner that prioritizes low-confidence symptomatic regions. Experiments on a fabricated lab tree and five simulated symptomatic trees demonstrate reliable symptom localization and mapping as a precursor to a field evaluation. In simulation, the semantic planner achieves the highest F1 score (0.6106) after 30 viewpoints, while the volumetric planner achieves the highest ROI coverage (85.82%). In the lab setting, the semantic planner attains the highest final F1 (0.9058), with both next-best-view planners substantially improving coverage over the baseline.
comment: 8 pages, 6 figures, IROS 2026 conference
AirSimAG: A High-Fidelity Simulation Platform for Air-Ground Collaborative Robotics
As spatial intelligence continues to evolve, heterogeneous multi-agent systems, particularly the collaboration between Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs), have demonstrated strong potential in complex applications such as search and rescue, urban surveillance, and environmental monitoring. However, existing simulation platforms are primarily designed for single-agent dynamics and lack dedicated frameworks for interactive air-ground collaborative simulation. In this paper, we present AirSimAG, a high-fidelity air-ground collaborative simulation platform built upon an extensively customized AirSim framework. The platform enables synchronized multi-agent simulation and supports heterogeneous sensing and control interfaces for UAV-UGV systems. To demonstrate its capabilities, we design a set of representative air-ground collaborative tasks, including mapping, planning, tracking, formation, and exploration. We further provide quantitative analyses based on these tasks to illustrate the platform's effectiveness in supporting multi-agent coordination and cross-modal data consistency. The AirSimAG simulation platform is publicly available at https://github.com/BIULab-BUAA/AirSimAG.
Tightly-Coupled Radar-Visual-Inertial Odometry
Visual-Inertial Odometry (VIO) is a staple for reliable state estimation on constrained and lightweight platforms due to its versatility and demonstrated performance. However, pertinent challenges regarding robust operation in dark, low-texture, obscured environments complicate the use of such methods. Alternatively, Frequency Modulated Continuous Wave (FMCW) radars, and by extension Radar-Inertial Odometry (RIO), offer robustness to these visual challenges, albeit at the cost of reduced information density and worse long-term accuracy. To address these limitations, this work combines the two in a tightly coupled manner, enabling the resulting method to operate robustly regardless of environmental conditions or trajectory dynamics. The proposed method fuses image features, radar Doppler measurements, and Inertial Measurement Unit (IMU) measurements within an Iterated Extended Kalman Filter (IEKF) in real-time, with radar range data augmenting the visual feature depth initialization. The method is evaluated through flight experiments conducted in both indoor and outdoor environments, as well as through challenges to both exteroceptive modalities (such as darkness, fog, or fast flight), thoroughly demonstrating its robustness. The implementation of the proposed method is available at: https://github.com/ntnu-arl/radvio .
comment: 8 pages, 9 figures, Accepted to the 2026 European Control Conference (ECC)
Learning Actuator-Aware Spectral Submanifolds for Precise Control of Continuum Robots
Continuum robots exhibit high-dimensional, nonlinear dynamics which are often coupled with their actuation mechanism. Spectral submanifold (SSM) reduction has emerged as a leading method for reducing high-dimensional nonlinear dynamical systems to low-dimensional invariant manifolds. Our proposed control-augmented SSMs (caSSMs) extend this methodology by explicitly incorporating control inputs into the state representation, enabling these models to capture nonlinear state-input couplings. Training these models relies solely on controlled decay trajectories of the actuator-augmented state, thereby removing the additional actuation-calibration step commonly needed by prior SSM-for-control methods. We learn a compact caSSM model for a tendon-driven trunk robot, enabling real-time control and reducing open-loop prediction error by 40% compared to existing methods. In closed-loop experiments with model predictive control (MPC), caSSM reduces tracking error by 52%, demonstrating improved performance against Koopman and SSM based MPC and practical deployability on hardware continuum robots.
YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception
The interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework are examined here. The approach addresses a key limitation of computer vision for autonomous vehicle perception and beyond: these systems offer limited transparency regarding the reliability of their confidence scores in visually degraded or ambiguous scenes. To address this limitation, a Kolmogorov-Arnold network is employed as an interpretable post-hoc surrogate to model the trustworthiness of the You Only Look Once (YOLOv10) detections using seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network enables direct visualisation of each feature's influence. This produces smooth and transparent functional mappings that reveal when the model's confidence is well supported and when it is unreliable. Experiments on both the Common Objects in Context (COCO) dataset and images from the University of Bath campus demonstrate that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture. This provides actionable insights for filtering, review, or downstream risk mitigation. Furthermore, a bootstrapped language-image (BLIP) foundation model generates descriptive captions of each scene. This enables a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates. It offers a transparent and practical perception component for autonomous and multimodal artificial intelligence applications.
comment: 14 pages, 23 Figures, 6 Tables
Generative Event Pretraining with Foundation Model Alignment
Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
Design Guidelines for Nonlinear Kalman Filters via Covariance Compensation
Nonlinear extensions of the Kalman filter (KF), such as the extended Kalman filter (EKF) and the unscented Kalman filter (UKF), are indispensable for state estimation in complex dynamical systems, yet the conditions for a nonlinear KF to provide robust and accurate estimations remain poorly understood. This work proposes a theoretical framework that identifies the causes of failure and success in certain nonlinear KFs and establishes guidelines for their improvement. Central to our framework is the concept of covariance compensation: the deviation between the covariance predicted by a nonlinear KF and that of the EKF. With this definition and detailed theoretical analysis, we derive three design guidelines for nonlinear KFs: (i) invariance under orthogonal transformations, (ii) sufficient covariance compensation beyond the EKF baseline, and (iii) selection of compensation magnitude that favors underconfidence. Both theoretical analysis and empirical validation confirm that adherence to these principles significantly improves estimation accuracy, whereas fixed parameter choices commonly adopted in the literature are often suboptimal. The codes and the proofs for all the theorems in this paper are available at https://github.com/Shida-Jiang/Guidelines-for-Nonlinear-Kalman-Filters.
comment: This manuscript has been accepted by ACC 2026
Task-Aware Positioning for Improvisational Tasks in Mobile Construction Robots via an AI Agent with Multi-LMM Modules
Due to the ever-changing nature of construction, many tasks on sites occur in an improvisational manner. Existing mobile construction robot studies remain limited in addressing improvisational tasks, where task-required locations, timing of task occurrence, and contextual information required for task execution are not known in advance. We propose an agent that understands improvisational tasks given in natural language, identifies the task-required location, and positions itself. The agent's functionality was decomposed into three Large Multimodal Model (LMM) modules operating in parallel, enabling the application of LMMs for task interpretation and breakdown, construction drawing-based navigation, and visual reasoning to identify non-predefined task-required locations. The agent was implemented with a quadruped robot and achieved a 92.2% success rate for identifying and positioning at task-required locations across three tests designed to assess improvisational task handling. This study enables mobile construction robots to perform non-predefined tasks autonomously.
Agile-VLA: Few-Shot Industrial Pose Rectification via Implicit Affordance Anchoring IROS
Deploying Vision-Language-Action (VLA) models on resource-constrained edge platforms encounters a fundamental conflict between high-latency semantic inference and the high-frequency control required for dynamic manipulation. To address the challenge, this paper presents Agile-VLA, a hierarchical framework designed for industrial pose reorientation tasks on edge devices such as the NVIDIA Jetson Orin Nano. The core innovation is an Implicit Affordance Anchoring mechanism that directly maps geometric visual cues, specifically centroid and rim keypoint anchors, into structured parametric action primitives, thereby substantially reducing reliance on high-latency semantic inference during closed-loop control. By decoupling perception (10 Hz) from control (50 Hz) via an asynchronous dual-stream architecture, the system effectively mitigates the frequency mismatch inherent in edge-based robot learning. Experimental results on a standard 6-DoF manipulator demonstrate that Agile-VLA achieves robust rectification of complex, irregular workpieces using only 5-shot demonstrations through extrinsic dexterity.
comment: 8 pages. Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
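The rate decoupling described in the Agile-VLA abstract (perception at 10 Hz, control at 50 Hz) can be illustrated with a small tick-based simulation. This is only a sketch under stated assumptions: the two rates come from the abstract, but the function name `simulate` and the "control always reads the newest available perception frame" policy are hypothetical and are not taken from the paper's implementation.

```python
PERCEPTION_HZ = 10  # perception rate stated in the abstract
CONTROL_HZ = 50     # control rate stated in the abstract


def simulate(duration_s=1):
    """Return, for each control tick, the index of the perception frame it reads.

    Assumption (not from the paper): the control loop never blocks on
    perception; it simply reuses the newest frame that already exists.
    """
    used_frames = []
    for tick in range(duration_s * CONTROL_HZ):
        # Integer arithmetic: the number of perception periods fully elapsed
        # at this control tick is the index of the newest available frame.
        latest_frame = (tick * PERCEPTION_HZ) // CONTROL_HZ
        used_frames.append(latest_frame)
    return used_frames


frames = simulate(1)
print(frames[:6])  # the first five control ticks share perception frame 0
```

With these rates each perception frame is reused for five consecutive control ticks, which is the frequency mismatch the asynchronous design has to absorb.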
Grounding Sim-to-Real Generalization in Dexterous Manipulation: An Empirical Study with Vision-Language-Action Models
Learning a generalist control policy for dexterous manipulation typically relies on large-scale datasets. Given the high cost of real-world data collection, a practical alternative is to generate synthetic data through simulation. However, the resulting synthetic data often exhibits a significant gap from real-world distributions. While many prior studies have proposed algorithms to bridge the Sim-to-Real discrepancy, there remains a lack of principled research that grounds these methods in real-world manipulation tasks, particularly their performance on generalist policies such as Vision-Language-Action (VLA) models. In this study, we empirically examine the primary determinants of Sim-to-Real generalization across four dimensions: multi-level domain randomization, photorealistic rendering, physics-realistic modeling, and reinforcement learning updates. To support this study, we design a comprehensive evaluation protocol to quantify the real-world performance of manipulation tasks. The protocol accounts for key variations in background, lighting, distractors, object types, and spatial features. Through experiments involving over 10k real-world trials, we derive critical insights into Sim-to-Real transfer. To inform and advance future studies, we release both the robotic platforms and the evaluation protocol for public access to facilitate independent verification, thereby establishing a realistic and standardized benchmark for dexterous manipulation policies.
DecompGrind: A Decomposition Framework for Robotic Grinding via Cutting-Surface Planning and Contact-Force Adaptation
Robotic grinding is widely used for shaping workpieces in manufacturing, but it remains difficult to automate this process efficiently. In particular, efficiently grinding workpieces of different shapes and material hardness is challenging because removal resistance varies with local contact conditions. Moreover, it is difficult to achieve accurate estimation of removal resistance and analytical modeling of shape transition, and learning-based approaches often require large amounts of training data to cover diverse processing conditions. To address these challenges, we decompose robotic grinding into two components: removal-shape planning and contact-force adaptation. Based on this formulation, we propose DecompGrind, a framework that combines Global Cutting-Surface Planning (GCSP) and Local Contact-Force Adaptation (LCFA). GCSP determines removal shapes through geometric analysis of the current and target shapes without learning, while LCFA learns a contact-force adaptation policy using bilateral control-based imitation learning during the grinding of each removal shape. This decomposition restricts learning to local contact-force adaptation, allowing the policy to be learned from a small number of demonstrations, while handling global shape transition geometrically. Experiments using a robotic grinding system and 3D-printed workpieces demonstrate efficient robotic grinding of workpieces having different shapes and material hardness while maintaining safe levels of contact force.
comment: Under review
CATNAV: Cached Vision-Language Traversability for Efficient Zero-Shot Robot Navigation
Navigating unstructured environments requires assessing traversal risk relative to a robot's physical capabilities, a challenge that varies across embodiments. We present CATNAV, a cost-aware traversability navigation framework that leverages multimodal LLMs for zero-shot, embodiment-aware costmap generation without task-specific training. We introduce a visuosemantic caching mechanism that detects scene novelty and reuses prior risk assessments for semantically similar frames, reducing online VLM queries by 85.7%. Furthermore, we introduce a VLM-based trajectory selection module that evaluates proposals through visual reasoning to choose the safest path given behavioral constraints. We evaluate CATNAV on a quadruped robot across indoor and outdoor unstructured environments, comparing against state-of-the-art vision-language-action baselines. Across five navigation tasks, CATNAV achieves a 10-percentage-point higher average goal-reaching rate and 33% fewer behavioral constraint violations.
comment: 8 pages, 6 figures
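The caching idea in the CATNAV abstract (reuse a prior risk assessment when a new frame is semantically similar to one already seen) might look roughly like the sketch below. Everything here is an assumption made for illustration: the class name `RiskCache`, the cosine-similarity test, the 0.95 threshold, the toy two-dimensional embeddings, and the stand-in `fake_vlm` function are hypothetical and do not describe the paper's actual mechanism.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class RiskCache:
    """Hypothetical cache: only query the expensive model for novel frames."""

    def __init__(self, query_fn, threshold=0.95):
        self.query_fn = query_fn    # stand-in for an expensive VLM call
        self.threshold = threshold  # assumed similarity cut-off
        self.entries = []           # list of (embedding, risk) pairs
        self.queries = 0            # number of expensive calls made

    def assess(self, embedding):
        # Find the most similar cached frame, if any.
        best_risk, best_sim = None, -1.0
        for cached_embedding, risk in self.entries:
            sim = cosine(embedding, cached_embedding)
            if sim > best_sim:
                best_risk, best_sim = risk, sim
        if best_risk is not None and best_sim >= self.threshold:
            return best_risk  # cache hit: reuse the prior assessment
        # Novel scene: pay for a query and remember the answer.
        self.queries += 1
        risk = self.query_fn(embedding)
        self.entries.append((embedding, risk))
        return risk


def fake_vlm(embedding):
    """Toy stand-in for a VLM risk assessment (illustrative only)."""
    return "high" if embedding[0] > 0.5 else "low"


cache = RiskCache(fake_vlm)
stream = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
risks = [cache.assess(e) for e in stream]
print(risks, cache.queries)
```

In this toy stream, the second and fourth frames are near-duplicates of the first and third, so only two of the four frames trigger a query.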
PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding ICRA
Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Model (LMM) reasoning with a novel control paradigm. PhotoAgent first translates subjective aesthetic goals into solvable geometric constraints via LMM-driven, chain-of-thought (CoT) reasoning, allowing an analytical solver to compute a high-quality initial viewpoint. This initial pose is then iteratively refined through visual reflection within a photorealistic internal world model built with 3D Gaussian Splatting (3DGS). This "mental simulation" replaces costly and slow physical trial-and-error, enabling rapid convergence to aesthetically superior results. Evaluations confirm that PhotoAgent excels in spatial reasoning and achieves superior final image quality.
comment: Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026
Instrument-Splatting++: Towards Controllable Surgical Instrument Digital Twin Using Gaussian Splatting
High-quality and controllable digital twins of surgical instruments are critical for Real2Sim in robot-assisted surgery, as they enable realistic simulation, synthetic data generation, and perception learning under novel poses. We present Instrument-Splatting++, a monocular 3D Gaussian Splatting (3DGS) framework that reconstructs surgical instruments as a fully controllable Gaussian asset with high fidelity. Our pipeline starts with part-wise geometry pretraining that injects CAD priors into Gaussian primitives and equips the representation with part-aware semantic rendering. Built on the pretrained model, we propose a semantics-aware pose estimation and tracking (SAPET) method to recover per-frame 6-DoF pose and joint angles from unposed endoscopic videos, where a gripper-tip network trained purely from synthetic semantics provides robust supervision and a loose regularization suppresses singular articulations. Finally, we introduce Robust Texture Learning (RTL), which alternates pose refinement and robust appearance optimization, mitigating pose noise during texture learning. The proposed framework can perform pose estimation and learn realistic texture from unposed videos. We validate our method on sequences extracted from EndoVis17/18, SAR-RARP, and an in-house dataset, showing superior photometric quality and improved geometric accuracy over state-of-the-art baselines. We further demonstrate a downstream keypoint detection task where unseen-pose data augmentation from our controllable instrument Gaussian improves performance.
comment: 10 pages, 9 figures
DiSCo: Diffusion Sequence Copilots for Shared Autonomy
Shared autonomy combines human user and AI copilot actions to control complex systems such as robotic arms. When a task is challenging, requires high dimensional control, or is subject to corruption, shared autonomy can significantly increase task performance by using a trained copilot to effectively correct user actions in a manner consistent with the user's goals. To significantly improve the performance of shared autonomy, we introduce Diffusion Sequence Copilots (DiSCo): a method of shared autonomy with diffusion policy that plans action sequences consistent with past user actions. DiSCo seeds and inpaints the diffusion process with user-provided actions with hyperparameters to balance conformity to expert actions, alignment with user intent, and perceived responsiveness. We demonstrate that DiSCo substantially improves task performance in simulated driving and robotic arm tasks. Project website: https://sites.google.com/view/disco-shared-autonomy/
comment: 10 pages, 5 figures, HRI '26: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction
SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation
Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancement. Our method addresses the challenge of controlling a 13-dimensional action space involving coordinated base motion, arm articulation, and gripper actuation. To enrich spatial understanding, the model incorporates multi-view RGB observations, depth cues, and short temporal history, providing perspectives of both global scene structure and local manipulation context. To improve representation quality, we co-train auxiliary decoders that reconstruct interpretable intermediate signals - including global robot position, joint configurations, grasp affordances, target-object relative pose, and segmentation masks - from shared visual-language features. These objectives provide dense supervision that encourages the backbone to develop spatially grounded, manipulation-aware latent representations. Through extensive evaluation on home rearrangement tasks, our approach achieves consistent improvements across picking, placing, opening, and closing operations, substantially outperforming direct imitation learning. Our findings suggest that spatial grounding through auxiliary and multi-modal learning provides a strong direction for scaling VLA models toward general-purpose domestic robots.
Human vs. NAO: A Computational-Behavioral Framework for Quantifying Social Orienting in Autism and Typical Development
Responding to one's name is among the earliest-emerging social orienting behaviors and is one of the most prominent aspects in the detection of Autism Spectrum Disorder (ASD). Typically developing children exhibit near-reflexive orienting to their name, whereas children with ASD often demonstrate reduced frequency, increased latency, or atypical patterns of response. In this study, we quantify differential responsiveness to name-calling stimuli delivered by both human agents and NAO, a humanoid robot widely employed in socially assistive interventions for autism. The analysis focuses on multiple behavioral parameters, including eye contact, response latency, head and facial orientation shifts, and duration of sustained interest. Video-based computational methods were employed, incorporating face detection, eye region tracking, and spatio-temporal facial analysis, to obtain fine-grained measures of children's responses. By comparing neurotypical and neuroatypical groups under controlled human-robot conditions, this work aims to understand how the source and modality of social cues affect attentional dynamics in name-calling contexts. The findings advance both the theoretical understanding of social orienting deficits in autism and the applied development of robot-assisted assessment tools.
Fleet-Level Battery-Health-Aware Scheduling for Autonomous Mobile Robots
Autonomous mobile robot fleets must coordinate task allocation and charging under limited shared resources, yet most battery-aware planning methods address only a single robot. This paper extends degradation-cost-aware task planning to a multi-robot setting by jointly optimizing task assignment, service sequencing, optional charging decisions, charging-mode selection, and charger access while balancing degradation across the fleet. The formulation relies on reduced-form degradation proxies grounded in the empirical battery-aging literature, capturing both charging-mode-dependent wear and idle state-of-charge-dependent aging; the bilinear idle-aging term is linearized through a disaggregated piecewise McCormick formulation. Tight big-M values derived from instance data strengthen the LP relaxation. To manage scalability, we propose a hierarchical matheuristic in which a fleet-level master problem coordinates assignments, routes, and charger usage, while robot-level subproblems, whose integer part decomposes into trivially small independent partition-selection problems, compute route-conditioned degradation schedules. Systematic experiments compare the proposed method against three baselines: a rule-based nearest-available dispatcher, an energy-aware formulation that enforces battery feasibility without modeling degradation, and a charger-unaware formulation that accounts for degradation but ignores shared charger capacity limits.
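For readers unfamiliar with the McCormick relaxation mentioned in the scheduling abstract above, the sketch below evaluates the standard McCormick envelope of a bilinear term w = x*y numerically and shows how splitting the range of x into segments tightens it. This is an illustration only: the variable meanings (x as a state of charge in [0, 1], y as an idle duration in [0, 10]), the function names, and the uniform segmentation are assumptions, and the code evaluates bounds at a point rather than building the paper's mixed-integer formulation, where the active segment would be selected by binary variables.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Lower and upper McCormick bounds on w = x*y at the point (x, y)."""
    lower = max(x_lo * y + x * y_lo - x_lo * y_lo,
                x_hi * y + x * y_hi - x_hi * y_hi)
    upper = min(x_hi * y + x * y_lo - x_hi * y_lo,
                x_lo * y + x * y_hi - x_lo * y_hi)
    return lower, upper


def piecewise_mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi, n_segments):
    """Apply the envelope on the uniform x-segment that contains x."""
    width = (x_hi - x_lo) / n_segments
    idx = min(int((x - x_lo) / width), n_segments - 1)
    seg_lo = x_lo + idx * width
    seg_hi = seg_lo + width
    return mccormick_bounds(x, y, seg_lo, seg_hi, y_lo, y_hi)


# Hypothetical values: x = 0.6 (state of charge), y = 5.0 (idle duration).
x, y = 0.6, 5.0
single = mccormick_bounds(x, y, 0.0, 1.0, 0.0, 10.0)
piecewise = piecewise_mccormick_bounds(x, y, 0.0, 1.0, 0.0, 10.0, n_segments=4)
print("true product:", x * y)
print("single envelope:", single)
print("4-segment envelope:", piecewise)
```

Both envelopes contain the true product, and the segmented one is narrower, which is the usual motivation for a piecewise formulation.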
Learning Safe-Stoppability Monitors for Humanoid Robots
Emergency stop (E-stop) mechanisms are the de facto standard for robot safety. However, for humanoid robots, abruptly cutting power can itself cause catastrophic failures; instead, an emergency stop must execute a predefined fallback controller that preserves balance and drives the robot toward a minimum-risk condition. This raises a critical question: from which states can a humanoid robot safely execute such a stop? In this work, we formalize emergency stopping for humanoids as a policy-dependent safe-stoppability problem and use data-driven approaches to characterize the safe-stoppable envelope. We introduce PRISM (Proactive Refinement of Importance-sampled Stoppability Monitor), a simulation-driven framework that learns a neural predictor for state-level stoppability. PRISM iteratively refines the decision boundary using importance sampling, enabling targeted exploration of rare but safety-critical states. This targeted exploration significantly improves data efficiency while reducing false-safe predictions under a fixed simulation budget. We further demonstrate sim-to-real transfer by deploying the pretrained monitor on a real humanoid platform. Results show that modeling safety as policy-dependent stoppability enables proactive safety monitoring and supports scalable certification of fail-safe behaviors for humanoid robots.
comment: 8 pages, 5 figures
Variable-Resolution Virtual Maps for Autonomous Exploration with Unmanned Surface Vehicles (USVs)
Autonomous exploration by unmanned surface vehicles (USVs) in near-shore waters requires reliable localisation and consistent mapping over extended areas, but this is challenged by GNSS degradation, environment-induced localisation uncertainty, and limited on-board computation. Virtual map-based methods explicitly model localisation and mapping uncertainty by tightly coupling factor-graph SLAM with a map uncertainty criterion. However, their storage and computational costs scale poorly with fixed-resolution workspace discretisations, leading to inefficiency in large near-shore environments. Moreover, overvaluing feature-sparse open-water regions can increase the risk of SLAM failure as a result of imbalance between exploration and exploitation. To address these limitations, we propose a Variable-Resolution Virtual Map (VRVM), a computationally efficient method for representing map uncertainty using bivariate Gaussian virtual landmarks placed in the cells of an adaptive quadtree. The adaptive quadtree enables an area-weighted uncertainty representation that keeps coarse, far-field virtual landmarks deliberately uncertain while allocating higher resolution to information-dense regions, and reduces the sensitivity of the map valuation to local refinements of the tree. An expectation-maximisation (EM) planner is adopted to evaluate pose and map uncertainty along frontiers using the VRVM, balancing exploration and exploitation. We evaluate VRVM against several state-of-the-art exploration algorithms in the VRX Gazebo simulator, using a realistic marina environment across different testing scenarios with an increasing level of exploration difficulty. The results indicate that our method offers safer behaviour and better utilisation of on-board computation in GNSS-degraded near-shore environments.
VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.
comment: https://plan-lab.github.io/projects/vtam/
Planning over MAPF Agent Dependencies via Multi-Dependency PIBT
Modern Multi-Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands of agents in congested environments within a second, requiring highly efficient algorithms. Priority Inheritance with Backtracking (PIBT) is a popular algorithm capable of effectively planning in such situations. However, PIBT is constrained by its rule-based planning procedure and lacks generality because it restricts its search to paths that conflict with at most one other agent. This limitation also applies to Enhanced PIBT (EPIBT), a recent extension of PIBT. In this paper, we describe a new perspective on solving MAPF by planning over agent dependencies. Taking inspiration from PIBT's priority inheritance logic, we define the concept of agent dependencies and propose Multi-Dependency PIBT (MD-PIBT) that searches over agent dependencies. MD-PIBT is a general framework where specific parameterizations can reproduce PIBT and EPIBT. At the same time, alternative configurations yield novel planning strategies that are not expressible by PIBT or EPIBT. Our experiments demonstrate that MD-PIBT effectively plans for as many as 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits. We perform thorough evaluations on different variants of MAPF and find that MD-PIBT is particularly effective in MAPF with large agents.
Rectify, Don't Regret: Avoiding Pitfalls of Differentiable Simulation in Trajectory Prediction
Current open-loop trajectory models struggle in real-world autonomous driving because minor initial deviations often cascade into compounding errors, pushing the agent into out-of-distribution states. While fully differentiable closed-loop simulators attempt to address this, they suffer from shortcut learning: the loss gradients flow backward through induced state inputs, inadvertently leaking future ground truth information directly into the model's own previous predictions. The model exploits these signals to artificially avoid drift, non-causally "regretting" past mistakes rather than learning genuinely reactive recovery. To address this, we introduce a detached receding horizon rollout. By explicitly severing the computation graph between simulation steps, the model learns genuine recovery behaviors from drifted states, forcing it to "rectify" mistakes rather than non-causally optimizing past predictions. Extensive evaluations on the nuScenes and DeepScenario datasets show our approach yields more robust recovery strategies, reducing target collisions by up to 33.24% compared to fully differentiable closed-loop training at high replanning frequencies. Furthermore, compared to standard open-loop baselines, our non-differentiable framework decreases collisions by up to 27.74% in dense environments while simultaneously improving multi-modal prediction diversity and lane alignment.
SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM
High-quality articulated 3D assets are indispensable for embodied AI and physical simulation, yet 3D generation still focuses on static meshes, leaving a gap in "sim-ready" interactive objects. Most recent articulated object creation methods rely on multi-stage pipelines that accumulate errors across decoupled modules. Alternatively, unified MLLMs offer a single-stage path to joint static asset understanding and sim-ready asset generation. However, dense voxel-based 3D tokenization yields long 3D token sequences and high memory overhead, limiting scalability to complex articulated objects. To address this, we propose SIMART, a unified MLLM framework that jointly performs part-level decomposition and kinematic prediction. By introducing a Sparse 3D VQ-VAE, SIMART reduces token counts by 70% vs. dense voxel tokens, enabling high-fidelity multi-part assemblies. SIMART achieves state-of-the-art performance on PartNet-Mobility and in-the-wild AIGC datasets, and enables physics-based robotic simulation.
ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment
Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations - such as object penetration and anti-gravity motion - due to training on generic visual data and likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer model that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotation, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations. It employs a decoupled protocol to separately assess physical realism and action alignment. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
PinPoint: Monocular Needle Pose Estimation for Robotic Suturing via Stein Variational Newton and Geometric Residuals
Reliable estimation of surgical needle 3D position and orientation is essential for autonomous robotic suturing, yet existing methods operate almost exclusively under stereoscopic vision. In monocular endoscopic settings, common in transendoscopic and intraluminal procedures, depth ambiguity and rotational symmetry render needle pose estimation inherently ill-posed, producing a multimodal distribution over feasible configurations, rather than a single, well-grounded estimate. We present PinPoint, a probabilistic variational inference framework that treats this ambiguity directly, maintaining a distribution of pose hypotheses rather than suppressing it. PinPoint combines monocular image observations with robot-grasp constraints through analytical geometric likelihoods with closed-form Jacobians. This framework enables efficient Gauss-Newton preconditioning in a Stein Variational Newton inference, where second-order particle transport deterministically moves particles toward high-probability regions while kernel-based repulsion preserves diversity in the multimodal structure. On real needle-tracking sequences, PinPoint reduces mean translational error by 80% (down to 1.00 mm) and rotational error by 78% (down to 13.80°) relative to a particle-filter baseline, with substantially better-calibrated uncertainty. On induced-rotation sequences, where monocular ambiguity is most severe, PinPoint maintains a bimodal posterior 84% of the time, almost three times the rate of the particle filter baseline, correctly preserving the alternative hypothesis rather than committing prematurely to one mode. Suturing experiments in ex vivo tissue demonstrate stable tracking through intermittent occlusion, with average errors during occlusion of 1.34 mm in translation and 19.18° in rotation, even when the needle is fully embedded.
comment: 15 pages, 7 Figures
Edge Radar Material Classification Under Geometry Shifts
Material awareness can improve robotic navigation and interaction, particularly in conditions where cameras and LiDAR degrade. We present a lightweight mmWave radar material classification pipeline designed for ultra-low-power edge devices (TI IWRL6432), using compact range-bin intensity descriptors and a Multilayer Perceptron (MLP) for real-time inference. While the classifier reaches a macro-F1 of 94.2% under the nominal training geometry, we observe a pronounced performance drop under realistic geometry shifts, including sensor height changes and small tilt angles. These perturbations induce systematic intensity scaling and angle-dependent radar cross section (RCS) effects, pushing features out of distribution and reducing macro-F1 to around 68.5%. We analyze these failure modes and outline practical directions for improving robustness with normalization, geometry augmentation, and motion-aware features.
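The macro-F1 metric quoted in the radar abstract above is the unweighted mean of per-class F1 scores, so every material class counts equally regardless of how many samples it has. A minimal sketch of the computation follows; the material labels and predictions are made up for illustration and are not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that appear."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
y_true = ["wood", "wood", "metal", "metal", "glass", "glass"]
y_pred = ["wood", "wood", "metal", "glass", "glass", "glass"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here the per-class F1 values are 1.0 (wood), 2/3 (metal), and 0.8 (glass), and the macro-F1 is their plain average.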
Strain-Parameterized Coupled Dynamics and Dual-Camera Visual Servoing for Aerial Continuum Manipulators
Tendon-driven aerial continuum manipulators (TD-ACMs) combine the maneuverability of uncrewed aerial vehicles (UAVs) with the compliance of lightweight continuum robots (CRs). Existing coupled dynamic modeling approaches for TD-ACMs incur high computational costs and do not explicitly account for aerial platform underactuation. To address these limitations, this paper presents a generalized dynamic formulation of a coupled TD-ACM with an underactuated base. The proposed approach integrates a strain-parameterized Cosserat rod model with a rigid-body model of the UAV into a unified Lagrangian ordinary differential equation (ODE) framework on $\mathrm{SE}(3)$, thereby eliminating computationally intensive symbolic derivations. Building upon the developed model, a robust dual-camera image-based visual servoing (IBVS) scheme is introduced. The proposed controller mitigates the field-of-view (FoV) limitations of conventional IBVS, compensates for attitude-induced image motion caused by UAV lateral dynamics, and incorporates a low-level adaptive controller to address modeling uncertainties with formal stability guarantees. Extensive simulations and experimental validation on a compact custom-built prototype demonstrate the effectiveness and robustness of the proposed framework in real-world scenarios.
Learning Multi-Agent Local Collision-Avoidance for Collaborative Carrying tasks with Coupled Quadrupedal Robots
Robotic collaborative carrying could greatly benefit human activities like warehouse and construction site management. However, coordinating the simultaneous motion of multiple robots represents a significant challenge. Existing works primarily focus on obstacle-free environments, making them unsuitable for most real-world applications. Works that account for obstacles either overfit to a specific terrain configuration or rely on pre-recorded maps combined with path planners to compute collision-free trajectories. This work focuses on two quadrupedal robots mechanically connected to a carried object. We propose a Reinforcement Learning (RL)-based policy that enables tracking a commanded velocity direction while avoiding collisions with nearby obstacles using only onboard sensing, eliminating the need for precomputed trajectories and complete map knowledge. Our work presents a hierarchical architecture, where a perceptive high-level object-centric policy commands two pretrained locomotion policies. Additionally, we employ a game-inspired curriculum to increase the complexity of obstacles in the terrain progressively. We validate our approach on two quadrupedal robots connected to a bar via spherical joints, benchmarking it against optimization-based and decentralized RL baselines. Our hardware experiments demonstrate the ability of our system to locomote in unknown environments without the need for a map or a path planner. The video of our work is available in the multimedia material.
A Multimodal Framework for Human-Multi-Agent Interaction
Human-robot interaction is increasingly moving toward multi-robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision-making in a unified framework. This limits natural and scalable interaction in shared physical spaces. We address this gap by introducing a multimodal framework for human-multi-agent interaction in which each robot operates as an autonomous cognitive agent with integrated multimodal perception and Large Language Model (LLM)-driven planning grounded in embodiment. At the team level, a centralized coordination mechanism regulates turn-taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi-agent interaction through interaction policies that combine speech, gesture, gaze, and locomotion. Representative interaction runs demonstrate coordinated multimodal reasoning across agents and grounded embodied responses. Future work will focus on larger-scale user studies and deeper exploration of socially grounded multi-agent interaction dynamics.
comment: 4 pages, 3 figures. Accepted at ACM/IEEE HRI 2026 Workshop (MAgicS-HRI)
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics for Robot Manipulation CVPR 2026
While existing equivariant methods enhance data efficiency, they suffer from high computational intensity, reliance on single-modality inputs, and instability when combined with fast-sampling methods. In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. E3Flow overcomes these challenges, successfully unifying efficient rectified flow with stable, multi-modal equivariant learning for the first time. Our framework is built upon spherical harmonic representations to ensure rigorous SO(3) equivariance. We introduce a novel invariant Feature Enhancement Module (FEM) that dynamically fuses hybrid visual modalities (point clouds and images), injecting rich visual cues into the spherical harmonic features. We evaluate E3Flow on 8 manipulation tasks from MimicGen and further conduct 4 real-world experiments to validate its effectiveness in physical environments. Simulation results show that E3Flow achieves a 3.12% improvement in average success rate over the state-of-the-art Spherical Diffusion Policy (SDP) while simultaneously delivering a 7x inference speedup. E3Flow thus demonstrates a new and highly effective trade-off between performance, efficiency, and data efficiency for robotic policy learning. Code: https://github.com/zql-kk/E3Flow.
comment: Accepted by CVPR 2026
AeroScene: Progressive Scene Synthesis for Aerial Robotics
Although generative models have shown substantial impact across multiple domains, their potential for scene synthesis remains underexplored in robotics. This gap is more evident in drone simulators, where simulation environments still rely heavily on manual effort and are time-consuming to create and difficult to scale. In this work, we introduce AeroScene, a hierarchical diffusion model for progressive 3D scene synthesis. Our approach leverages hierarchy-aware tokenization and multi-branch feature extraction to reason across both global layouts and local details, ensuring physical plausibility and semantic consistency. This makes AeroScene particularly suited for generating realistic scenes for aerial robotics tasks such as navigation, landing, and perching. We demonstrate its effectiveness through extensive experiments on our newly collected dataset and a public benchmark, showing that AeroScene significantly outperforms prior methods. Furthermore, we use AeroScene to generate a large-scale dataset of over 1,000 physics-ready, high-fidelity 3D scenes that can be directly integrated into NVIDIA Isaac Sim. Finally, we illustrate the utility of these generated environments on downstream drone navigation tasks. Our code and dataset are publicly available at aioz-ai.github.io/AeroScene/
Path Planning and Reinforcement Learning-Driven Control of On-Orbit Free-Flying Multi-Arm Robots
This paper presents a hybrid approach that integrates trajectory optimization (TO) and reinforcement learning (RL) for motion planning and control of free-flying multi-arm robots in on-orbit servicing scenarios. The proposed system integrates TO for generating feasible, efficient paths while accounting for dynamic and kinematic constraints, and RL for adaptive trajectory tracking under uncertainties. The multi-arm robot design, equipped with thrusters for precise body control, enables redundancy and stability in complex space operations. TO optimizes arm motions and thruster forces, reducing reliance on the arms for stabilization and enhancing maneuverability. RL further refines this by leveraging model-free control to adapt to dynamic interactions and disturbances. The experimental results validated through comprehensive simulations demonstrate the effectiveness and robustness of the proposed hybrid approach. Two case studies are explored: surface motion with initial contact and a free-floating scenario requiring surface approximation. In both cases, the hybrid method outperforms traditional strategies. In particular, the thrusters notably enhance motion smoothness, safety, and operational efficiency. The RL policy effectively tracks TO-generated trajectories, handling high-dimensional action spaces and dynamic mismatches. This integration of TO and RL combines the strengths of precise, task-specific planning with robust adaptability, ensuring high performance in the uncertain and dynamic conditions characteristic of space environments. By addressing challenges such as motion coupling, environmental disturbances, and dynamic control requirements, this framework establishes a strong foundation for advancing the autonomy and effectiveness of space robotic systems.
comment: Accepted for publication in The International Journal of Robotics Research (23-Mar-2026)
Human-in-the-Loop Pareto Optimization: Trade-off Characterization for Assist-as-Needed Training and Performance Evaluation
During human motor skill training and physical rehabilitation, there is an inherent trade-off between task difficulty and user performance. Characterizing this trade-off is crucial for evaluating user performance, designing assist-as-needed (AAN) protocols, and assessing the efficacy of training protocols. In this study, we propose a novel human-in-the-loop (HiL) Pareto optimization approach to characterize the trade-off between task performance and the perceived challenge level of motor learning or rehabilitation tasks. We adapt Bayesian multi-criteria optimization to systematically and efficiently perform HiL Pareto characterizations. Our HiL optimization employs a hybrid model that measures performance with a quantitative metric, while the perceived challenge level is captured with a qualitative metric. We demonstrate the feasibility of the proposed HiL Pareto characterization through a user study. Furthermore, we present the utility of the framework through three use cases in the context of a manual skill training task with haptic feedback. First, we demonstrate how the characterized trade-off can be used to design a sample AAN training protocol for a motor learning task and to evaluate the group-level efficacy of the proposed AAN protocol relative to a baseline adaptive assistance protocol. Second, we demonstrate that individual-level comparisons of the trade-offs characterized before and after the training session enable fair evaluation of training progress under different assistance levels. This evaluation method is more general than standard performance evaluations, as it can provide insights even when users cannot perform the task without assistance. Third, we show that the characterized trade-offs also enable fair performance comparisons among different users, as they capture the best possible performance of each user under all feasible assistance levels.
comment: Under review for publication in IEEE Transactions on Haptics
Task-Space Singularity Avoidance for Control Affine Systems Using Control Barrier Functions
Singularities in robotic and dynamical systems arise when the mapping from control inputs to task-space motion loses rank, leading to an inability to determine inputs. This limits the system's ability to generate forces and torques in desired directions and prevents accurate trajectory tracking. This paper presents a control barrier function (CBF) framework for avoiding such singularities in control-affine systems. Singular configurations are identified through the eigenvalues of a state-dependent input-output mapping matrix, and barrier functions are constructed to maintain a safety margin from rank-deficient regions. Conditions for theoretical guarantees on safety are provided as a function of actuator dynamics. Simulations on a planar 2-link manipulator and a magnetically actuated needle demonstrate smooth trajectory tracking while avoiding singular configurations and reducing control input spikes by up to 100x compared to the nominal controller.
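As a rough illustration of the eigenvalue-based barrier idea described above, the following sketch evaluates a candidate barrier value for a planar 2-link arm. It is a minimal sketch under stated assumptions, not the paper's construction: the choice of J J^T as the mapping matrix, the margin `eps`, the unit link lengths, and all function names are hypothetical and introduced only for illustration.

```python
import math

def jacobian_2link(q1, q2, l1=1.0, l2=1.0):
    """End-effector Jacobian of a planar 2-link arm (standard textbook form)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def min_eig_jjt(J):
    """Smallest eigenvalue of the symmetric 2x2 matrix J J^T (closed form)."""
    a = J[0][0] ** 2 + J[0][1] ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2
    return 0.5 * (a + d) - math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)

def barrier_value(q1, q2, eps=0.01):
    """Assumed barrier h(q) = lambda_min(J J^T) - eps; h >= 0 means 'safe'."""
    return min_eig_jjt(jacobian_2link(q1, q2)) - eps

# Elbow bent at 90 degrees: well-conditioned, h > 0.
h_bent = barrier_value(0.0, math.pi / 2)
# Fully stretched arm (q2 = 0): the Jacobian loses rank, h < 0.
h_straight = barrier_value(0.3, 0.0)
```

A controller built on such a barrier would restrict inputs so that h stays non-negative; that constraint step is omitted here because the abstract does not specify it.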
Form-Fitting, Large-Area Sensor Mounting for Obstacle Detection
We introduce a low-cost method for mounting sensors onto robot links for large-area sensing coverage that does not require the sensors' positions or orientations to be calibrated before use. Using computer-aided design (CAD), a robot skin covering, or skin unit, can be procedurally generated to fit around a nondevelopable surface, a 3D surface that cannot be flattened into a 2D plane without distortion, of a robot. The skin unit embeds mounts for printed circuit boards of any size to keep sensors in fixed and known locations. We demonstrate our method by constructing point cloud images of obstacles within the proximity of a Franka Research 3 robot's operational environment using an array of time-of-flight (ToF) imagers mounted on a printed skin unit and attached to the robot arm.
comment: Accepted at 2025 Humanoids Workshop on Advances in Contact-Rich Robotics: Rich Tactile-Based Physical Interaction [ConRich]
ROSCell: A ROS2-Based Framework for Automated Formation and Orchestration of Multi-Robot Systems
Modern manufacturing under High-Mix-Low-Volume requirements increasingly relies on flexible and adaptive matrix production systems, which depend on interconnected heterogeneous devices and rapid task reconfiguration. To address these needs, we present ROSCell, a ROS2-based framework that enables the flexible formation and management of a computing continuum across various devices. ROSCell allows users to package existing robotic software as deployable skills and, with simple requests, assemble isolated cells, automatically deploy skill instances, and coordinate their communication to meet task objectives. It provides a scalable and low-overhead foundation for adaptive multi-robot computing in dynamic production environments. Experimental results show that, in the idle state, ROSCell substantially reduces CPU, memory, and network overhead compared to K3s-based solutions on edge devices, highlighting its energy efficiency and cost-effectiveness for large-scale deployment in production settings. The source code, examples, and documentation will be provided on GitHub.
Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting
Agriculture remains a cornerstone of global health and economic sustainability, yet labor-intensive tasks such as harvesting high-value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception-to-action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, leading to unnecessary computation and delayed decision-making. Our approach combines RGB-D perception with active learning to directly learn reachability as a binary decision problem. We then leverage active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments demonstrate that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, yielding approximately 6--8% higher accuracy than random sampling and enabling label-efficient adaptation to new orchard configurations. Among the evaluated strategies, entropy- and margin-based sampling outperform Query-by-Committee and standard uncertainty sampling in low-label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task-level perception in agricultural robotics and position our approach as a scalable alternative to computation-heavy kinematic reachability analysis. Our code is available through https://github.com/wsu-cyber-security-lab-ai/active-learning.
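The entropy- and margin-based query strategies named above can be sketched as follows. This is a minimal sketch assuming a classifier that outputs per-sample class probabilities; the function names, the toy probability pool, and the budget are hypothetical and are not taken from the paper's code.

```python
import math

def entropy_score(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin_score(probs):
    """Gap between the two most likely classes (lower = more uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def select_queries(pool_probs, budget, strategy="entropy"):
    """Return indices of the `budget` most informative unlabeled samples."""
    if strategy == "entropy":
        keyed = [(-entropy_score(p), i) for i, p in enumerate(pool_probs)]
    elif strategy == "margin":
        keyed = [(margin_score(p), i) for i, p in enumerate(pool_probs)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [i for _, i in sorted(keyed)[:budget]]

# Toy pool of [P(unreachable), P(reachable)] predictions (illustrative values only).
pool = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30]]
picked = select_queries(pool, budget=2, strategy="entropy")
```

In an active-learning loop, the selected indices would be sent for reachability labeling, added to the training set, and the classifier retrained before the next query round.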
Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement
We study long-horizon planning in 3D environments from under-specified natural-language goals using only visual observations, focusing on multi-step 3D box rearrangement tasks. Existing approaches typically rely on symbolic planners with brittle relational grounding of states and goals, or on direct action-sequence generation from 2D vision-language models (VLMs). Both approaches struggle with reasoning over many objects, rich 3D geometry, and implicit semantic constraints. Recent advances in 3D VLMs demonstrate strong grounding of natural-language referents to 3D segmentation masks, suggesting the potential for more general planning capabilities. We extend existing 3D grounding models and propose Reactive Action Mask Planner (RAMP-3D), which formulates long-horizon planning as sequential reactive prediction of paired 3D masks: a "which-object" mask indicating what to pick and a "which-target-region" mask specifying where to place it. The resulting system processes RGB-D observations and natural-language task specifications to reactively generate multi-step pick-and-place actions for 3D box rearrangement. We conduct experiments across 11 task variants in warehouse-style environments with 1-30 boxes and diverse natural-language constraints. RAMP-3D achieves 79.5% success rate on long-horizon rearrangement tasks and significantly outperforms 2D VLM-based baselines, establishing mask-based reactive policies as a promising alternative to symbolic pipelines for long-horizon planning.
Bio-Inspired Event-Based Visual Servoing for Ground Robots
Biological sensory systems are inherently adaptive, filtering out constant stimuli and prioritizing relative changes, likely enhancing computational and metabolic efficiency. Inspired by active sensing behaviors across a wide range of animals, this paper presents a novel event-based visual servoing framework for ground robots. Utilizing a Dynamic Vision Sensor (DVS), we demonstrate that by applying a fixed spatial kernel to the asynchronous event stream generated from structured logarithmic intensity-change patterns, the resulting net event flux analytically isolates specific kinematic states. We establish a generalized theoretical bound for this event rate estimator and show that linear and quadratic spatial profiles isolate the robot's velocity and position-velocity product, respectively. Leveraging these properties, we employ a multi-pattern stimulus to directly synthesize a nonlinear state-feedback term entirely without traditional state estimation. To overcome the inescapable loss of linear observability at equilibrium inherent in event sensing, we propose a bio-inspired active sensing limit-cycle controller. Experimental validation on a 1/10-scale autonomous ground vehicle confirms the efficacy, extremely low latency, and computational efficiency of the proposed direct-sensing approach.
Quadrature Oscillation System for Coordinated Motion in Crawling Origami Robot ICRA 2026
Origami-inspired robots offer rapid, accessible design and manufacture with diverse functionalities. In particular, origami robots without conventional electronics have the unique advantage of functioning in extreme environments such as ones with high radiation or large magnetic fields. However, the absence of sophisticated control systems limits these robots to simple autonomous behaviors. In our previous studies, we developed a printable, electronics-free, and self-sustained oscillator that generates simple complementary square-wave signals. Our study presents a quadrature oscillation system capable of generating four square-wave signals a quarter-cycle out of phase, enabling four distinct states. Such control signals are important in various engineering and robotics applications, such as orchestrating limb movements in bio-inspired robots. We demonstrate the practicality and value of this oscillation system by designing and constructing an origami crawling robot that utilizes the quadrature oscillator to achieve coordinated locomotion. Together, the oscillator and robot illustrate the potential for more complex control and functions in origami robotics, paving the way for more electronics-free, rapid-design origami robots with advanced autonomous behaviors.
comment: 8 pages, 11 figures, Accepted to ICRA 2026
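To make the "four square-wave signals a quarter-cycle out of phase" idea concrete, the sketch below models the signals in software and lists the resulting states. It is only an idealized illustration: the 50% duty cycle, the normalized phase variable, and the function names are assumptions, and the actual oscillator is a printed, electronics-free mechanism rather than code.

```python
def square_wave(phase, offset, duty=0.5):
    """Ideal square wave; `phase` and `offset` are in cycles (1.0 = one period)."""
    return 1 if (phase - offset) % 1.0 < duty else 0

def quadrature_state(phase):
    """Values of four signals offset by 0, 1/4, 1/2, and 3/4 of a cycle."""
    return tuple(square_wave(phase, k / 4) for k in range(4))

# Sample the middle of each quarter-cycle to list the distinct states.
states = [quadrature_state(p) for p in (0.125, 0.375, 0.625, 0.875)]
# states == [(1, 0, 0, 1), (1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1)]
```

Under these assumptions each quarter-cycle yields a different on/off pattern, which is the kind of four-state sequence that could drive a repeating gait.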
Engagement-Zone-Aware Input-Constrained Guidance for Safe Target Interception in Contested Environments
We address target interception in contested environments in the presence of multiple defenders whose interception capability is limited by finite ranges. Conventional methods typically impose conservative stand-off constraints based on maximum engagement distance and neglect the interceptors' actuator limitations. Instead, we formulate safety constraints using defender-induced engagement zones. To account for actuator limits, the vehicle model is augmented with input saturation dynamics. A time-varying safe-set tightening parameter is introduced to compensate for transient constraint violations induced by actuator dynamics. To ensure scalable safety enforcement in multi-defender scenarios, a smooth aggregate safety function is constructed using a log-sum-exp operator combining individual threat measures associated with each defender's capability. A smooth switching guidance strategy is then developed to coordinate interception and safety objectives. The attacker pursues the target when sufficiently distant from threat boundaries and progressively activates evasive motion as the EZ boundaries are approached. The resulting controller relies only on relative measurements and does not require knowledge of defender control inputs, thus facilitating a fully distributed and scalable implementation. Rigorous analysis provides sufficient conditions guaranteeing target interception, practical safety with respect to all defender engagement zones, and satisfaction of actuator bounds. An input-constrained guidance law based on conservative stand-off distance is also developed to quantify the conservatism of maximum-range-based safety formulations. Simulations with stationary and maneuvering defenders demonstrate that the proposed formulation yields shorter interception paths and reduced interception time compared with conventional methods while maintaining safety throughout the engagement.
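The log-sum-exp aggregation mentioned above can be illustrated with a standard smooth-minimum operator. This is a minimal sketch assuming each defender i contributes a scalar safety value h_i that is positive when the vehicle is outside that defender's engagement zone; the sharpness parameter, the sample values, and the function name are hypothetical, and the paper's exact aggregate function may differ.

```python
import math

def smooth_min(values, sharpness=10.0):
    """Soft minimum: -(1/k) * log(sum_i exp(-k * v_i)), computed stably."""
    k = sharpness
    m = min(values)
    # Shifting by the true minimum keeps every exponent <= 0, avoiding overflow.
    return m - math.log(sum(math.exp(-k * (v - m)) for v in values)) / k

# Illustrative per-defender safety values (not from the paper).
h_values = [1.0, 2.0, 3.0]
h_agg = smooth_min(h_values)
```

The result is a differentiable under-approximation of the true minimum: it never exceeds min(h_i) and is at most log(n)/k below it, so keeping the aggregate non-negative keeps every individual h_i non-negative as well.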
LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset
In real-world domains such as self-driving, generalization to rare scenarios remains a fundamental challenge. To address this, we introduce a new dataset designed for end-to-end driving that focuses on long-tail driving events. We provide multi-view video data, trajectories, high-level instructions, and detailed reasoning traces, facilitating in-context learning and few-shot generalization. The resulting benchmark for multimodal models, such as VLMs and VLAs, goes beyond safety and comfort metrics by evaluating instruction following and semantic coherence between model outputs. The multilingual reasoning traces in English, Spanish, and Chinese are from domain experts with diverse cultural backgrounds. Thus, our dataset is a unique resource for studying how different forms of reasoning affect driving competence. Our dataset is available at: https://hf.co/datasets/kit-mrt/kitscenes-longtail
comment: 21 pages
Tightly-Coupled Radar-Visual-Inertial Odometry
Visual-Inertial Odometry (VIO) is a staple for reliable state estimation on constrained and lightweight platforms due to its versatility and demonstrated performance. However, pertinent challenges regarding robust operation in dark, low-texture, obscured environments complicate the use of such methods. Alternatively, Frequency Modulated Continuous Wave (FMCW) radars, and by extension Radar-Inertial Odometry (RIO), offer robustness to these visual challenges, albeit at the cost of reduced information density and worse long-term accuracy. To address these limitations, this work combines the two in a tightly coupled manner, enabling the resulting method to operate robustly regardless of environmental conditions or trajectory dynamics. The proposed method fuses image features, radar Doppler measurements, and Inertial Measurement Unit (IMU) measurements within an Iterated Extended Kalman Filter (IEKF) in real-time, with radar range data augmenting the visual feature depth initialization. The method is evaluated through flight experiments conducted in both indoor and outdoor environments, as well as through challenges to both exteroceptive modalities (such as darkness, fog, or fast flight), thoroughly demonstrating its robustness. The implementation of the proposed method is available at: https://github.com/ntnu-arl/radvio
comment: 8 pages, 9 figures, Accepted to the 2026 European Control Conference (ECC)
Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms: Challenges and a Roadmap
This article proposes a roadmap to address the current challenges in small-scale testbeds for Connected and Automated Vehicles (CAVs) and robot swarms. The roadmap is a joint effort of participants in the workshop "1st Workshop on Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms," held on June 2 at the IEEE Intelligent Vehicles Symposium (IV) 2024 in Jeju, South Korea. The roadmap contains three parts: 1) enhancing accessibility and diversity, especially for underrepresented communities, 2) sharing best practices for the development and maintenance of testbeds, and 3) connecting testbeds through an abstraction layer to support collaboration. The workshop features eight invited speakers, four contributed papers [1]-[4], and a presentation of a survey paper on testbeds [5]. The survey paper provides an online comparative table of more than 25 testbeds, available at https://bassamlab.github.io/testbeds-survey. The workshop's own website is available at https://cpm-remote.lrt.unibw-muenchen.de/iv24-workshop.
comment: Published version
Scalable Screw-Theoretic Synthesis for PDE-Based Dynamic Modeling of Multibody Flexible Manipulators
This paper presents a novel and scalable screw-theoretic multibody synthesis framework for PDE-based dynamic modeling of serial robotic manipulators with an arbitrary number of flexible links in three-dimensional space. The proposed approach systematically constructs screw-theoretic PDE models for individual flexible links and rigorously enforces holonomic joint constraints through interaction forces. The dynamics of each link are formulated using a set of dual screws expressed in body-fixed coordinates: one describing the motion of the body-fixed frame relative to the inertial frame, a second relating the body-fixed frame to the undeformed configuration, and a third capturing elastic deformations. By expressing the system energy and applying variational principles, the governing dynamics of each link were previously derived in a unified manner. Synthesizing the individual link models yields an infinitely scalable multibody representation capable of capturing both local (subsystem-level) and global (system-level) dynamics. The framework explicitly recovers all dynamic states, including the motion of each body-fixed frame and the distributed deformation fields of the flexible links. For computational tractability and mathematical rigor, the resulting governing equations are formulated as a semi-explicit index-1 differential-algebraic system. Furthermore, by applying separation of variables, the PDE model is recast as an abstract Cauchy problem, and well-posedness of the resulting system is established.
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment CVPR 2026
We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios by a large margin. The project is available at https://nudt-sawlab.github.io/LoD-Locv3/.
comment: Accepted to CVPR 2026
Integrated cooperative localization of heterogeneous measurement swarm: A unified data-driven method
The cooperative localization (CL) problem in heterogeneous robotic systems with different measurement capabilities is investigated in this work. In practice, heterogeneous sensors lead to directed and sparse measurement topologies, whereas most existing CL approaches rely on multilateral localization with restrictive multi-neighbor geometric requirements. To overcome this limitation, we enable pairwise relative localization (RL) between neighboring robots using only mutual measurement and odometry information. A unified data-driven adaptive RL estimator is first developed to handle heterogeneous and unidirectional measurements. Based on the convergent RL estimates, a distributed pose-coupling CL strategy is then designed, which guarantees CL under a weakly connected directed measurement topology, representing the least restrictive condition among existing results. The proposed method is independent of specific control tasks and is validated through a formation control application and real-world experiments.
VL-KnG: Persistent Spatiotemporal Knowledge Graphs from Egocentric Video for Embodied Scene Understanding
Vision-language models (VLMs) demonstrate strong image-level scene understanding but often lack persistent memory, explicit spatial representations, and computational efficiency when reasoning over long video sequences. We present VL-KnG, a training-free framework that constructs spatiotemporal knowledge graphs from monocular video, bridging fine-grained scene graphs and global topological graphs without 3D reconstruction. VL-KnG processes video in chunks, maintains persistent object identity via LLM-based Spatiotemporal Object Association (STOA), and answers queries via Graph-Enhanced Retrieval (GER), a hybrid of GraphRAG subgraph retrieval and SigLIP2 visual grounding. Once built, the knowledge graph eliminates the need to re-process video at query time, enabling constant-time inference regardless of video length. Evaluation across three benchmarks, OpenEQA, NaVQA, and WalkieKnowledge (our newly introduced benchmark), shows that VL-KnG matches or surpasses frontier VLMs on embodied scene understanding tasks at significantly lower query latency, with explainable, graph-grounded reasoning. Real-world robot deployment confirms practical applicability with constant-time scaling.
Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Zero-shot object navigation (ZSON) requires robots to locate target objects in unseen environments without task-specific fine-tuning or pre-built maps, a capability crucial for service and household robotics. Existing methods perform well in simulation but struggle in realistic, cluttered environments where heavy occlusions and latent hazards make large portions of the scene unobserved. These approaches typically act on a single inferred scene, making them prone to overcommitment and unsafe behavior under uncertainty. To address these challenges, we propose Schrödinger's Navigator, a belief-aware framework that explicitly reasons over multiple trajectory-conditioned imagined 3D futures at inference time. A trajectory-conditioned 3D world model generates hypothetical observations along candidate paths, maintaining a superposition of plausible scene realizations. An adaptive, occluder-aware trajectory sampling strategy focuses imagination on uncertain regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures to guide robust, proactive action selection. Evaluations in simulation and on a physical Go2 quadruped robot demonstrate that Schrödinger's Navigator outperforms strong ZSON baselines, achieving more robust self-localization, object localization, and safe navigation under severe occlusions and latent hazards. These results highlight the effectiveness of reasoning over imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.
Insect-Scale Tailless Robot with Flapping Wings: A Simple Structure and Drive for Yaw Control
Insect-scale micro-aerial vehicles, especially lightweight, flapping-wing robots, are becoming increasingly important for safe motion sensing in spatially constrained environments such as living spaces. However, yaw control using flapping wings is fundamentally more difficult than using rotating wings. In this study, an insect-scale, tailless robot with four paired tilted flapping wings (weighing 1.52 g) was fabricated to enable simultaneous control of four states, including yaw angle. The controllability Gramian was derived to quantify the controllability of the fabricated configuration and to evaluate the effects of the tilted-wing geometry on other control axes. This robot benefits from the simplicity of directly driven piezoelectric actuators without transmission, and lift control is achieved simply by changing the voltage amplitude. However, misalignment or modeling errors in lift force can cause offsets. Therefore, an adaptive controller was designed to compensate for such offsets. Numerical experiments confirm that the proposed controller outperforms a conventional linear quadratic integral controller under unknown offset conditions. Finally, in a tethered and controlled flight experiment, yaw drift was suppressed by combining the tilted-wing arrangement with the proposed controller.
comment: Accepted manuscript
AME-2: Agile and Generalized Legged Locomotion via Attention-Based Neural Map Encoding
Achieving agile and generalized legged locomotion across terrains requires tight integration of perception and control, especially under occlusions and sparse footholds. Existing methods have demonstrated agility on parkour courses but often rely on end-to-end sensorimotor models with limited generalization and interpretability. By contrast, methods targeting generalized locomotion typically exhibit limited agility and struggle with visual occlusions. We introduce AME-2, a unified reinforcement learning (RL) framework for agile and generalized locomotion that incorporates a novel attention-based map encoder in the control policy. This encoder extracts local and global mapping features and uses attention mechanisms to focus on salient regions, producing an interpretable and generalized embedding for RL-based control. We further propose a learning-based mapping pipeline that provides fast, uncertainty-aware terrain representations robust to noise and occlusions, serving as policy inputs. It uses neural networks to convert depth observations into local elevations with uncertainties, and fuses them with odometry. The pipeline also integrates with parallel simulation so that we can train controllers with online mapping, aiding sim-to-real transfer. We validate AME-2 with the proposed mapping pipeline on a quadruped and a biped robot, and the resulting controllers demonstrate strong agility and generalization to unseen terrains in simulation and in real-world experiments.
comment: under review
Evaluating Factor-Wise Auxiliary Dynamics Supervision for Latent Structure and Robustness in Simulated Humanoid Locomotion
We evaluate whether factor-wise auxiliary dynamics supervision produces useful latent structure or improved robustness in simulated humanoid locomotion. DynaMITE -- a transformer encoder with a factored 24-d latent trained by per-factor auxiliary losses during proximal policy optimization (PPO) -- is compared against Long Short-Term Memory (LSTM), plain Transformer, and Multilayer Perceptron (MLP) baselines on a Unitree G1 humanoid across four Isaac Lab tasks. The supervised latent shows no evidence of decodable or functionally separable factor structure: probe R^2 ~ 0 for all five dynamics factors, clamping any subspace changes reward by < 0.05, and standard disentanglement metrics (MIG, DCI, SAP) are near zero. An unsupervised LSTM hidden state achieves higher probe R^2 (up to 0.10). A 2x2 factorial ablation (n = 10 seeds) isolates the contributions of the tanh bottleneck and auxiliary losses: the auxiliary losses show no measurable effect on either in-distribution (ID) reward (+0.03, p = 0.732) or severe out-of-distribution (OOD) reward (+0.03, p = 0.669), while the bottleneck shows a small, consistent advantage in both regimes (ID: +0.16, p = 0.207; OOD: +0.10, p = 0.208). The bottleneck advantage persists under severe combined perturbation but does not amplify, indicating a training-time representation benefit rather than a robustness mechanism. LSTM achieves the best nominal reward on all four tasks (p < 0.03); DynaMITE degrades less under combined-shift stress (2.3% vs. 16.7%), but this difference is attributable to the bottleneck compression, not the auxiliary supervision. For locomotion practitioners: auxiliary dynamics supervision does not produce an interpretable estimator and does not measurably improve reward or robustness beyond what the bottleneck alone provides; recurrent baselines remain the stronger choice for nominal performance.
comment: 17 pages, 9 figures, 25 tables
PA-LVIO: Real-Time LiDAR-Visual-Inertial Odometry and Mapping with Pose-Only Bundle Adjustment
Real-time LiDAR-visual-inertial odometry and mapping is crucial for navigation and planning tasks in intelligent transportation systems. This study presents a pose-only bundle adjustment (PA) LiDAR-visual-inertial odometry (LVIO), named PA-LVIO, to meet the urgent need for real-time navigation and mapping. The proposed PA framework for LiDAR and visual measurements is highly accurate and efficient, and it can derive reliable frame-to-frame constraints within multiple frames. A marginalization-free and frame-to-map (F2M) LiDAR measurement model is integrated into the state estimator to eliminate odometry drift. Meanwhile, an IMU-centric online spatial-temporal calibration is employed to obtain a pixel-wise LiDAR-camera alignment. With accurately estimated odometry and extrinsics, a high-quality, RGB-rendered point-cloud map can be built. Comprehensive experiments are conducted on both public and private datasets collected by a wheeled robot, an unmanned aerial vehicle (UAV), and handheld devices, comprising 28 sequences and more than 50 km of trajectories. The results demonstrate that the proposed PA-LVIO yields superior or comparable performance to state-of-the-art LVIO methods in terms of odometry accuracy and mapping quality. In addition, PA-LVIO runs in real time on both a desktop PC and an onboard ARM computer. The code and datasets are open-sourced on GitHub (https://github.com/i2Nav-WHU/PA-LVIO) to benefit the community.
comment: 14 pages, 10 figures
Risk-Aware Obstacle Avoidance Algorithm for Real-Time Applications
Robust navigation in changing marine environments requires autonomous systems capable of perceiving, reasoning, and acting under uncertainty. This study introduces a hybrid risk-aware navigation architecture that integrates probabilistic modeling of obstacles along the vehicle path with smooth trajectory optimization for autonomous surface vessels. The system constructs probabilistic risk maps that capture both obstacle proximity and the behavior of dynamic objects. A risk-biased Rapidly Exploring Random Tree (RRT) planner leverages these maps to generate collision-free paths, which are subsequently refined using B-spline algorithms to ensure trajectory continuity. Three distinct RRT* rewiring modes are implemented based on the cost function: minimizing the path length, minimizing risk, and optimizing a combination of the path length and total risk. The framework is evaluated in experimental scenarios containing both static and dynamic obstacles. The results demonstrate the system's ability to navigate safely, maintain smooth trajectories, and dynamically adapt to changing environmental risks. Compared with conventional LIDAR or vision-only navigation approaches, the proposed method shows improvements in operational safety and autonomy, establishing it as a promising solution for risk-aware autonomous vehicle missions in uncertain and dynamic environments.
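The three rewiring cost modes named in the abstract above (path length, risk, and a combination of both) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `risk_at` callback, and the linear `weight` blend are all assumptions introduced here.

```python
import math

def path_length(path):
    """Sum of Euclidean distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def path_risk(path, risk_at):
    """Total risk accumulated over the waypoints (risk_at is a hypothetical risk-map lookup)."""
    return sum(risk_at(p) for p in path)

def path_cost(path, risk_at, mode="combined", weight=0.5):
    """Cost under one of three modes: 'length', 'risk', or 'combined'.

    The linear blend with `weight` is an assumed form of the combined cost.
    """
    if mode == "length":
        return path_length(path)
    if mode == "risk":
        return path_risk(path, risk_at)
    if mode == "combined":
        return (1.0 - weight) * path_length(path) + weight * path_risk(path, risk_at)
    raise ValueError(f"unknown mode: {mode}")

# Example: a toy risk field that grows with the x coordinate.
example_path = [(0.0, 0.0), (3.0, 4.0), (3.0, 8.0)]
toy_risk = lambda p: 0.1 * p[0]
print(path_cost(example_path, toy_risk, mode="length"))
print(path_cost(example_path, toy_risk, mode="combined"))
```

In a rewiring step, a planner would compare the cost of reaching a node through its current parent against the cost through a candidate parent, using whichever mode is selected.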
nuScenes Revisited: Progress and Challenges in Autonomous Driving
Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how nuScenes influenced a large number of datasets released later and how it defined numerous standards that the community uses to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.
comment: 18 pages, 17 figures
Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis
Equipping humanoid robots with versatile interaction skills typically requires either extensive policy training or explicit human-to-robot motion retargeting. However, learning-based policies face prohibitive data collection costs. Meanwhile, retargeting relies on human-centric pose estimation (e.g., SMPL), introducing a morphology gap. Skeletal scale mismatches result in severe spatial misalignments when mapped to robots, compromising interaction success. In this work, we propose Dream2Act, a robot-centric framework enabling zero-shot interaction through generative video synthesis. Given a third-person image of the robot and target object, our framework leverages video generation models to envision the robot completing the task with morphology-consistent motion. We employ a high-fidelity pose extraction system to recover physically feasible, robot-native joint trajectories from these synthesized dreams, subsequently executed via a general-purpose whole-body controller. Operating strictly within the robot-native coordinate space, Dream2Act avoids retargeting errors and eliminates task-specific policy training. We evaluate Dream2Act on the Unitree G1 across four whole-body mobile interaction tasks: ball kicking, sofa sitting, bag punching, and box hugging. Dream2Act achieves a 37.5% overall success rate, compared to 0% for conventional retargeting. While retargeting fails to establish correct physical contacts due to the morphology gap (with errors compounded during locomotion), Dream2Act maintains robot-consistent spatial alignment, enabling reliable contact formation and substantially higher task completion.
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences CVPR 2026
Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
comment: CVPR 2026; 20 pages, 7 figures, 11 tables; Code at https://github.com/worldbench/U4D
Background Fades, Foreground Leads: Curriculum-Guided Background Pruning for Efficient Foreground-Centric Collaborative Perception ICRA 2026
Collaborative perception enhances the reliability and spatial coverage of autonomous vehicles by sharing complementary information across vehicles, offering a promising solution to long-tail scenarios that challenge single-vehicle perception. However, the bandwidth constraints of vehicular networks make transmitting the entire feature map impractical. Recent methods, therefore, adopt a foreground-centric paradigm, transmitting only predicted foreground-region features while discarding the background, which encodes essential context. We propose FadeLead, a foreground-centric framework that overcomes this limitation by learning to encapsulate background context into compact foreground features during training. At the core of our design is a curricular learning strategy that leverages background cues early on but progressively prunes them away, forcing the model to internalize context into foreground representations without transmitting background itself. Extensive experiments on both simulated and real-world benchmarks show that FadeLead outperforms prior methods under different bandwidth settings, underscoring the effectiveness of context-enriched foreground sharing.
comment: ICRA 2026
GHOST: Ground-projected Hypotheses from Observed Structure-from-Motion Trajectories
We present a scalable self-supervised approach for segmenting feasible vehicle trajectories from monocular images for autonomous driving in complex urban environments. Leveraging large-scale dashcam videos, we treat recorded ego-vehicle motion as implicit supervision and recover camera trajectories via monocular structure-from-motion, projecting them onto the ground plane to generate spatial masks of traversed regions without manual annotation. These automatically generated labels are used to train a deep segmentation network that predicts motion-conditioned path proposals from a single RGB image at run time, without explicit modeling of road or lane markings. Trained on diverse, unconstrained internet data, the model implicitly captures scene layout, lane topology, and intersection structure, and generalizes across varying camera configurations. We evaluate our approach on nuScenes, demonstrating reliable trajectory prediction, and further show transfer to an electric scooter platform through light fine-tuning. Our results indicate that large-scale ego-motion distillation yields structured and generalizable path proposals beyond the demonstrated trajectory, enabling trajectory hypothesis estimation via image segmentation.
comment: 8 pages, 27 figures, 1 table
Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning
Reinforcement learning in massively parallel physics simulations has driven major progress in sim-to-real robot learning. However, current approaches remain brittle and task-specific, relying on extensive per-task engineering to design rewards, curricula, and demonstrations. Even with this engineering, they often fail on long-horizon, contact-rich manipulation tasks and do not meaningfully scale with compute, as performance quickly saturates when training revisits the same narrow regions of state space. We introduce OmniReset, a simple and scalable framework that enables on-policy reinforcement learning to robustly solve a broad class of dexterous manipulation tasks using a single reward function, fixed algorithm hyperparameters, no curricula, and no human demonstrations. Our key insight is that long-horizon exploration can be dramatically simplified by using simulator resets to systematically expose the RL algorithm to the diverse set of robot-object interactions which underlie dexterous manipulation. OmniReset programmatically generates such resets with minimal human input, converting additional compute directly into broader behavioral coverage and continued performance gains. We show that OmniReset gracefully scales to long-horizon dexterous manipulation tasks beyond the capabilities of existing approaches and is able to learn robust policies over significantly wider ranges of initial conditions than baselines. Finally, we distill OmniReset into visuomotor policies which display robust retrying behavior and substantially higher success rates than baselines when transferred to the real world zero-shot. Project webpage: https://omnireset.github.io
NL2SpaTiaL: Generating Geometric Spatio-Temporal Logic Specifications from Natural Language for Manipulation Tasks
While Temporal Logic provides a rigorous verification framework for robotics, it typically operates on trajectory-level signals and does not natively represent the object-centric geometric relations that are central to manipulation. Spatio-Temporal Logic (SpaTiaL) overcomes this by explicitly capturing geometric spatial requirements, making it a natural formalism for manipulation-task verification. Consequently, translating natural language (NL) into verifiable SpaTiaL specifications is a critical objective. Yet, existing NL-to-Logic methods treat specifications as flat sequences, entangling nested temporal scopes with spatial relations and causing performance to degrade sharply under deep nesting. We propose NL2SpaTiaL, a framework modeling specifications as Hierarchical Logical Trees (HLT). By generating formulas as structured HLTs in a single shot, our approach decouples semantic parsing from syntactic rendering, aligning with human compositional spatial reasoning. To support this, we construct, to the best of our knowledge, the first NL-to-SpaTiaL dataset with explicit hierarchical supervision via a logic-first synthesis pipeline. Experiments with open-weight LLMs demonstrate that our HLT formulation significantly outperforms flat-generation baselines across various logical depths. These results show that explicit HLT structure is critical for scalable NL-to-SpaTiaL translation, ultimately enabling a rigorous "generate-and-test" paradigm for verifying candidate trajectories in language-conditioned robotics. Project website: https://sites.google.com/view/nl2spatial
db-LaCAM: Fast and Scalable Multi-Robot Kinodynamic Motion Planning with Discontinuity-Bounded Search and Lightweight MAPF
State-of-the-art multi-robot kinodynamic motion planners struggle to handle more than a few robots due to high computational burden, which limits their scalability and results in slow planning time. In this work, we combine the scalability and speed of modern multi-agent path finding (MAPF) algorithms with the dynamic-awareness of kinodynamic planners to address these limitations. To this end, we propose discontinuity-Bounded LaCAM (db-LaCAM), a planner that utilizes a precomputed set of motion primitives that respect robot dynamics to generate horizon-length motion sequences, while allowing a user-defined discontinuity between successive motions. The planner db-LaCAM is resolution-complete with respect to motion primitives and supports arbitrary robot dynamics. Extensive experiments demonstrate that db-LaCAM scales efficiently to scenarios with up to 50 robots, achieving up to ten times faster runtime compared to state-of-the-art planners, while maintaining comparable solution quality. The approach is validated in both 2D and 3D environments with dynamics such as the unicycle and 3D double integrator. We demonstrate the safe execution of trajectories planned with db-LaCAM in two distinct physical experiments involving teams of flying robots and car-with-trailer robots.
EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation
Robotic imitation learning has achieved impressive success in learning complex manipulation behaviors from demonstrations. However, many existing robot learning methods do not explicitly account for the physical symmetries of robotic systems, often resulting in asymmetric or inconsistent behaviors under symmetric observations. This limitation is particularly pronounced in dual-arm manipulation, where bilateral symmetry is inherent to both the robot morphology and the structure of many tasks. In this paper, we introduce EquiBim, a symmetry-equivariant policy learning framework for bimanual manipulation that enforces bilateral equivariance between observations and actions during training. Our approach formulates physical symmetry as a group action on both observation and action spaces, and imposes an equivariance constraint on policy predictions under symmetric transformations. The framework is model-agnostic and can be seamlessly integrated into a wide range of imitation learning pipelines with diverse observation modalities and action representations, including point cloud-based and image-based policies, as well as both end-effector-space and joint-space parameterizations. We evaluate EquiBim in simulation on RoboTwin, a dual-arm robotic platform with symmetric kinematics, across diverse observation and action configurations. We further validate the approach on a real-world dual-arm system. Across both simulation and physical experiments, our method consistently improves performance and robustness under distribution shifts. These results suggest that explicitly enforcing physical symmetry provides a simple yet effective inductive bias for bimanual robot learning.
comment: 8 pages, 6 figures
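The equivariance constraint described in the EquiBim abstract above, that a policy's prediction on a mirrored observation should equal the mirrored prediction, can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than the paper's method: the `mirror` transform (swap arms, negate the lateral y coordinate), the two toy policies, and the squared-error loss are all hypothetical choices made here.

```python
def mirror(pair):
    """Assumed bilateral reflection: swap left/right arms and negate each y coordinate."""
    left, right = pair
    flip = lambda v: [v[0], -v[1]]
    return (flip(right), flip(left))

def equivariance_loss(policy, obs):
    """Squared error between policy(mirror(obs)) and mirror(policy(obs))."""
    predicted = policy(mirror(obs))
    target = mirror(policy(obs))
    return sum(
        (p - t) ** 2
        for arm_p, arm_t in zip(predicted, target)
        for p, t in zip(arm_p, arm_t)
    )

def symmetric_policy(obs):
    """Toy policy that treats both arms identically, so it commutes with mirror."""
    return tuple([0.5 * x for x in arm] for arm in obs)

def biased_policy(obs):
    """Toy policy with a fixed offset on the left arm's y action, breaking the symmetry."""
    left, right = symmetric_policy(obs)
    return ([left[0], left[1] + 1.0], right)

obs = ([1.0, 2.0], [3.0, 4.0])
print(equivariance_loss(symmetric_policy, obs))
print(equivariance_loss(biased_policy, obs))
```

In a training pipeline, a term of this kind would typically be added to the imitation loss so that gradient updates push the policy toward symmetric behavior; how the paper weights or applies its constraint is not specified in the abstract.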
Point What You Mean: Visually Grounded Instruction Policy
Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling
Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, where the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. For evaluation, Video2Act surpasses previous state-of-the-art VLA methods by 7.7% in simulation and 21.7% in real-world tasks in terms of average success rate, further exhibiting strong generalization capabilities.
Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance
With the growth of intelligent civil infrastructure and smart cities, operation and maintenance (O&M) increasingly requires safe, efficient, and energy-conscious robotic manipulation of articulated components, including access doors, service drawers, and pipeline valves. However, existing robotic approaches either focus primarily on grasping or target object-specific articulated manipulation, and they rarely incorporate explicit actuation energy into multi-objective optimisation, which limits their scalability and suitability for long-term deployment in real O&M settings. Therefore, this paper proposes an articulation-agnostic and energy-aware reinforcement learning framework for robotic manipulation in intelligent infrastructure O&M. The method combines part-guided 3D perception, weighted point sampling, and PointNet-based encoding to obtain a compact geometric representation that generalises across heterogeneous articulated objects. Manipulation is formulated as a Constrained Markov Decision Process (CMDP), in which actuation energy is explicitly modelled and regulated via a Lagrangian-based constrained Soft Actor-Critic scheme. The policy is trained end-to-end under this CMDP formulation, enabling effective articulated-object operation while satisfying a long-horizon energy budget. Experiments on representative O&M tasks demonstrate 16%-30% reductions in energy consumption, 16%-32% fewer steps to success, and consistently high success rates, indicating a scalable and sustainable solution for infrastructure O&M manipulation.
comment: 18 pages, 5 figures, 7 tables. This version supersedes all previous preprint versions
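The Lagrangian treatment of the energy constraint mentioned in the abstract above can be sketched with the standard dual-ascent update used in constrained reinforcement learning. This is a generic illustration under stated assumptions, not the paper's training code: the function names, the learning rate, and the numeric values are hypothetical.

```python
def update_multiplier(lam, avg_energy, budget, lr=0.05):
    """Dual ascent: raise lambda when average energy exceeds the budget, lower it otherwise.

    The multiplier is clipped at zero so the constraint can only penalize, never reward.
    """
    return max(0.0, lam + lr * (avg_energy - budget))

def penalized_reward(reward, energy, lam):
    """Reward seen by the policy under the Lagrangian relaxation of the energy constraint."""
    return reward - lam * energy

# Example: the multiplier grows while measured energy stays above the budget.
lam = 0.0
for avg_energy in [12.0, 11.5, 11.0, 9.0]:
    lam = update_multiplier(lam, avg_energy, budget=10.0)
    print(round(lam, 3), round(penalized_reward(1.0, avg_energy, lam), 3))
```

The design choice here is that the trade-off weight is learned rather than hand-tuned: the multiplier settles near the value at which the long-run energy cost meets the budget.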
Design, Mapping, and Contact Anticipation with 3D-printed Whole-Body Tactile and Proximity Sensors ICRA
Robots operating in dynamic and shared environments benefit from anticipating contact before it occurs. We present GenTact-Prox, a fully 3D-printed artificial skin that integrates tactile and proximity sensing for contact detection and anticipation. The artificial skin platform is modular in design, procedurally generated to fit any robot morphology, and can cover the whole body of a robot. The skin achieved detection ranges of up to 18 cm during evaluation. To characterize how robots perceive nearby space through this skin, we introduce a data-driven framework for mapping the Perisensory Space -- the body-centric volume of space around the robot where sensors provide actionable information for contact anticipation. We demonstrate this approach on a Franka Research 3 robot equipped with five GenTact-Prox units, enabling online object-aware operation and contact prediction.
comment: This work was accepted at the International Conference on Robotics and Automation (ICRA) 2026
ProbeMDE: Uncertainty-Guided Active Proprioception for Monocular Depth Estimation in Surgical Robotics ICRA 2026
Monocular depth estimation (MDE) provides a useful tool for robotic perception, but its predictions are often uncertain and inaccurate in challenging environments such as surgical scenes where textureless surfaces, specular reflections, and occlusions are common. To address this, we propose ProbeMDE, a cost-aware active sensing framework that combines RGB images with sparse proprioceptive measurements for MDE. Our approach utilizes an ensemble of MDE models to predict dense depth maps conditioned on both RGB images and on a sparse set of known depth measurements obtained via proprioception, where the robot has touched the environment in a known configuration. We quantify predictive uncertainty via the ensemble's variance and measure the gradient of the uncertainty with respect to candidate measurement locations. To prevent mode collapse while selecting maximally informative locations to propriocept (touch), we leverage Stein Variational Gradient Descent (SVGD) over this gradient map. We validate our method in both simulated and physical experiments on central airway obstruction surgical phantoms. Our results demonstrate that our approach outperforms baseline methods across standard depth estimation metrics, achieving higher accuracy while minimizing the number of required proprioceptive measurements. Project page: https://brittonjordan.github.io/probe_mde/
comment: 8 pages, 5 figures. Accepted at ICRA 2026. Project page: https://brittonjordan.github.io/probe_mde/
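The ensemble-variance uncertainty described in the ProbeMDE abstract above can be illustrated on tiny depth maps. Note the substitution: the paper selects touch locations with Stein Variational Gradient Descent over an uncertainty gradient map, whereas this sketch uses a plain greedy argmax over per-pixel variance purely for illustration. All function names and the toy data are assumptions.

```python
from statistics import pvariance

def ensemble_variance(depth_maps):
    """Per-pixel population variance across an ensemble of equally sized depth maps."""
    rows, cols = len(depth_maps[0]), len(depth_maps[0][0])
    return [
        [pvariance([m[r][c] for m in depth_maps]) for c in range(cols)]
        for r in range(rows)
    ]

def most_uncertain_pixel(var_map):
    """Greedy stand-in for location selection: the (row, col) with the largest variance."""
    coords = ((r, c) for r in range(len(var_map)) for c in range(len(var_map[0])))
    return max(coords, key=lambda rc: var_map[rc[0]][rc[1]])

# Three hypothetical ensemble members that disagree only at pixel (1, 1).
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 6.0]],
    [[1.0, 2.0], [3.0, 5.0]],
]
var_map = ensemble_variance(maps)
print(most_uncertain_pixel(var_map))
```

A greedy argmax tends to pick clustered locations when several measurements are requested, which is one reason a diversity-preserving method is attractive for choosing multiple touch points.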
EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
comment: Project page: https://eva-project-page.github.io/
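The EVA abstract above describes a reward that favors smooth motion, measured by velocity, acceleration, and jerk, and penalizes actions that violate embodiment constraints. A minimal sketch of such a reward for a single scalar joint command follows. The finite-difference scheme, the weights, the bounds, and the time step are all assumptions made for illustration; the paper's actual reward terms are not given in the abstract.

```python
def finite_diff(seq, dt):
    """First-order finite difference of a scalar sequence."""
    return [(b - a) / dt for a, b in zip(seq, seq[1:])]

def mean_sq(xs):
    """Mean of squares, defined as 0.0 for an empty sequence."""
    return sum(x * x for x in xs) / len(xs) if xs else 0.0

def smoothness_reward(actions, dt=0.1, low=-1.0, high=1.0,
                      w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Negative penalty on velocity, acceleration, jerk, and out-of-bound actions.

    All weights and bounds are hypothetical placeholders.
    """
    vel = finite_diff(actions, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    out_of_bound = sum(1 for a in actions if a < low or a > high) / len(actions)
    penalty = (w_vel * mean_sq(vel) + w_acc * mean_sq(acc)
               + w_jerk * mean_sq(jerk) + w_bound * out_of_bound)
    return -penalty

smooth = [0.0, 0.25, 0.5, 0.75, 1.0]
jittery = [0.0, 1.0, 0.0, 1.0, 0.0]
print(smoothness_reward(smooth) > smoothness_reward(jittery))
```

Because visually corrupted rollouts tend to decode into erratic or out-of-range actions, a penalty of this shape stays informative even when the frames themselves are hard to score.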
Parametric Design of a Cable-Driven Coaxial Spherical Parallel Mechanism for Ultrasound Scans
Haptic interfaces play a critical role in medical teleoperation by enabling surgeons to interact with remote environments through realistic force and motion feedback. Achieving high fidelity in such systems requires balancing the trade-offs among workspace, dexterity, stiffness, inertia, and bandwidth, particularly in applications demanding pure rotational motion. This paper presents the design methodology and kinematic analysis of a Cable-Driven Coaxial Spherical Parallel Mechanism (CDC-SPM) developed to address these challenges. The proposed approach focuses on the mechanical design and parametric synthesis of the mechanism to meet task-specific requirements in medical applications. In particular, the design enables the relocation of the center of rotation to an external point corresponding to the tool-tissue interaction, while ensuring appropriate workspace coverage and collision avoidance. The proposed cable-driven interface design allows for reducing the mass placed at the robot arm end-effector, thereby minimizing inertial loads, enhancing stiffness, and improving dynamic responsiveness. Through parallel and coaxial actuation, the mechanism achieves decoupled rotational degrees of freedom with isotropic force and torque transmission. A prototype is developed to validate the mechanical feasibility and kinematic behavior of the proposed mechanism. These results demonstrate the suitability of the proposed mechanism design for future integration into haptic interfaces for medical applications such as ultrasound imaging.
Physically Accurate Rigid-Body Dynamics in Particle-Based Simulation IROS 2026
Robotics demands simulation that can reason about the diversity of real-world physical interactions, from rigid to deformable objects and fluids. Current simulators address this by stitching together multiple subsolvers for different material types, resulting in a compositional architecture that complicates physical reasoning. Particle-based simulators offer a compelling alternative, representing all materials through a single unified formulation that enables seamless cross-material interactions. Among particle-based simulators, position-based dynamics (PBD) is a popular solver known for its computational efficiency and visual plausibility. However, its lack of physical accuracy has limited its adoption in robotics. To leverage the benefits of particle-based solvers while meeting the physical fidelity demands of robotics, we introduce PBD-R, a revised PBD formulation that enforces physically accurate rigid-body dynamics through a novel momentum-conservation constraint and a modified velocity update. Additionally, we introduce a solver-agnostic benchmark with analytical solutions to evaluate physical accuracy. Using this benchmark, we show that PBD-R significantly outperforms PBD and achieves competitive accuracy with MuJoCo while requiring less computation.
comment: Submitted to IROS 2026
Delay-Aware Diffusion Policy: Bridging the Observation-Execution Gap in Dynamic Tasks
As a robot senses and selects actions, the world keeps changing. This inference delay creates a gap of tens to hundreds of milliseconds between the observed state and the state at execution. In this work, we take the natural generalization from zero delay to measured delay during training and inference. We introduce Delay-Aware Diffusion Policy (DA-DP), a framework for explicitly incorporating inference delays into policy learning. DA-DP corrects zero-delay trajectories to their delay-compensated counterparts, and augments the policy with delay conditioning. We empirically validate DA-DP on a variety of tasks, robots, and delays and find its success rate more robust to delay than delay-unaware methods. DA-DP is architecture agnostic and transfers beyond diffusion policies, offering a general pattern for delay-aware imitation learning. More broadly, DA-DP encourages evaluation protocols that report performance as a function of measured latency, not just task difficulty.
RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems
Embodied intelligence aims to enable robots to learn, reason, and generalize robustly across complex real-world environments. However, existing approaches often struggle with partial observability, fragmented spatial reasoning, and inefficient integration of heterogeneous memories, limiting their capacity for long-horizon adaptation. To address this, we introduce RoboMemory, a brain-inspired framework that unifies Spatial, Temporal, Episodic, and Semantic memory within a parallelized architecture for efficient long-horizon planning and interactive learning. Its core innovations are a dynamic spatial knowledge graph for scalable, consistent memory updates and a closed-loop planner with a critic module for adaptive decision-making. Extensive experiments on EmbodiedBench show that RoboMemory, instantiated with Qwen2.5-VL-72B-Ins, improves the average success rate by 26.5% over its strong baseline and even surpasses the closed-source SOTA, Claude-3.5-Sonnet. Real-world trials further confirm its capability for cumulative learning, with performance consistently improving over repeated tasks. Our results position RoboMemory as a scalable foundation for memory-augmented embodied agents, bridging insights from cognitive neuroscience with practical robotic autonomy.
A Real-Time Control Barrier Function-Based Safety Filter for Motion Planning with Arbitrary Road Boundary Constraints
We present a real-time safety filter for motion planners, including learning-based ones, using Control Barrier Functions (CBFs) to provide formal guarantees for collision avoidance with road boundaries. A key feature of our approach is its ability to directly incorporate road geometries of arbitrary shape that are represented as polylines without resorting to conservative overapproximations. We formulate the safety filter as a Quadratic Program (QP), which achieves safety by making minimal, necessary adjustments to the control actions issued by the nominal motion planner. We validate our safety filter through extensive numerical experiments across a variety of traffic scenarios featuring complex road boundaries. The results confirm its reliable safety and high computational efficiency (execution frequency up to 40 Hz). Code reproducing our experimental results and a video demonstration are available at github.com/bassamlab/SigmaRL.
comment: Published version, see https://doi.org/10.1109/ITSC60802.2025.11423203
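The minimal-adjustment idea behind a CBF safety filter can be illustrated with a toy case. The sketch below is not the paper's formulation: it assumes a 2D single-integrator vehicle (position rate equals the control input), a single barrier function h = distance to the polyline minus a margin, and a single affine constraint, for which the QP has a closed-form solution. The function names, `margin`, and `alpha` are hypothetical.

```python
import math


def closest_point_on_polyline(p, polyline):
    """Return (distance, closest_point) from point p to a polyline of (x, y) vertices."""
    best = (math.inf, None)
    for (ax, ay), (bx, by) in zip(polyline, polyline[1:]):
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        # Projection parameter of p onto the segment, clamped to [0, 1].
        if seg_len2 == 0.0:
            t = 0.0
        else:
            t = max(0.0, min(1.0, ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len2))
        q = (ax + t * dx, ay + t * dy)
        d = math.hypot(p[0] - q[0], p[1] - q[1])
        if d < best[0]:
            best = (d, q)
    return best


def safety_filter(p, u_nom, polyline, margin=0.5, alpha=1.0):
    """Solve min ||u - u_nom||^2 subject to grad_h . u >= -alpha * h in closed form.

    Assumes single-integrator dynamics and h = distance(p, polyline) - margin.
    """
    d, q = closest_point_on_polyline(p, polyline)
    if d == 0.0:
        raise ValueError("point lies on the boundary; distance gradient is undefined")
    h = d - margin
    g = ((p[0] - q[0]) / d, (p[1] - q[1]) / d)  # unit gradient of the distance
    lhs = g[0] * u_nom[0] + g[1] * u_nom[1]
    rhs = -alpha * h
    if lhs >= rhs:
        return u_nom  # nominal control already satisfies the CBF condition
    lam = rhs - lhs  # ||g|| = 1, so the projection step is simply rhs - lhs
    return (u_nom[0] + lam * g[0], u_nom[1] + lam * g[1])
```

For a straight boundary along the x-axis, `safety_filter((5.0, 1.0), (1.0, -2.0), [(0.0, 0.0), (10.0, 0.0)])` leaves the tangential component untouched and only reduces the approach speed toward the boundary. With several constraints or input bounds the closed form no longer applies and a QP solver is needed.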
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at qdigvla.github.io.
Co-Designing a Peer Social Robot for Young Newcomers' Language and Cultural Learning
Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and incorporated them into an integrated prototype that uses short story-based activities, multi-modal scaffolding, and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-the-loop SARs supporting language socialization in community settings and outline directions for child-centered evaluation in authentic programs.
Symmetry-Guided Memory Augmentation for Efficient Locomotion Learning
Training reinforcement learning (RL) policies for legged locomotion often requires extensive environment interactions, which are costly and time-consuming. We propose Symmetry-Guided Memory Augmentation (SGMA), a framework that improves training efficiency by combining structured experience augmentation with memory-based context inference. Our method leverages robot and task symmetries to generate additional, physically consistent training experiences without requiring extra interactions. To avoid the pitfalls of naive augmentation, we extend these transformations to the policy's memory states, enabling the agent to retain task-relevant context and adapt its behavior accordingly. We evaluate the approach on quadruped and humanoid robots in simulation, as well as on a real quadruped platform. Across diverse locomotion tasks involving joint failures and payload variations, our method achieves efficient policy training while maintaining robust performance, demonstrating a practical route toward data-efficient RL for legged robots.
Dynamic Neural Potential Field: Online Trajectory Optimization in the Presence of Moving Obstacles
Generalist robot policies must operate safely and reliably in everyday human environments such as homes, offices, and warehouses, where people and objects move unpredictably. We present Dynamic Neural Potential Field (NPField-GPT), a learning-enhanced model predictive control (MPC) framework that couples classical optimization with a Transformer-based predictor of footprint-aware repulsive potentials. Given an occupancy sub-map, robot footprint, and optional dynamic-obstacle cues, our NPField-GPT model forecasts a horizon of differentiable potentials that are injected into a sequential quadratic MPC program via L4CasADi, yielding real-time, constraint-aware trajectory optimization. We additionally study two baselines: NPField-StaticMLP, where a dynamic scene is treated as a sequence of static maps; and NPField-DynamicMLP, which predicts the future potential sequence in parallel with an MLP. In dynamic indoor scenarios from BenchMR and on a Husky UGV in office corridors, NPField-GPT produces more efficient and safer trajectories under motion changes, while StaticMLP/DynamicMLP offer lower latency. We also compare with the CIAO* and MPPI baselines. Across methods, the Transformer+MPC synergy preserves the transparency and stability of model-based planning while learning only the part that benefits from data: spatiotemporal collision risk. Code and trained models are available at https://github.com/CognitiveAISystems/Dynamic-Neural-Potential-Field
TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation
Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving success rate by an average of 20% in disassembly and 60% in in-box picking, with a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
comment: 9 pages, 7 figures
Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning
Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.
ManiDreams: An Open-Source Library for Robust Object Manipulation via Uncertainty-aware Task-specific Intuitive Physics
Dynamics models, whether simulators or learned world models, have long been central to robotic manipulation, but most focus on minimizing prediction error rather than confronting a more fundamental challenge: real-world manipulation is inherently uncertain. We argue that robust manipulation under uncertainty is fundamentally an integration problem: uncertainties must be represented, propagated, and constrained within the planning loop, not merely suppressed during training. We present and open-source ManiDreams, a modular framework for uncertainty-aware manipulation planning over intuitive physics models. It realizes this integration through composable abstractions for distributional state representation, backend-agnostic dynamics prediction, and declarative constraint specification for action optimization. The framework explicitly addresses three sources of uncertainty: perceptual, parametric, and structural. It wraps any base policy with a sample-predict-constrain loop that evaluates candidate actions against distributional outcomes, adding robustness without retraining. Experiments on ManiSkill tasks show that ManiDreams maintains robust performance under various perturbations where the RL baseline degrades significantly. Runnable examples on pushing, picking, catching, and real-world deployment demonstrate flexibility across different policies, optimizers, physics backends, and executors. The framework is publicly available at https://github.com/Rice-RobotPI-Lab/ManiDreams
comment: 9 pages, 10 figures. Project page at https://rice-robotpi-lab.github.io/ManiDreams/
Multiagent Systems
Behavioral Heterogeneity as Quantum-Inspired Representation
Driver heterogeneity is often reduced to labels or discrete regimes, compressing what is inherently dynamic into static categories. We introduce a quantum-inspired representation that models each driver as an evolving latent state, represented as a density matrix with structured mathematical properties. Behavioral observations are embedded via non-linear Random Fourier Features, while state evolution blends temporal persistence of behavior with context-dependent profile activation. We evaluate our approach on empirical driving data, Third Generation Simulation Data (TGSIM), showing how driving profiles are extracted and analyzed.
Planning over MAPF Agent Dependencies via Multi-Dependency PIBT
Modern Multi-Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands of agents in congested environments within a second, requiring highly efficient algorithms. Priority Inheritance with Backtracking (PIBT) is a popular algorithm capable of effectively planning in such situations. However, PIBT is constrained by its rule-based planning procedure and lacks generality because it restricts its search to paths that conflict with at most one other agent. This limitation also applies to Enhanced PIBT (EPIBT), a recent extension of PIBT. In this paper, we describe a new perspective on solving MAPF by planning over agent dependencies. Taking inspiration from PIBT's priority inheritance logic, we define the concept of agent dependencies and propose Multi-Dependency PIBT (MD-PIBT) that searches over agent dependencies. MD-PIBT is a general framework where specific parameterizations can reproduce PIBT and EPIBT. At the same time, alternative configurations yield novel planning strategies that are not expressible by PIBT or EPIBT. Our experiments demonstrate that MD-PIBT effectively plans for as many as 10,000 homogeneous agents under various kinodynamic constraints, including pebble motion, rotation motion, and differential drive robots with speed and acceleration limits. We perform thorough evaluations on different variants of MAPF and find that MD-PIBT is particularly effective in MAPF with large agents.
Designing Agentic AI-Based Screening for Portfolio Investment
We introduce a new agentic artificial intelligence (AI) platform for portfolio management. Our architecture consists of three layers. First, two large language model (LLM) agents are assigned specialized tasks: one agent screens for firms with desirable fundamentals, while a sentiment analysis agent screens for firms with desirable news. Second, these agents deliberate to generate and agree upon buy and sell signals from a large portfolio, substantially narrowing the pool of candidate assets. Finally, we apply a high-dimensional precision matrix estimation procedure to determine optimal portfolio weights. A defining theoretical feature of our framework is that the number of assets in the portfolio is itself a random variable, realized through the screening process. We introduce the concept of sensible screening and establish that, under mild screening errors, the squared Sharpe ratio of the screened portfolio consistently estimates its target. Empirically, our method achieves superior Sharpe ratios relative to an unscreened baseline portfolio and to conventional screening approaches, evaluated on S&P 500 data over the period 2020--2024.
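The final layer, turning a precision matrix into portfolio weights, can be illustrated with a small sketch. The abstract does not state which portfolio rule is used, so the code assumes the maximum-Sharpe-ratio (tangency) portfolio, w = Θμ / (1ᵀΘμ), whose squared Sharpe ratio is μᵀΘμ. It also assumes the precision matrix Θ and the expected excess returns μ of the screened assets are already estimated. All function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]


def tangency_weights(precision, mu):
    """Maximum-Sharpe-ratio weights w = Theta mu / (1' Theta mu), summing to one."""
    raw = matvec(precision, mu)
    total = sum(raw)
    if total == 0.0:
        raise ValueError("1' Theta mu is zero; weights cannot be normalized")
    return [r / total for r in raw]


def squared_sharpe(precision, mu):
    """Squared Sharpe ratio of the tangency portfolio, mu' Theta mu."""
    return sum(m * r for m, r in zip(mu, matvec(precision, mu)))


# Toy example: two screened assets with a diagonal precision matrix.
theta = [[2.0, 0.0], [0.0, 1.0]]
mu = [0.1, 0.2]
weights = tangency_weights(theta, mu)
```

Because the number of screened assets varies from period to period, the dimensions of `theta` and `mu` would change with each screening round in such a setup.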
Privacy-Aware Smart Cameras: View Coverage via Socially Responsible Coordination
Coordination of view coverage via privacy-aware smart cameras is key to a more socially responsible urban intelligence. Rather than maximizing view coverage at any cost or over-relying on expensive cryptographic techniques, we address how cameras can coordinate to legitimately monitor public spaces while excluding privacy-sensitive regions by design. This article proposes a decentralized framework in which interactive smart cameras coordinate to autonomously select their orientation via collective learning, while eliminating privacy violations via soft and hard constraint satisfaction. The approach scales to hundreds to thousands of cameras without any centralized control. Experimental evidence shows 18.42% higher coverage efficiency and 85.53% lower privacy violation than baselines and other state-of-the-art approaches. This significant advance further unravels practical guidelines for operators and policymakers: how the field of view, spatial placement, and budget of cameras operating by ethically-aligned artificial intelligence jointly influence coverage efficiency and privacy protection in large-scale and sensitive urban environments.
comment: This work has been submitted to the IEEE for possible publication
Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL
While Multi-Agent Reinforcement Learning (MARL) algorithms achieve unprecedented successes across complex continuous domains, their standard deployment strictly adheres to a synchronous operational paradigm. Under this paradigm, agents are universally forced to execute deep neural network inferences at every micro-frame, regardless of immediate necessity. This dense throughput acts as a fundamental barrier to physical deployment on edge devices where thermal and metabolic budgets are highly constrained. We propose Epistemic Time-Dilation MAPPO (ETD-MAPPO), augmented with a Dual-Gated Epistemic Trigger. Instead of depending on rigid frame-skipping (macro-actions), agents autonomously modulate their execution frequency by interpreting aleatoric uncertainty (via Shannon entropy of their policy) and epistemic uncertainty (via state-value divergence in a Twin-Critic architecture). To formalize this, we structure the environment as a Semi-Markov Decision Process (SMDP) and build the SMDP-Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirical findings demonstrate massive improvements (> 60% relative baseline acquisition leaps) over current temporal models. By assessing LBF, MPE, and the 115-dimensional state space of Google Research Football (GRF), ETD correctly prevented premature policy collapse. Remarkably, this unconstrained approach leads to emergent Temporal Role Specialization, reducing computational overhead by a statistically dominant 73.6% entirely during off-ball execution without deteriorating centralized task dominance.
comment: 14 pages, 5 figures. Code available at: https://github.com/xaiqo/edtmappo. Related materials available on Zenodo: 10.5281/zenodo.19206838
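The two uncertainty signals named in the abstract can be sketched as a simple gate. The abstract does not say how the gates are combined or thresholded, so the code assumes a logical OR with fixed thresholds; `should_run_inference` and both threshold values are hypothetical and do not come from the paper's code.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_run_inference(probs, v1, v2, entropy_threshold=0.5, divergence_threshold=0.1):
    """Hypothetical dual gate: trigger a fresh policy inference when either signal is high.

    probs:  action probabilities from the last policy evaluation
    v1, v2: state-value estimates from the two critics
    """
    aleatoric = policy_entropy(probs)  # spread of the policy itself
    epistemic = abs(v1 - v2)           # disagreement between the twin critics
    return aleatoric > entropy_threshold or epistemic > divergence_threshold
```

With these defaults, a confident policy whose critics agree, for example `should_run_inference([0.97, 0.01, 0.01, 0.01], 1.0, 1.02)`, does not trigger inference, while a uniform policy or strongly disagreeing critics do.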
Engagement-Zone-Aware Input-Constrained Guidance for Safe Target Interception in Contested Environments
We address target interception in contested environments in the presence of multiple defenders whose interception capability is limited by finite ranges. Conventional methods typically impose conservative stand-off constraints based on maximum engagement distance and neglect the interceptors' actuator limitations. Instead, we formulate safety constraints using defender-induced engagement zones. To account for actuator limits, the vehicle model is augmented with input saturation dynamics. A time-varying safe-set tightening parameter is introduced to compensate for transient constraint violations induced by actuator dynamics. To ensure scalable safety enforcement in multi-defender scenarios, a smooth aggregate safety function is constructed using a log-sum-exp operator combining individual threat measures associated with each defender's capability. A smooth switching guidance strategy is then developed to coordinate interception and safety objectives. The attacker pursues the target when sufficiently distant from threat boundaries and progressively activates evasive motion as the EZ boundaries are approached. The resulting controller relies only on relative measurements and does not require knowledge of defender control inputs, thus facilitating a fully distributed and scalable implementation. Rigorous analysis provides sufficient conditions guaranteeing target interception, practical safety with respect to all defender engagement zones, and satisfaction of actuator bounds. An input-constrained guidance law based on conservative stand-off distance is also developed to quantify the conservatism of maximum-range-based safety formulations. Simulations with stationary and maneuvering defenders demonstrate that the proposed formulation yields shorter interception paths and reduced interception time compared with conventional methods while maintaining safety throughout the engagement.
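The log-sum-exp aggregation mentioned above can be illustrated as a smooth under-approximation of the minimum over per-defender safety values. The sign convention (h_i >= 0 means safe with respect to defender i), the sharpness parameter `k`, and the function name are assumptions made for this sketch, not details from the paper.

```python
import math


def smooth_min(values, k=10.0):
    """Smooth lower bound on min(values): -(1/k) * log(sum(exp(-k * h_i))).

    Shifting by the true minimum keeps the exponentials numerically stable.
    """
    m = min(values)
    return m - math.log(sum(math.exp(-k * (h - m)) for h in values)) / k


# Three hypothetical per-defender safety margins.
margins = [1.0, 2.0, 3.0]
aggregate = smooth_min(margins, k=10.0)
```

The aggregate satisfies min(h) - log(N)/k <= smooth_min(h) <= min(h), so keeping it non-negative keeps every individual margin non-negative. A larger `k` tightens the approximation at the cost of sharper gradients.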
Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms: Challenges and a Roadmap
This article proposes a roadmap to address the current challenges in small-scale testbeds for Connected and Automated Vehicles (CAVs) and robot swarms. The roadmap is a joint effort of participants in the workshop "1st Workshop on Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms," held on June 2 at the IEEE Intelligent Vehicles Symposium (IV) 2024 in Jeju, South Korea. The roadmap contains three parts: 1) enhancing accessibility and diversity, especially for underrepresented communities, 2) sharing best practices for the development and maintenance of testbeds, and 3) connecting testbeds through an abstraction layer to support collaboration. The workshop features eight invited speakers, four contributed papers [1]-[4], and a presentation of a survey paper on testbeds [5]. The survey paper provides an online comparative table of more than 25 testbeds, available at https://bassamlab.github.io/testbeds-survey. The workshop's own website is available at https://cpm-remote.lrt.unibw-muenchen.de/iv24-workshop.
comment: Published version
Toward Data Systems That Are Business Semantic Centric and AI Agents Assisted
Contemporary businesses operate in dynamic environments requiring rapid adaptation to achieve goals and maintain competitiveness. Existing data platforms often fall short by emphasizing tools over alignment with business needs, resulting in inefficiencies and delays. To address this gap, I propose the Business Semantics Centric, AI Agents Assisted Data System (BSDS), a holistic system that integrates architecture, workflows, and team organization to ensure data systems are tailored to business priorities rather than dictated by technical constraints. BSDS redefines data systems as dynamic enablers of business success, transforming them from passive tools into active drivers of organizational growth. BSDS has a modular architecture that comprises curated data linked to business entities, a knowledge base for context-aware AI agents, and efficient data pipelines. AI agents play a pivotal role in assisting with data access and system management, reducing human effort, and improving scalability. Complementing this architecture, BSDS incorporates workflows optimized for both exploratory data analysis and production requirements, balancing speed of delivery with quality assurance. A key innovation of BSDS is its incorporation of the human factor. By aligning data team expertise with business semantics, BSDS bridges the gap between technical capabilities and business needs. Validated through real-world implementation, BSDS accelerates time-to-market for data-driven initiatives, enhances cross-functional collaboration, and provides a scalable blueprint for businesses of all sizes. Future research can build on BSDS to explore optimization strategies using complex systems and adaptive network theories, as well as developing autonomous data systems leveraging AI agents.
comment: Published by IEEE Access
Federated Learning for Data-Driven Feedforward Control: A Case Study on Vehicle Lateral Dynamics
In many control systems, tracking accuracy can be enhanced by combining (data-driven) feedforward (FF) control with feedback (FB) control. However, designing effective data-driven FF controllers typically requires large amounts of high-quality data and a dedicated design-of-experiment process. In practice, relevant data are often distributed across multiple systems, which not only introduces technical challenges but also raises regulatory and privacy concerns regarding data transfer. To address these challenges, we propose a framework that integrates Federated Learning (FL) into the data-driven FF control design. Each client trains a data-driven, neural FF controller using local data and provides only model updates to the global aggregation process, avoiding the exchange of raw data. We demonstrate our method through simulation for a vehicle trajectory-tracking task. Therein, a neural FF controller is learned collaboratively using FL. Our results show that the FL-based neural FF controller matches the performance of the centralized neural FF controller while reducing communication overhead and increasing data privacy.
comment: Accepted at ECC 2026
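The aggregation step, in which clients share only model parameters rather than raw data, can be sketched as follows. The abstract does not name the aggregation rule, so the code assumes standard federated averaging (FedAvg) weighted by local sample counts. Flat parameter lists stand in for the neural FF controller's weights, and the function name is hypothetical.

```python
def federated_average(client_params, client_sizes):
    """Weighted average of client parameter vectors (FedAvg-style aggregation).

    client_params: list of equally long parameter lists, one per client
    client_sizes:  number of local training samples per client
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(size * params[i] for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical vehicles with different amounts of local data.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full training loop the server would broadcast `global_params` back to the clients, each of which continues training on its own data before the next aggregation round.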
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Vision Language Models (VLMs) have demonstrated remarkable potential in multimodal reasoning, yet they inherently suffer from spatial blindness and logical hallucinations when interpreting densely structured engineering content, such as analog circuit schematics. To address these challenges, we propose a Vision Language Model-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing (VLM-CAD) designed for robust, step-by-step reasoning over multimodal evidence. VLM-CAD bridges the modality gap by integrating a neuro-symbolic structural parsing module, Image2Net, which transforms raw pixels into explicit topological graphs and structured JSON representations to anchor VLM interpretation in deterministic facts. To ensure the reliability required for engineering decisions, we further propose ExTuRBO, an Explainable Trust Region Bayesian Optimization method. ExTuRBO serves as an explainable grounding engine, employing agent-generated semantic seeds to warm-start local searches and utilizing Automatic Relevance Determination to provide quantified evidence for the VLM's decisions. Experimental results on two complex circuit benchmarks demonstrate that VLM-CAD significantly enhances spatial reasoning accuracy and maintains physics-based explainability. VLM-CAD consistently satisfies complex specification requirements while achieving low power consumption, with a total runtime under 66 minutes, marking a significant step toward robust, explainable multimodal reasoning in specialized technical domains.
comment: submitted to the 34th ACM International Conference on Multimedia (ACMMM 2026)
Evidence-Decision-Feedback: Theory-Driven Adaptive Scaffolding for LLM Agents
Multi-agent LLM architectures offer opportunities for pedagogical agents to help students construct domain knowledge and develop critical-thinking skills, yet many operate on a "one-size-fits-all" basis, limiting their ability to provide personalized support. To address this, we introduce Evidence-Decision-Feedback (EDF), a theoretical framework for adaptive scaffolding using LLMs. EDF integrates elements of intelligent tutoring systems and agentic behavior by organizing interactions around evidentiary inference, pedagogical decision-making, and adaptive feedback. We instantiate EDF through Copa, an agentic collaborative peer agent for STEM+C problem-solving. In an authentic high school classroom study, we show that EDF-guided interactions align feedback with students' demonstrated understanding and task mastery; promote gradual scaffold fading; and support interpretable, evidence-grounded explanations without fostering overreliance.
comment: Accepted as a long paper to the 27th International Conference on AI in Education (AIED26)
Towards Intelligent Geospatial Data Discovery: a knowledge graph-driven multi-agent framework powered by large language models
The rapid growth in the volume, variety, and velocity of geospatial data has created data ecosystems that are highly distributed, heterogeneous, and semantically inconsistent. Existing data catalogs, portals, and infrastructures still rely largely on keyword-based search with limited semantic support, which often fails to capture user intent and leads to weak retrieval performance. To address these challenges, this study proposes a knowledge graph-driven multi-agent framework for intelligent geospatial data discovery, powered by large language models. The framework introduces a unified geospatial metadata ontology as a semantic mediation layer to align heterogeneous metadata standards across platforms and constructs a geospatial metadata knowledge graph to explicitly model datasets and their multidimensional relationships. Building on the structured representation, the framework adopts a multi-agent collaborative architecture to perform intent parsing, knowledge graph retrieval, and answer synthesis, forming an interpretable and closed-loop discovery process from user queries to results. Results from representative use cases and performance evaluation show that the framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared with traditional systems. This study advances geospatial data discovery toward a more semantic, intent-aware, and intelligent paradigm, providing a practical foundation for next-generation intelligent and autonomous spatial data infrastructures and contributing to the broader vision of Autonomous GIS.
Dynamic Adversarial Resource Allocation: the dDAB Game
This work introduces the dynamic Defender-Attacker Blotto (dDAB) game, extending the classical static Blotto game to a dynamic resource allocation setting over graphs. In the dDAB game, a defender is required to maintain numerical superiority against attacker resources across a set of key nodes in a connected graph. The engagement unfolds as a discrete-time game, where each player reallocates its resources in turn, with resources allowed to move at most one hop per time step. The primary goal is to determine the necessary and sufficient amount of defender resources required to guarantee sustained defense, along with the corresponding strategies. To address the central challenge arising from graph-constrained resource reallocation, we conduct a reachability analysis, starting with simplified settings where attacker resources act as a single cohesive group. We then extend the framework to allow attacker resources to split and merge arbitrarily, and construct defender strategies using superposition principles. A set-based dynamic programming algorithm is developed to compute the optimal strategies, as well as the minimum amount of defender resources to ensure successful defense. The effectiveness of our approach is demonstrated through numerical simulations and hardware experiments on the Georgia Tech Robotarium platform.
comment: The first two authors contributed equally as co-first authors
Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment
The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
comment: 17 pages, 5 figures, 4 tables, 2 supplementary figures, 3 supplementary tables
Systems and Control (EESS)
Feedback Control of a Recirculating Bioreactor with Electrophoretic Removal of Inhibitory Extracellular DNA
Extracellular DNA accumulation in recirculating bioprocesses inhibits microbial growth and reduces productivity. We consider a continuous bioreactor with a recirculating loop and an electrophoretic filtration unit for selective DNA removal, and develop a feedback control framework combining online state and parameter estimation via an Unscented Kalman Filter with two control strategies: an adaptive Model Predictive Controller that jointly optimizes dilution rate and filtration activation, and a simpler bang--bang filtration policy with lookup-table dilution rate selection. Closed-loop simulations under nominal and perturbed conditions show that the MPC strategy achieves significantly higher cumulative profit while keeping DNA concentration below the inhibition threshold.
Stable Inversion of Discrete-Time Linear Periodically Time-Varying Systems via Cyclic Reformulation
Stable inverse systems for periodically time-varying plants are essential for feedforward control and iterative learning control of multirate and periodic systems, yet existing approaches either require complex-valued Floquet factors and noncausal processing or operate on a block time scale via lifting. This paper proposes a systematic method for constructing stable inverse systems for discrete-time linear periodically time-varying (LPTV) systems that avoids these limitations. The proposed approach proceeds in three steps: (i) cyclic reformulation transforms the LPTV system into an equivalent LTI representation; (ii) the inverse of the resulting LTI system is constructed using standard LTI inversion theory; and (iii) the periodically time-varying inverse matrices are recovered from the block structure of the cycled inverse through parameter extraction. For the fundamental case of relative degree zero, where the output depends directly on the current input, the inverse system is obtained as an explicit closed-form time-varying matrix expression. For systems with periodic relative degree r >= 1, the r-step-delayed inverse is similarly obtained in explicit closed form via the periodic Markov parameters. The stability of the resulting inverse system is characterized by the transmission zeros of the cycled plant, generalizing the minimum phase condition from the LTI case. Numerical examples for both relative degree zero and higher relative degree systems confirm the validity of the stability conditions and demonstrate the effectiveness of the proposed framework, including exact input reconstruction via causal real-valued inverse systems.
comment: Submitted to Automatica
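The relative-degree-zero case in the abstract above can be illustrated with a minimal sketch. It assumes a scalar plant of period 2 with hypothetical coefficients `A`, `B`, `C`, `D` (none taken from the paper) and a nonzero direct feedthrough `D[k]`. Under that assumption the inverse follows directly from the state-space data; the paper's cyclic reformulation and parameter extraction are not reproduced here.

```python
# Hypothetical scalar LPTV plant with period 2 (illustrative values only):
#   x[k+1] = A[k] x[k] + B[k] u[k],   y[k] = C[k] x[k] + D[k] u[k],   D[k] != 0
A = [0.5, -0.3]
B = [1.0, 0.5]
C = [1.0, 2.0]
D = [2.0, 1.0]
PERIOD = 2


def simulate_plant(u_seq, x0=0.0):
    """Run the periodic plant and return the output sequence."""
    x, ys = x0, []
    for k, u in enumerate(u_seq):
        i = k % PERIOD
        ys.append(C[i] * x + D[i] * u)
        x = A[i] * x + B[i] * u
    return ys


def simulate_inverse(y_seq, x0=0.0):
    """Causal, real-valued inverse for relative degree zero:
         u[k]   = (y[k] - C[k] x[k]) / D[k]
         x[k+1] = (A[k] - B[k] C[k] / D[k]) x[k] + (B[k] / D[k]) y[k]
    """
    x, us = x0, []
    for k, y in enumerate(y_seq):
        i = k % PERIOD
        us.append((y - C[i] * x) / D[i])
        x = (A[i] - B[i] * C[i] / D[i]) * x + (B[i] / D[i]) * y
    return us


u_in = [1.0, -2.0, 0.5, 3.0, 0.0, -1.0]
u_rec = simulate_inverse(simulate_plant(u_in))
```

With these values the inverse dynamics `A[k] - B[k] C[k] / D[k]` equal 0 and -1.3 over one period, so their product is zero and the inverse is stable. `u_rec` reproduces `u_in` up to floating-point rounding.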
Optimal Control of Switched Systems Governed by Logical Switching Dynamics
This paper investigates the optimal co-design of logical and continuous controls for switched linear systems governed by controlled logical switching dynamics. Unlike traditional switched systems with arbitrary or state-dependent switching, the switching signals here are generated by an internal logical dynamical system and explicitly integrated into the control synthesis. By leveraging the semi-tensor product (STP) of matrices, we embed the coupled logical and continuous dynamics into a unified algebraic state-space representation, transforming the co-design problem into a tractable linear-quadratic framework. We derive Riccati-type backward recursions for both deterministic and stochastic logical dynamics, which yield optimal state-feedback laws for continuous control alongside value-function-based, state-dependent decision rules for logical switching. To mitigate the combinatorial explosion inherent in logical decision-making, a hierarchical algorithm is developed to decouple offline precomputation from efficient online execution. Numerical simulations demonstrate the efficacy of the proposed framework.
comment: 26 pages, 3 figures
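To make the phrase "Riccati-type backward recursions" concrete, here is a minimal sketch of the standard finite-horizon recursion for a scalar linear-quadratic problem. It deliberately omits the logical switching layer and the semi-tensor product embedding, so it is the classical building block rather than the paper's algorithm. The function name and parameters are hypothetical.

```python
def riccati_backward(a, b, q, r, q_final, horizon):
    """Backward recursion for x[k+1] = a x[k] + b u[k] with stage cost
    q x^2 + r u^2 and terminal cost q_final x^2.
    Returns cost-to-go weights P[0..horizon] and gains K[0..horizon-1],
    where the optimal feedback is u[k] = -K[k] x[k]."""
    P = [0.0] * (horizon + 1)
    K = [0.0] * horizon
    P[horizon] = q_final
    for k in range(horizon - 1, -1, -1):
        p_next = P[k + 1]
        K[k] = a * b * p_next / (r + b * b * p_next)
        P[k] = q + a * a * p_next - a * b * p_next * K[k]
    return P, K


P, K = riccati_backward(a=1.0, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=60)
```

For `a = b = q = r = 1` the stationary solution satisfies `p**2 - p - 1 = 0`, so `P[0]` approaches the golden ratio for long horizons.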
Power System Studies Using Open-Access Software
The use of open-access software is an option that can be considered by those interested in power system studies. In addition, the combination of two or more of these tools can expand the capabilities and the fields of application of each tool. This paper proposes the implementation of a flexible and powerful simulation environment based on R/RStudio for carrying out power system studies. Several simple case studies are presented to show how the combination of either EMTP/ATP or OpenDSS with R/RStudio can expand the capabilities of each of these tools for performing either steady-state or transient power system studies. The proposed environment uses RStudio as a control center from which each simulation tool (e.g., R, ATP, OpenDSS) can be run. Some procedures for generating information that must be exchanged between RStudio and ATP or RStudio and OpenDSS have been implemented. Such exchanges are bidirectional: ATP and OpenDSS produce simulation results that can be read by RStudio (text files in the case of ATP, comma-separated value (CSV) and text files in the case of OpenDSS), while RStudio capabilities are used to generate files that are embedded into the input file to be read by either ATP or OpenDSS. This latter option can be used to change either the configuration or some parameters of the test system under study. Finally, one notable option illustrated in this paper is the possibility of using machine learning algorithms to predict the performance of the test system.
comment: 55 pages, 57 figures
Rao-Blackwellized Stein Gradient Descent for Joint State-Parameter Estimation
We present a filtering framework for online joint state estimation and parameter identification in nonlinear, time-varying systems. The algorithm uses a Rao-Blackwellization technique to infer joint state-parameter posteriors efficiently. In particular, conditional state distributions are computed analytically via Kalman filtering, while model parameters, including process and measurement noise covariances, are approximated using particle-based Stein Variational Gradient Descent (SVGD), enabling stable real-time inference. We prove a theoretical consistency result by bounding the impact of the SVGD-approximated parameter posterior on state estimates, relating the divergence between the true and approximate parameter posteriors to the total variation distance between the resulting state marginals. Performance of the proposed filter is validated on two case studies: a bioreactor with Haldane kinetics and a neural-network-augmented dynamic system. The latter demonstrates the filter's capacity for online neural network training within a dynamical model, showcasing its potential for fully adaptive, data-driven system identification.
comment: 11 pages, 5 figures. Preprint submitted to IEEE Transactions on Automatic Control
JanusBM: A Dual-Fidelity Multi-Zone White-Box Building Modeling Framework
Accurate building energy models are crucial for analyzing sector-coupled energy systems, where buildings interact with electrified heating, energy storage, and advanced control across various scenarios. High-fidelity (HiFi) white-box models that resolve hydronic distribution and emitter dynamics can capture short-term transients, yet their numerical stiffness and computational burden limit long-term simulations and large-scale scenario exploration. Conversely, reduced-order low-fidelity (LoFi) representations enable rapid annual assessments but may fail to capture the hydronic- and control-induced dynamics that govern transient and peak behavior. This paper proposes a dual-fidelity, multi-zone white-box building modeling framework called JanusBM, built on a novel topology-driven modeling tool, RoomFlex6D, coupling a HiFi hydronic model and a LoFi ideal-load surrogate that removes explicit hydronic states in Modelica. To ensure applicability and physical consistency across time scales, we introduce a two-stage hybrid validation and calibration pipeline that uses complementary data: the IEA EBC Annex 60 benchmark for energy-scale validation and time-series measurements from real-world experimental buildings for hydronic dynamics-scale calibration. Results show that the generated LoFi models achieve a high degree of consistency with the Annex 60 benchmark on the energy scale, and the proposed calibration workflow robustly improves loop-level return water temperature transients and zone-level temperature dynamics. Moreover, the LoFi model achieves orders-of-magnitude faster simulations suited to annual energy analyses, whereas the HiFi model becomes necessary when the required heat differs from the actual delivered heat due to distribution and control limitations, especially in transient and peak-oriented assessments.
Design Guidelines for Nonlinear Kalman Filters via Covariance Compensation
Nonlinear extensions of the Kalman filter (KF), such as the extended Kalman filter (EKF) and the unscented Kalman filter (UKF), are indispensable for state estimation in complex dynamical systems, yet the conditions for a nonlinear KF to provide robust and accurate estimations remain poorly understood. This work proposes a theoretical framework that identifies the causes of failure and success in certain nonlinear KFs and establishes guidelines for their improvement. Central to our framework is the concept of covariance compensation: the deviation between the covariance predicted by a nonlinear KF and that of the EKF. With this definition and detailed theoretical analysis, we derive three design guidelines for nonlinear KFs: (i) invariance under orthogonal transformations, (ii) sufficient covariance compensation beyond the EKF baseline, and (iii) selection of compensation magnitude that favors underconfidence. Both theoretical analysis and empirical validation confirm that adherence to these principles significantly improves estimation accuracy, whereas fixed parameter choices commonly adopted in the literature are often suboptimal. The codes and the proofs for all the theorems in this paper are available at https://github.com/Shida-Jiang/Guidelines-for-Nonlinear-Kalman-Filters.
comment: This manuscript has been accepted by ACC 2026
Experimental Characterisation of Distributed Reactive Power Sharing under Communication-Induced Stress in Parallel Grid-Forming Inverters
Synchronisation of parallel grid-forming inverters is crucial for stable operation of future power systems. This includes accurate and robust reactive power sharing under realistic operating conditions such as impedance mismatch and communication constraints. In this work, reactive power sharing by virtue of a distributed control law is investigated under line impedance mismatch. Furthermore, robustness and transient behaviour of the proposed approach are experimentally evaluated under communication-induced stressors including a fixed 3% packet loss and communication delays ranging from 50 ms to 100 ms, artificially introduced through a software-defined overlay. The study is conducted in a low-voltage laboratory-scale microgrid comprising two parallel grid-forming inverters, an AC load, and a grid-following battery system acting as a reactive power injector. The results show reactive power sharing convergence up to 90 ms communication delay, with a stability boundary between 90 ms and 100 ms, which decreases with increasing integral gain.
Positive Observers Revisited
The paper shows that positive linear systems can be stabilized using positive Luenberger-type observers, contradicting previous conclusions. This is achieved by structuring the observer as monotonically converging upper and lower bounds on the state. Analysis of the closed-loop properties under linear observer feedback gives conditions that cover a larger class than previous observer designs. The results are applied to nonpositive systems by enforcing positivity of the dynamics using feedback from the upper bound observer. The setting is expanded to include stochastic noise, giving conditions for convergence in expectation using feedback from positive observers.
comment: Accepted for publication at the 2026 European Control Conference
Cooperative Bandit Learning in Directed Networks with Arm-Access Constraints
Sequential decision-making under uncertainty often involves multiple agents learning which actions (arms) yield the highest rewards through repeated interaction with a stochastic environment. This setting is commonly modeled by cooperative multi-agent multi-armed bandit problems, where agents explore and share information without centralized coordination. In many realistic systems, agents have heterogeneous capabilities that limit their access to subsets of arms and communicate over asymmetric networks represented by directed graphs. In this work, we study multi-agent multi-armed bandit problems with partial arm access, where agents explore and exploit only the arms available to them while exchanging information with neighbors. We propose a distributed consensus-based upper confidence bound (UCB) algorithm that accounts for both the arm accessibility structure and network asymmetry. Our approach employs a mass-preserving information mixing mechanism, ensuring that reward estimates remain unbiased across the network despite accessibility constraints and asymmetric information flow. Under standard stochastic assumptions, we establish logarithmic regret for every agent, with explicit dependence on network mixing properties and arm accessibility constraints. These results quantify how heterogeneous arm access and directed communication shape cooperative learning performance.
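As a rough illustration of the arm-access constraint described above, the sketch below restricts a UCB1-style index rule to the subset of arms an agent can reach. It is a single-agent simplification: the consensus step and the mass-preserving mixing over the directed graph are not shown, and the function names and the bonus `sqrt(2 ln t / n)` are assumptions taken from the textbook UCB1 rule, not from the paper.

```python
import math


def ucb_index(mean, count, t):
    """UCB1-style index: empirical mean plus an exploration bonus."""
    if count == 0:
        return math.inf  # arms never pulled are tried first
    return mean + math.sqrt(2.0 * math.log(t) / count)


def select_arm(means, counts, t, accessible):
    """Return the accessible arm with the largest index.
    Ties are broken in favor of the lowest arm id."""
    return max(sorted(accessible),
               key=lambda k: ucb_index(means[k], counts[k], t))


# Agent that can only reach arms 0 and 2 (hypothetical access set).
choice = select_arm([0.5, 0.9, 0.7], [10, 10, 10], t=30, accessible={0, 2})
```

Arm 1 has the highest estimate here, but the agent never selects it because it lies outside the access set; this is the situation in which shared information from neighbors becomes useful.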
Secure Two-Party Matrix Multiplication from Lattices and Its Application to Encrypted Control
In this study, we propose a two-party computation protocol for approximate matrix multiplication of fixed-point numbers. The proposed protocol is provably secure under standard lattice-based cryptographic assumptions and enables matrix multiplication at a desired approximation level within a single round of communication. We demonstrate the feasibility of the protocol by applying it to the secure implementation of a linear control law. Our evaluation reveals that the client achieves lower online computational complexity compared to the original controller computation, while ensuring the privacy of controller inputs, outputs, and parameters. Furthermore, a numerical example confirms that the proposed method maintains sufficient precision of control inputs even in the presence of approximation and quantization errors.
comment: 6 pages, 3 figures
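The fixed-point arithmetic that such a protocol operates on can be sketched without any cryptography. The example below only shows the plaintext encode, integer multiply, and decode steps, assuming a hypothetical scaling of `2**frac_bits`; the lattice-based two-party protocol itself is not reproduced, and all names are illustrative.

```python
def encode(matrix, frac_bits):
    """Quantize real entries to integers with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return [[round(v * scale) for v in row] for row in matrix]


def int_matmul(X, Y):
    """Exact integer matrix product (the operation a protocol would secure)."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(X))]


def decode_product(Z, frac_bits):
    """A product of two encoded matrices carries the scale twice."""
    scale = 1 << (2 * frac_bits)
    return [[v / scale for v in row] for row in Z]


# Hypothetical linear control law u = K x evaluated in fixed point.
K = [[0.5, -1.25], [2.0, 0.75]]
x = [[1.5], [-0.25]]
u = decode_product(int_matmul(encode(K, 8), encode(x, 8)), 8)
```

All entries above are exactly representable with 8 fractional bits, so `u` matches the floating-point product; for general inputs the rounding in `encode` introduces the quantization error that the abstract refers to.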
Equivalence of Finite- and Fixed-time Stability to Asymptotic Stability
In this paper, we present new results on finite- and fixed-time convergence for dynamical systems using LaSalle-like invariance principles. In particular, we provide first- and second-order non-smooth Lyapunov-like results for finite- and fixed-time convergence, thereby relaxing the requirement of the existence of a differentiable, positive definite Lyapunov function. Based on these findings, we show that a dynamical system whose equilibrium point is globally asymptotically stable can be modified through scaling so that the resulting dynamical system has a fixed-time stable equilibrium point. The results in this paper expand our understanding of various convergence rates and strengthen the hypothesis that all the convergence rates are interconnected through a suitable transformation.
comment: Currently under review at an IEEE Conference
Distributed Hybrid Feedback for Global Pose Synchronization of Multiple Rigid Body Systems on $SE(3)$
This paper investigates the problem of pose synchronization for multiple rigid body systems evolving on the matrix Lie group $SE(3)$. We propose a distributed hybrid feedback control scheme with global asymptotic stability guarantees using relative pose and group velocity measurements. The key idea consists of constructing a new potential function on $SE(3) \times \mathbb{R}$ with a generalized non-diagonal weighting matrix, and a set of auxiliary scalar variables with continuous-discrete hybrid dynamics. Based on the new potential function and the auxiliary scalar variables, a geometric distributed hybrid feedback designed directly on $SE(3)$ is proposed to achieve global pose synchronization. Numerical simulation results are presented to illustrate the performance of the proposed distributed hybrid control scheme.
comment: 8 pages, 2 figures
Fleet-Level Battery-Health-Aware Scheduling for Autonomous Mobile Robots
Autonomous mobile robot fleets must coordinate task allocation and charging under limited shared resources, yet most battery-aware planning methods address only a single robot. This paper extends degradation-cost-aware task planning to a multi-robot setting by jointly optimizing task assignment, service sequencing, optional charging decisions, charging mode selection, and charger access while balancing degradation across the fleet. The formulation relies on reduced-form degradation proxies grounded in the empirical battery aging literature, capturing both charging-mode-dependent wear and idle state-of-charge-dependent aging; the bilinear idle aging term is linearized through a disaggregated piecewise McCormick formulation. Tight big-M values derived from instance data strengthen the LP relaxation. To manage scalability, we propose a hierarchical matheuristic in which a fleet-level master problem coordinates assignments, routes, and charger usage, while robot-level subproblems, whose integer part decomposes into trivially small independent partition selection problems, compute route-conditioned degradation schedules. Systematic experiments compare the proposed method against three baselines: a rule-based nearest-available dispatcher, an energy-aware formulation that enforces battery feasibility without modeling degradation, and a charger-unaware formulation that accounts for degradation but ignores shared charger capacity limits.
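The idea behind a piecewise McCormick relaxation of a bilinear term `w = x * y` can be sketched numerically. The code below evaluates the four standard McCormick inequalities at a given point and shows how partitioning the range of `x` tightens them. It is an illustration of the general technique with hypothetical variable names and breakpoints, not the disaggregated mixed-integer formulation used in the paper.

```python
def mccormick_bounds(x, y, xl, xu, yl, yu):
    """Lower and upper McCormick bounds on x*y for x in [xl, xu], y in [yl, yu]."""
    lower = max(xl * y + x * yl - xl * yl,
                xu * y + x * yu - xu * yu)
    upper = min(xu * y + x * yl - xu * yl,
                xl * y + x * yu - xl * yu)
    return lower, upper


def piecewise_mccormick_bounds(x, y, breakpoints, yl, yu):
    """Apply the envelope on the segment of the x-partition that contains x."""
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        if lo <= x <= hi:
            return mccormick_bounds(x, y, lo, hi, yl, yu)
    raise ValueError("x lies outside the partitioned range")


coarse = mccormick_bounds(0.5, 0.5, 0.0, 1.0, 0.0, 1.0)
fine = piecewise_mccormick_bounds(0.5, 0.5, [0.0, 0.4, 0.6, 1.0], 0.0, 1.0)
```

In this example the single-interval envelope only brackets the true product 0.25 within roughly [0, 0.5], while the three-segment partition narrows it to roughly [0.2, 0.3]. In an optimization model the segment choice would be made by binary variables rather than by the lookup used here.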
Optimal filtering for a giant cavity in waveguide QED systems
In waveguide quantum electrodynamics (QED) systems, a giant cavity can be engineered to interact with quantum fields through multiple distant coupling points, so that its non-Markovian dynamics differ markedly from those of traditional quantum optical cavity systems. Towards feedback control of this system, this paper designs an optimal filter for the giant cavity system to estimate its state evolution under continuous quantum measurements. First, the Langevin equation in the Heisenberg picture is derived, which is a linear continuous-time system with both state and input delays resulting from the unconventional distant couplings. Compared to existing modeling approaches, this formulation effectively preserves the nonlocal coupling and multiple-delay dynamic characteristics inherent in the original system. In particular, the presence of coupling and propagation delays leads to noncommutativity among the system operators at different times, which prevents the direct application of existing quantum filtering methods. To address this issue, an optimal filter is designed, in which the delayed-state covariance matrices are computed. By iteratively evaluating the delayed-state covariance over successive time intervals, the resulting optimal filter can be implemented in an interval-wise backward recursion algorithm. Finally, numerical simulations are conducted to evaluate the tracking performance of the proposed optimal filter for the giant cavity. The effectiveness of the optimal filter is validated by comparing the evolutions of the Wigner functions of coherent and cat states with those of the filter.
comment: 11 pages, 4 figures
Universal Formula Families for Safe Stabilization of Single-Input Nonlinear Systems
We develop an optimization-free framework for safe stabilization of single-input control-affine nonlinear systems with a given control Lyapunov function (CLF) and a given control barrier function (CBF), where the desired equilibrium lies in the interior of the safe set. An explicit compatibility condition is derived that is necessary and sufficient for the pointwise simultaneous satisfaction of the CLF and CBF inequalities. When this condition holds, two closed-form continuous state-feedback laws are constructed from the Lie-derivative data of the CLF and CBF via standard universal stabilizer formulas, yielding asymptotic stabilization of the origin and forward invariance of the interior of the safe set, without online quadratic programming. The two laws belong to broader families parametrized by a free nondecreasing function, providing additional design flexibility. When the compatibility condition fails, a safety-prioritizing modification preserves forward invariance and drives the state toward the safe-set boundary until a compatible region is reached, whereupon continuity at the origin and asymptotic stabilization are recovered. The framework produces families of explicit constructive alternatives to CLF-CBF quadratic programming for scalar-input nonlinear systems.
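For readers unfamiliar with "universal stabilizer formulas", the sketch below shows Sontag's classical formula for the CLF part alone, using the Lie-derivative data `a = L_f V(x)` and `b = L_g V(x)`. It assumes a hypothetical scalar system `dx/dt = x + u` with `V(x) = x**2 / 2`, and it does not include the CBF condition or the compatibility test developed in the paper.

```python
import math


def sontag_control(a, b):
    """Sontag's universal formula for a single-input control-affine system.
    a = L_f V(x), b = L_g V(x). Returns 0 where b vanishes."""
    if b == 0.0:
        return 0.0
    return -(a + math.sqrt(a * a + b ** 4)) / b


def lie_derivatives(x):
    """Hypothetical example: f(x) = x, g(x) = 1, V(x) = x^2 / 2."""
    return x * x, x  # (L_f V, L_g V)


def v_dot(x):
    """Closed-loop derivative of V: L_f V + L_g V * u."""
    a, b = lie_derivatives(x)
    return a + b * sontag_control(a, b)
```

Substituting the formula gives `v_dot(x) = -sqrt(a**2 + b**4)`, which for this example equals `-sqrt(2) * x**2` and is negative away from the origin, with no quadratic program solved online.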
Explicit Model Predictive Control with Quantum Encryption
This paper studies quantum-encrypted explicit MPC for constrained discrete-time linear systems in a cloud-based architecture. A finite-horizon quadratic MPC problem is solved offline to obtain a piecewise-affine controller. Shared quantum keys generated from Bell pairs and protected by quantum key distribution are used to encrypt the online control evaluation between the sensor and actuator. Based on this architecture, we develop a lightweight encrypted explicit MPC protocol, prove exact recovery of the plaintext control action, and characterize its computational efficiency. Numerical results demonstrate lower online complexity than classical encrypted MPC, while security is discussed in terms of confidentiality of plant data and control inputs.
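The online step of explicit MPC reduces to locating the current state in a stored partition and applying the affine law attached to that region. A minimal plaintext sketch for a one-dimensional state follows; the regions, gains, and offsets are hypothetical, and the quantum-key encryption layer described in the abstract is not modeled.

```python
import math

# Hypothetical 1-D explicit control law: (lower, upper, gain F, offset g),
# giving u = F * x + g on each interval. Values are illustrative only.
REGIONS = [
    (-math.inf, -1.0, 0.0, 1.0),   # input saturated at +1
    (-1.0, 1.0, -1.0, 0.0),        # unconstrained linear feedback
    (1.0, math.inf, 0.0, -1.0),    # input saturated at -1
]


def explicit_mpc(x, regions=REGIONS):
    """Point location followed by affine evaluation."""
    for lower, upper, gain, offset in regions:
        if lower <= x <= upper:
            return gain * x + offset
    raise ValueError("state outside the stored partition")


u_inside = explicit_mpc(0.5)
u_saturated = explicit_mpc(3.0)
```

The law is continuous across region boundaries, so it does not matter which of two adjacent regions is matched at `x = -1` or `x = 1`.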
Index-Based Scheduling for a Resource-Constrained Quantum Switch
We consider a quantum switch with a finite number of quantum memory registers that aims to serve multipartite entanglement requests among $N$ users. We propose scheduling policies that aim to optimize the average number of requests served per unit time by efficiently utilizing the switch's available memory. To measure the performance of the scheduling policies, we employ the newly introduced metric of age of entanglement establishment (AoEE). We formulate the scheduling problem in a restless multi-armed bandit (RMAB) framework. We show that the scheduling of entanglement requests is indexable. Subsequently, we find a closed-form expression of the Whittle index for all possible request-age pairs. By modeling the Whittle index of each request as its reward and its cardinality as its cost, we formulate the memory-constrained scheduling problem as a $0$-$1$ knapsack problem and solve it via dynamic programming. Furthermore, we consider two low-complexity sequential greedy policies that leverage two different modified Whittle indices.
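The knapsack step described above can be sketched as a textbook dynamic program, with each request's index as its reward and its cardinality (number of memory registers needed) as its cost. The index values below are made up for illustration; computing the actual Whittle indices is outside the scope of this sketch.

```python
def knapsack_schedule(indices, sizes, memory):
    """0-1 knapsack by dynamic programming.
    indices[i]: reward of request i (e.g. its Whittle index, assumed given)
    sizes[i]:   registers needed by request i (positive integers)
    memory:     total registers available
    Returns (best total reward, sorted list of selected request ids)."""
    n = len(indices)
    best = [[0.0] * (memory + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = indices[i - 1], sizes[i - 1]
        for cap in range(memory + 1):
            best[i][cap] = best[i - 1][cap]
            if weight <= cap and best[i - 1][cap - weight] + value > best[i][cap]:
                best[i][cap] = best[i - 1][cap - weight] + value
    # Backtrack to recover which requests were selected.
    chosen, cap = [], memory
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= sizes[i - 1]
    return best[n][memory], sorted(chosen)


total, served = knapsack_schedule([3.0, 4.0, 5.0, 6.0], [2, 3, 4, 5], memory=5)
```

Here serving requests 0 and 1 together (total index 7.0) beats serving the single largest request 3 (index 6.0), even though both options use all five registers.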
Bridging the numerical-physical gap in acoustic holography via end-to-end differentiable structural optimization
Acoustic holography provides a practical means of flexibly controlling acoustic wavefronts. However, high-fidelity shaping of acoustic fields remains constrained by the numerical-physical gap inherent in conventional phase-only designs. These approaches realize a two-dimensional phase-delay profile as a three-dimensional thickness-varying lens, while neglecting wave-matter interactions arising from the lens structure. Here, we introduce an end-to-end, physics-aware differentiable structural optimization framework that directly incorporates three-dimensional lens geometries into the acoustic simulation and optimization loop. Using a novel differentiable relaxation, termed Differentiable Hologram Lens Approximation (DHLA), the lens geometry is treated as a differentiable design variable, ensuring intrinsic consistency between numerical design and physical realization. The resulting Thickness-Only Acoustic Holograms (TOAHs) significantly outperform state-of-the-art phase-only acoustic holograms (POAHs) in field reconstruction fidelity and precision under complex conditions. We further demonstrate the application of the framework to spatially selective neuromodulation in a neuropathic pain mouse model, highlighting its potential for non-invasive transcranial neuromodulation. In summary, by reconciling numerical design with physical realization, this work establishes a robust strategy for high-fidelity acoustic wavefront shaping in complex environments.
Statistical Efficiency of Single- and Multi-step Models for Forecasting and Control
Compounding error, where small prediction mistakes accumulate over time, presents a major challenge in learning-based control. A common remedy is to train multi-step predictors directly instead of rolling out single-step models. However, it is unclear when the benefits of multi-step predictors outweigh the difficulty of learning a more complex model. We provide the first quantitative analysis of this trade-off for linear dynamical systems. We study three predictor classes: (i) single step models, (ii) multi-step models, and (iii) single step models trained with multi-step losses. We show that when the model class is well-specified and accurately captures the system dynamics, single-step models achieve the lowest asymptotic prediction error. On the other hand, when the model class is misspecified due to partial observability, direct multi-step predictors can significantly reduce bias and improve accuracy. We provide theoretical and empirical evidence that these trade-offs persist when predictors are used in closed-loop control.
comment: arXiv admin note: substantial text overlap with arXiv:2504.01766
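Compounding error in a rolled-out single-step model can be seen in a toy scalar example. The sketch assumes a true system `x[t+1] = a_true * x[t]` and a learned coefficient `a_hat` that is slightly off; both values are hypothetical and the example makes no claim about the paper's statistical results.

```python
def rollout_prediction(a_hat, x0, steps):
    """Iterate a learned single-step model x[t+1] = a_hat * x[t]."""
    x = x0
    for _ in range(steps):
        x = a_hat * x
    return x


def relative_error(a_true, a_hat, x0, steps):
    """Relative error of the rolled-out prediction after `steps` steps."""
    truth = a_true ** steps * x0
    return abs(rollout_prediction(a_hat, x0, steps) - truth) / abs(truth)


err_1 = relative_error(0.9, 0.95, 1.0, steps=1)
err_10 = relative_error(0.9, 0.95, 1.0, steps=10)
```

The relative error equals `(a_hat / a_true) ** steps - 1`, so a one-step error of about 5.6% grows to roughly 72% after ten steps. A direct multi-step predictor would instead fit the ten-step map as a separate parameter, which is the trade-off the abstract analyzes.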
Information-Driven Active Perception for k-step Predictive Safety Monitoring
This work studies the synthesis of active perception policies for predictive safety monitoring in partially observable stochastic systems. Operating under strict sensing and communication budgets, the proposed monitor dynamically schedules sensor queries to maximize information gain about the safety of future states. The underlying stochastic dynamics are captured by a labeled hidden Markov model (HMM), with safety requirements defined by a deterministic finite automaton (DFA). To enable active information acquisition, we introduce minimizing k-step Shannon conditional entropy of the safety of future states as a planning objective, under the constraint of a limited sensor query budget. Using observable operators, we derive an efficient algorithm to compute the k-step conditional entropy and analyze key properties of the conditional entropy gradient with respect to policy parameters. We validate the effectiveness of the method for predictive safety monitoring through a dynamic congestion game example.
comment: 6 pages, 6 figures, 1 table, submitted to IEEE L-CSS
Self-Supervised Graph Neural Networks for Optimal Substation Reconfiguration
Changing the transmission system topology is an efficient and costless lever to reduce congestion or increase exchange capacities. The problem of finding the optimal switch states within substations is called Optimal Substation Reconfiguration (OSR), and may be framed as a Mixed Integer Linear Program (MILP). Current state-of-the-art optimization techniques come with prohibitive computing times, making them impractical for real-time decision-making. Meanwhile, deep learning offers a promising perspective with drastically smaller computing times, at the price of an expensive training phase and the absence of optimality guarantees. In this work, we frame OSR as an Amortized Optimization problem, where a Graph Neural Network (GNN) model (our data being graphs) is trained in a self-supervised way to improve the objective function. We apply our approach to the maximization of the exchange capacity between two areas of a small-scale 12-substation system. Once trained, our GNN model improves the exchange capacity by 10.2% on average compared to the all-connected configuration, while a classical MILP solver reaches an average improvement of 15.2% with orders-of-magnitude larger computing times.
WAKE-NET: 3D-Wake-Aware Turbine Layout and Cabling Optimization Framework of Multi-Hub-Height Wind Farms for Grid-Scale and Industrial Power Systems
The global transition towards renewable energy has accelerated the deployment of utility-scale wind farms, increasing the need for accurate performance and economic assessments. Although wind energy offers substantial potential for carbon emission reduction, investment decisions are highly sensitive to predicted annual energy production and economic profitability. Conventional wind farm analyses often estimate turbine power output based solely on incoming wind conditions, neglecting wake interactions between turbines. These wake effects can significantly reduce downstream turbine performance, leading to overestimation of energy yield and financial returns. This study proposes WAKE-NET, a wake-aware optimization framework that incorporates both turbine layout optimization and hub height diversification across turbines of varying capacities. Unlike traditional approaches that assume a uniform hub height or ignore wake dynamics, the proposed methodology accounts for wake-induced power losses in its framework. Results indicate that the benchmark model that neglects wake effects can overestimate annual profits, while the use of multiple hub heights reduces wake overlap and associated power losses. Overall, the findings demonstrate that wake-aware design and hub height diversity improve energy yield accuracy and economic viability, offering valuable guidance for wind farm developers and investors seeking to invest in renewable energy systems.
Robust and Interpretable Graph Neural Networks for Power Systems State Estimation
This study analyzes Graph Neural Networks (GNNs) for distribution system state estimation (DSSE) by employing an interpretable Graph Neural Additive Network (GNAN) and by utilizing an edge-conditioned message-passing mechanism. The architectures are benchmarked against the standard Graph Attention Network (GAT) architecture. Multiple SimBench grids with topology changes and various measurement penetration rates were used to evaluate performance. Empirically, GNAN trails GAT in accuracy but serves as a useful probe for graph learning when accompanied by the proposed edge attention mechanism. Together, they demonstrate that incorporating information from distant nodes could improve learning depending on the grid topology and available data. This study advances the state-of-the-art understanding of learning on graphs for the state estimation task and contributes toward reliable GNN-based DSSE prediction technologies.
Time-Delay Systems with Discrete and Distributed delays: Discontinuous Initial Conditions and Reachability Sets
Time-invariant finite-dimensional systems, under reasonable continuity assumptions, exhibit the property that if solutions exist for all future times, the set of vectors reachable from a bounded set of initial conditions over bounded time intervals is also bounded. This property can be summarized as follows: forward completeness implies bounded reachability sets. By contrast, this property does not necessarily hold for infinite-dimensional systems in general, and time-delay systems in particular. Sufficient conditions for this property to hold that can be directly tested on the function defining the system dynamics are only known in the case of systems with pointwise (or discrete) delays. This paper develops novel sufficient conditions for the boundedness of the reachability sets of time-delay systems involving mixed pointwise and distributed delays. Broad classes of systems satisfying these conditions are identified.
comment: Submitted to IEEE Transactions on Automatic Control
Underdetermined Library-aided Impedance Estimation with Terminal Smart Meter Data
Smart meters provide relevant information for impedance identification, but they lack global phase alignment and internal network nodes are often unobserved. A few methods for this setting have been developed, but they have requirements on data correlation and/or network topology. In this paper, we offer a unifying view of data- and structure-driven identifiability issues, and use this groundwork to propose a method for underdetermined impedance identification. The method can handle intrinsically ambiguous topologies and data; its output is not necessarily a single estimate, but instead a collection of data-compatible impedance assignments. It uses a library of plausible commercial cable types as a prior to refine the solutions, and we show how it can support topology identification workflows built around known georeferenced joints without degree guarantees. The method depends on a small number of non-sensitive parameters and achieves high identification performance on a sizeable benchmark case even with small injection/voltage datasets. We identify key steps that can be accelerated via GPU-based parallelization. Finally, we assess the tolerance of the identification to noisy input.
Scalable Impedance Identification of Diverse IBRs via Cluster-Specialized Neural Networks
Modern machine learning approaches typically identify the impedance of a single inverter-based resource (IBR) and assume similar impedance characteristics across devices. In modern power systems, however, IBRs will employ diverse control topologies and algorithms, leading to highly heterogeneous impedance behaviors. Training one model per IBR is inefficient and does not scale. This paper proposes a scalable impedance identification framework for diverse IBRs via cluster-specialized neural networks. First, the dataset is partitioned into multiple clusters with similar feature profiles using the K-means clustering method. Then, each cluster is assigned a specialized feed-forward neural network (FNN) tailored to its characteristics, improving both accuracy and computational efficiency. In deployment, only a small number of measurements are required to predict impedance over a wide range of operating points. The framework is validated on six IBRs with varying control bandwidths, control structures, and operating conditions, and further tested on a previously unseen IBR using only ten measurement points. The results demonstrate high accuracy in both the clustering and prediction stages, confirming the effectiveness and scalability of the proposed method.
comment: This paper is accepted for presentation at IEEE PES General Meeting (PESGM) 2026. All the resources can be found here: https://github.com/ManhqhUMich12/Scalable-Impedance-Identification-of-Diverse-IBRs-via-Cluster-Specialized-Neural-Networ
Privacy-Aware Smart Cameras: View Coverage via Socially Responsible Coordination
Coordination of view coverage via privacy-aware smart cameras is key to a more socially responsible urban intelligence. Rather than maximizing view coverage at any cost or over-relying on expensive cryptographic techniques, we address how cameras can coordinate to legitimately monitor public spaces while excluding privacy-sensitive regions by design. This article proposes a decentralized framework in which interactive smart cameras coordinate to autonomously select their orientation via collective learning, while eliminating privacy violations via soft and hard constraint satisfaction. The approach scales from hundreds up to thousands of cameras without any centralized control. Experimental evidence shows 18.42% higher coverage efficiency and 85.53% lower privacy violation than baselines and other state-of-the-art approaches. This significant advance further yields practical guidelines for operators and policymakers: how the field of view, spatial placement, and budget of cameras operated by ethically-aligned artificial intelligence jointly influence coverage efficiency and privacy protection in large-scale and sensitive urban environments.
comment: This work has been submitted to the IEEE for possible publication
Path Planning and Reinforcement Learning-Driven Control of On-Orbit Free-Flying Multi-Arm Robots
This paper presents a hybrid approach that integrates trajectory optimization (TO) and reinforcement learning (RL) for motion planning and control of free-flying multi-arm robots in on-orbit servicing scenarios. The proposed system integrates TO for generating feasible, efficient paths while accounting for dynamic and kinematic constraints, and RL for adaptive trajectory tracking under uncertainties. The multi-arm robot design, equipped with thrusters for precise body control, enables redundancy and stability in complex space operations. TO optimizes arm motions and thruster forces, reducing reliance on the arms for stabilization and enhancing maneuverability. RL further refines this by leveraging model-free control to adapt to dynamic interactions and disturbances. The experimental results validated through comprehensive simulations demonstrate the effectiveness and robustness of the proposed hybrid approach. Two case studies are explored: surface motion with initial contact and a free-floating scenario requiring surface approximation. In both cases, the hybrid method outperforms traditional strategies. In particular, the thrusters notably enhance motion smoothness, safety, and operational efficiency. The RL policy effectively tracks TO-generated trajectories, handling high-dimensional action spaces and dynamic mismatches. This integration of TO and RL combines the strengths of precise, task-specific planning with robust adaptability, ensuring high performance in the uncertain and dynamic conditions characteristic of space environments. By addressing challenges such as motion coupling, environmental disturbances, and dynamic control requirements, this framework establishes a strong foundation for advancing the autonomy and effectiveness of space robotic systems.
comment: Accepted for publication in The International Journal of Robotics Research (23-Mar-2026)
Human-in-the-Loop Pareto Optimization: Trade-off Characterization for Assist-as-Needed Training and Performance Evaluation
During human motor skill training and physical rehabilitation, there is an inherent trade-off between task difficulty and user performance. Characterizing this trade-off is crucial for evaluating user performance, designing assist-as-needed (AAN) protocols, and assessing the efficacy of training protocols. In this study, we propose a novel human-in-the-loop (HiL) Pareto optimization approach to characterize the trade-off between task performance and the perceived challenge level of motor learning or rehabilitation tasks. We adapt Bayesian multi-criteria optimization to systematically and efficiently perform HiL Pareto characterizations. Our HiL optimization employs a hybrid model that measures performance with a quantitative metric, while the perceived challenge level is captured with a qualitative metric. We demonstrate the feasibility of the proposed HiL Pareto characterization through a user study. Furthermore, we present the utility of the framework through three use cases in the context of a manual skill training task with haptic feedback. First, we demonstrate how the characterized trade-off can be used to design a sample AAN training protocol for a motor learning task and to evaluate the group-level efficacy of the proposed AAN protocol relative to a baseline adaptive assistance protocol. Second, we demonstrate that individual-level comparisons of the trade-offs characterized before and after the training session enable fair evaluation of training progress under different assistance levels. This evaluation method is more general than standard performance evaluations, as it can provide insights even when users cannot perform the task without assistance. Third, we show that the characterized trade-offs also enable fair performance comparisons among different users, as they capture the best possible performance of each user under all feasible assistance levels.
comment: Under review for publication in IEEE Transactions on Haptics
Data-driven online control for real-time optimal economic dispatch and temperature regulation in district heating systems
District heating systems (DHSs) require coordinated economic dispatch and temperature regulation under uncertain operating conditions. Existing DHS operation strategies often rely on disturbance forecasts and nominal models, so their economic and thermal performance may degrade when predictive information or model knowledge is inaccurate. This paper develops a data-driven online control framework for DHS operation by embedding steady-state economic optimality conditions into the temperature dynamics, so that the closed-loop system converges to the economically optimal operating point without relying on disturbance forecasts. Based on this formulation, we develop a Data-Enabled Policy Optimization (DeePO)-based online learning controller and incorporate Adaptive Moment Estimation (ADAM) to improve closed-loop performance. We further establish convergence and performance guarantees for the resulting closed-loop system. Simulations on an industrial-park DHS in Northern China show that the proposed method achieves stable near-optimal operation and strong empirical robustness to both static and time-varying model mismatch under practical disturbance conditions.
Engagement-Zone-Aware Input-Constrained Guidance for Safe Target Interception in Contested Environments
We address target interception in contested environments in the presence of multiple defenders whose interception capability is limited by finite ranges. Conventional methods typically impose conservative stand-off constraints based on maximum engagement distance and neglect the interceptors' actuator limitations. Instead, we formulate safety constraints using defender-induced engagement zones. To account for actuator limits, the vehicle model is augmented with input saturation dynamics. A time-varying safe-set tightening parameter is introduced to compensate for transient constraint violations induced by actuator dynamics. To ensure scalable safety enforcement in multi-defender scenarios, a smooth aggregate safety function is constructed using a log-sum-exp operator combining individual threat measures associated with each defender's capability. A smooth switching guidance strategy is then developed to coordinate interception and safety objectives. The attacker pursues the target when sufficiently distant from threat boundaries and progressively activates evasive motion as the EZ boundaries are approached. The resulting controller relies only on relative measurements and does not require knowledge of defender control inputs, thus facilitating a fully distributed and scalable implementation. Rigorous analysis provides sufficient conditions guaranteeing target interception, practical safety with respect to all defender engagement zones, and satisfaction of actuator bounds. An input-constrained guidance law based on conservative stand-off distance is also developed to quantify the conservatism of maximum-range-based safety formulations. Simulations with stationary and maneuvering defenders demonstrate that the proposed formulation yields shorter interception paths and reduced interception time compared with conventional methods while maintaining safety throughout the engagement.
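One common way to build a smooth aggregate from several per-defender safety margins is a log-sum-exp soft minimum. The sketch below assumes each margin is positive when the vehicle is outside that defender's engagement zone; the sign convention, the sharpness parameter kappa, and the function name smooth_min are illustrative assumptions and are not taken from the paper.

```python
import math

def smooth_min(margins, kappa=10.0):
    """Log-sum-exp under-approximation of min(margins).

    Computes -(1/kappa) * log(sum(exp(-kappa * h_i))), which satisfies
    min(h) - log(N)/kappa <= result <= min(h) for N margins.
    """
    m = min(margins)
    # Shift by the true minimum for numerical stability.
    s = sum(math.exp(-kappa * (h - m)) for h in margins)
    return m - math.log(s) / kappa
```

Because the aggregate never exceeds the true minimum, keeping it non-negative keeps every individual margin non-negative, and a larger kappa tightens the approximation at the price of sharper gradients.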
Utilizing Adversarial Training for Robust Voltage Control: An Adaptive Deep Reinforcement Learning Method
Adversarial training is a defense method that trains machine learning models on intentionally perturbed attack inputs, so they learn to be robust against adversarial examples. This paper develops a robust voltage control framework for distribution networks with high penetration of distributed energy resources (DERs). Conventional voltage control methods are vulnerable to strategic cyber attacks, as they typically consider only random or black-box perturbations. To address this, we formulate white-box adversarial attacks using Projected Gradient Descent (PGD) and train a deep reinforcement learning (DRL) agent adversarially. The resulting policy adapts in real time to high-impact, strategically optimized perturbations. Simulations on DER-rich networks show that the approach maintains voltage stability and operational efficiency under realistic attack scenarios, highlighting the effectiveness of gradient-based adversarial DRL in enhancing robustness and adaptability in modern distribution system control.
comment: 6 pages, Texas Power and Energy Conference 2026
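A white-box PGD attack of the kind named in this abstract can be sketched in a few lines. The abstract does not give the attack budget, step size, or loss, so eps, alpha, steps, and the caller-supplied grad_fn below are hypothetical; the sketch uses the standard L-infinity variant (signed gradient ascent followed by projection onto the eps-ball).

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def pgd_attack(x, grad_fn, eps=0.05, alpha=0.01, steps=20):
    """L-infinity PGD: ascend the loss, then project back into the eps-ball.

    x       -- clean observation (list of floats)
    grad_fn -- returns the gradient of the attacked loss at a given input
    """
    delta = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn([xi + di for xi, di in zip(x, delta)])
        # Signed gradient step, clipped so that |delta_i| <= eps.
        delta = [max(-eps, min(eps, di + alpha * sign(gi)))
                 for di, gi in zip(delta, g)]
    return [xi + di for xi, di in zip(x, delta)]
```

In adversarial training, the perturbed observation returned here would replace the clean one when the agent's policy is updated, so the policy sees worst-case inputs within the budget.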
RIS-aided Wireless Communication with Movable Elements Geometry Impact on Performance
Reconfigurable Intelligent Surfaces (RIS) are known as a promising technology to improve the performance of wireless communication networks, and have been extensively studied. Movable Antennas (MA) are a novel technology that fully exploits the antenna placement for enhancing the system performance. This article aims at evaluating the impact of transmit power and number of antenna elements on the outage probability performance of an MA-enabled RIS structure (MA-RIS), compared to existing Fixed-Position Antenna RIS (FPA-RIS). The change in geometry caused by the movement of antennas and its implications for the effective number of illuminated elements are studied for 1D and 2D array structures. Our numerical results confirm the performance advantage provided by MA-RIS, achieving a 24% improvement in outage probability and a 2 dB gain in Signal-to-Noise Ratio (SNR), as compared to FPA-RIS.
comment: 5 pages, 4 figures
Artificial intelligence for partial differential equations in computational mechanics: A review
In recent years, artificial intelligence (AI) has become ubiquitous, empowering various fields; in particular, the integration of AI with traditional science (AI for Science) has attracted widespread attention. Within AI for Science, using AI algorithms to solve partial differential equations (AI for PDEs) has become a focal point in computational mechanics. The core of AI for PDEs is the fusion of data and partial differential equations (PDEs), which can solve almost any PDE. Therefore, this article provides a comprehensive review of the research on AI for PDEs, summarizing the existing algorithms and theories. The article discusses the applications of AI for PDEs in computational mechanics, including solid mechanics, fluid mechanics, and biomechanics. The existing AI for PDEs algorithms include those based on Physics-Informed Neural Networks (PINNs), Deep Energy Methods (DEM), Operator Learning, and Physics-Informed Neural Operator (PINO). AI for PDEs represents a new method of scientific simulation that provides approximate solutions to specific problems using large amounts of data and then fine-tunes them according to specific physical laws, avoiding the need to compute from scratch as traditional algorithms do. Thus, AI for PDEs is the prototype for future foundation models in computational mechanics, capable of significantly accelerating traditional numerical algorithms.
Defining causal mechanism in dual process theory and two types of feedback control
Mental events are considered to supervene on physical events. A supervenient event does not change without a corresponding change in the underlying subvenient physical events. Since wholes and their parts exhibit the same supervenience-subvenience relations, inter-level causation has been expected to serve as a model for mental causation. We proposed an inter-level causation mechanism to construct a model of consciousness and an agent's self-determination. However, a significant gap exists between this mechanism and cognitive functions. Here, we demonstrate how to integrate the inter-level causation mechanism with the widely known dual-process theories. We assume that the supervenience level is composed of multiple supervenient functions (i.e., neural networks), and we argue that inter-level causation can be achieved by controlling the feedback error defined through changing algebraic expressions combining these functions. Using inter-level causation allows for a dual laws model in which each level possesses its own distinct dynamics. In this framework, the feedback error is determined independently by two processes: (1) the selection of equations combining supervenient functions, and (2) the negative feedback error reduction to satisfy the equations through adjustments of neurons and synapses. We interpret these two independent feedback controls as Type 1 and Type 2 processes in the dual process theories. As a result, theories of consciousness, agency, and dual process theory are unified into a single framework, and the characteristic features of Type 1 and Type 2 processes are naturally derived.
A Tutorial on Learning-Based Radio Map Construction: Data, Paradigms, and Physics-Awareness
The integration of artificial intelligence into next-generation wireless networks necessitates the accurate construction of radio maps (RMs) as a foundational prerequisite for electromagnetic digital twins. An RM provides the digital representation of the wireless propagation environment, mapping complex geographical and topological boundary conditions to critical spatial-spectral metrics that range from received signal strength to full channel state information matrices. This tutorial presents a comprehensive survey of learning-based RM construction, systematically addressing three intertwined dimensions: data, paradigms, and physics-awareness. From the data perspective, we review physical measurement campaigns, ray tracing simulation engines, and publicly available benchmark datasets, identifying their respective strengths and fundamental limitations. From the paradigm perspective, we establish a core taxonomy that categorizes RM construction into source-aware forward prediction and source-agnostic inverse reconstruction, and examine five principal neural architecture families spanning convolutional neural networks, vision transformers, graph neural networks, generative adversarial networks, and diffusion models. We further survey optics-inspired methods adapted from neural radiance fields and 3D Gaussian splatting for continuous wireless radiation field modeling. From the physics-awareness perspective, we introduce a three-level integration framework encompassing data-level feature engineering, loss-level partial differential equation regularization, and architecture-level structural isomorphism. Open challenges including foundation model development, physical hallucination detection, and amortized inference for real-time deployment are discussed to outline future research directions.
Influence Functions for Data Attribution in Linear System Identification and LQR Control
When a controller is designed from an identified model, its performance ultimately depends on the trajectories used for identification, but pinpointing which ones help or hurt remains an open problem. We bring influence functions, a data attribution tool from machine learning, into this setting by chaining two closed-form sensitivity analyses across a regularized least-squares identification and an infinite-horizon LQR pipeline. On the identification side, the quadratic loss admits an exact leave-one-trajectory-out (LOTO) parameter shift and a reusable first-order approximation with a Neumann series error bound. On the control side, we implicitly differentiate through the DARE via its discrete Lyapunov structure and compress the cost gradient to a single adjoint Lyapunov solve. The resulting scores track true LOTO retraining with Pearson correlations above 0.99 and speedups of 7 to 60 times on linear systems of dimension 2 to 10.
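The identification-side idea, an exact leave-one-out parameter shift next to a first-order influence approximation, can be illustrated on scalar ridge regression. This is a deliberately reduced sketch: the paper works with trajectories and matrix-valued parameters, whereas here each "trajectory" is a single (x, y) sample, and the names ridge_fit and loo_shift are hypothetical.

```python
def ridge_fit(xs, ys, lam):
    """Minimizer of sum((y - x * theta)^2) + lam * theta^2 for scalar theta."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def loo_shift(xs, ys, lam, i):
    """Change in theta when sample i is removed: (exact, first_order)."""
    a = sum(x * x for x in xs) + lam       # half of the Hessian of the loss
    theta = ridge_fit(xs, ys, lam)
    resid = ys[i] - xs[i] * theta          # residual of sample i at the full fit
    exact = -xs[i] * resid / (a - xs[i] ** 2)  # closed-form refit difference
    first_order = -xs[i] * resid / a           # influence-function estimate
    return exact, first_order
```

The two expressions differ only in the denominator, so the first-order score is accurate whenever the removed sample carries a small share of the total curvature, and it reuses the same quantity a for every sample.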
RDS-DeePC: Robust Data Selection for Data-Enabled Predictive Control via Sensitivity Score
Data-Enabled Predictive Control (DeePC) is an established model-free approach to predictive control, but it faces two open challenges: computational complexity that scales cubically with dataset size and performance degradation when data are corrupted. This paper introduces Robust Data Selection DeePC (RDS-DeePC), a framework that addresses both obstacles through influence function analysis. We derive a sensitivity score quantifying the leverage each trajectory segment exerts on the optimization solution and prove that high-sensitivity segments correspond to outliers while low-sensitivity segments represent consistent data. Selecting low-sensitivity segments thus yields both computational efficiency and automatic outlier filtering without requiring data quality labels. For nonlinear systems, we extend the framework via a two-stage online selection approach accelerated by the LiSSA algorithm. Experiments on four systems of increasing complexity, including a DC motor, an inverted pendulum, a planar quadrotor UAV tracking a figure-8 trajectory, and a kinematic bicycle vehicle following a figure-8 path, demonstrate that RDS-DeePC achieves 94 to 97 percent clean data selection and comparable or better tracking performance under 20 percent data corruption.
Data-Driven Successive Linearization for Optimal Voltage Control
Power distribution systems are increasingly exposed to large voltage fluctuations driven by intermittent renewable generation and time-varying loads (e.g., electric vehicles and storage). To address this challenge, a number of advanced controllers have been proposed for voltage regulation. However, these controllers typically rely on fixed linear approximations of voltage dynamics. As a result, the solutions may become infeasible when applied to the actual voltage behavior governed by nonlinear power flow equations, particularly under heavy power injection from distributed energy resources. This paper proposes a data-driven successive linearization approach for voltage control under nonlinear power flow constraints. By leveraging the fact that the deviation between the nonlinear power flow solution and its linearization is bounded by the distance from the operating point, we perform data-driven linearization around the most recent operating point. Convergence of the proposed method to a neighborhood of KKT points is established by exploiting the convexity of the objective function and structural properties of the nonlinear constraints. Case studies show that the proposed approach achieves fast convergence and adapts quickly to changes in net load.
Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance
With the growth of intelligent civil infrastructure and smart cities, operation and maintenance (O&M) increasingly requires safe, efficient, and energy-conscious robotic manipulation of articulated components, including access doors, service drawers, and pipeline valves. However, existing robotic approaches either focus primarily on grasping or target object-specific articulated manipulation, and they rarely incorporate explicit actuation energy into multi-objective optimisation, which limits their scalability and suitability for long-term deployment in real O&M settings. Therefore, this paper proposes an articulation-agnostic and energy-aware reinforcement learning framework for robotic manipulation in intelligent infrastructure O&M. The method combines part-guided 3D perception, weighted point sampling, and PointNet-based encoding to obtain a compact geometric representation that generalises across heterogeneous articulated objects. Manipulation is formulated as a Constrained Markov Decision Process (CMDP), in which actuation energy is explicitly modelled and regulated via a Lagrangian-based constrained Soft Actor-Critic scheme. The policy is trained end-to-end under this CMDP formulation, enabling effective articulated-object operation while satisfying a long-horizon energy budget. Experiments on representative O&M tasks demonstrate 16%-30% reductions in energy consumption, 16%-32% fewer steps to success, and consistently high success rates, indicating a scalable and sustainable solution for infrastructure O&M manipulation.
comment: 18 pages, 5 figures, 7 tables. This version supersedes all previous preprint versions
Uncertainty and Autarky: Cooperative Game Theory for Stable Local Energy Market Partitioning
Local energy markets empower prosumers to form coalitions for energy trading. However, the optimal partitioning of the distribution grid into such coalitions remains unclear, especially in constrained grids with stochastic production and consumption. This analysis must take into account the interests of both the grid operator and the constituent prosumers. In this work, we present a cooperative game theoretic framework to study distribution grid partitioning into local energy market coalitions under uncertain prosumption and grid constraints. We formulate the optimal stable partitioning problem to balance the interests of the grid operator with that of prosumers. Under deterministic load and generation, we show that the largest market coalition is the optimal stable partition. For the case of stochastic loads and generation, we provide an algorithm to evaluate the optimal stable partition. Numerical experiments are performed on benchmark and real world distribution grids. Our results help in understanding how uncertainty affects local energy market partitioning decisions in constrained distribution grids.
Deep Adaptive Model-Based Design of Experiments
Model-based design of experiments (MBDOE) is essential for efficient parameter estimation in nonlinear dynamical systems. However, conventional adaptive MBDOE requires costly posterior inference and design optimization between each experimental step, precluding real-time applications. We address this by combining Deep Adaptive Design (DAD), which amortizes sequential design into a neural network policy trained offline, with differentiable mechanistic models. For dynamical systems with known governing equations but uncertain parameters, we extend sequential contrastive training objectives to handle nuisance parameters and propose a transformer-based policy architecture that respects the temporal structure of dynamical systems. We demonstrate the approach on four systems of increasing complexity: a fed-batch bioreactor with Monod kinetics, a Haldane bioreactor with uncertain substrate inhibition, a two-compartment pharmacokinetic model with nuisance clearance parameters, and a DC motor for real-time deployment.
Dynamic Output-Feedback Controller Synthesis for Dissipativity and $H_2$ Performance from Noisy Input-State Data
In this paper we propose dynamic output-feedback controller synthesis methods for discrete-time linear time-invariant systems. The synthesis goal is to achieve dissipativity with respect to a given quadratic supply rate or a given $H_2$ performance level. It is assumed that the model of system dynamics is unknown, except for the disturbance term. Instead, we have a recorded trajectory of the control input and the state, which can be corrupted by an unknown but bounded disturbance. The state data is used only for the purpose of controller synthesis, while the designed controller is an output-feedback controller, i.e., the full state is not used for control in real time. The presented synthesis method is formulated in terms of linear matrix inequalities parametrized by a scalar variable. Within the considered setting, the synthesis procedure is non-conservative.
comment: 8 pages, 2 figures; $H_2$ controller synthesis method is added and numerical example is expanded
Unconditional Stability Analysis of N-Port Networks Based on Structured Singular Value Computation
In this paper, a novel approach based on robust stability concepts and tools is introduced to evaluate the unconditional stability of microwave active $n$-port devices. An efficient calculation of the Structured Singular Value of the $n \times n$ scattering matrix is proposed to obtain the stability characteristics of the device. The presented method is validated in two ways. First, it is applied to a referential 4x4 scattering parameter set for independent verification. Second, the method is applied to a 4-port GaAs FET amplifier fabricated in hybrid technology. The results confirm the validity and computational efficiency of the proposed approach.
comment: Updated to the Author Accepted Manuscript (AAM) of the paper included in the Proceedings of the 2024 IEEE Asia-Pacific Microwave Conference (APMC). Only minor formatting differences compared to the previous arXiv version
On the Impact of Voltage Unbalance on Distribution Locational Marginal Prices
Finding clear economic signals for distribution-network operation and expansion is increasingly important as single-phase loads and distributed energy resources proliferate. These devices create phase-to-phase imbalances that manifest as voltage unbalance, a power quality issue that accelerates insulation aging in machines and increases network losses, thereby raising costs for operators and consumers. Traditional grid codes address unbalance via disparate hard limits on various indices, with thresholds that differ across standards, offer no dynamic economic incentive, and undermine optimality. This paper proposes instead to treat voltage unbalance as a 'soft limit' by adding penalty terms to grid operation costs within a three-phase optimal power flow, reflecting the cost of the reduced asset lifetime caused by voltage unbalance. This unified approach yields dynamic economic signals, namely unbalance-aware Distribution Locational Marginal Prices (DLMPs), that reflect the cost of power quality deviations. A novel mathematical decomposition of the DLMP is developed, isolating the energy, loss, congestion, and unbalance components. Case studies conducted on two benchmark networks demonstrate the effectiveness and practical value of the proposed method. The results indicate that unbalance penalties reshape nodal prices, produce unexpected phase-level effects, and even allow scenarios where added load reduces unbalance and lowers costs, while providing planners and market designers with actionable insights to balance investment, operation, and power quality in modern distribution systems.
A Real-Time Control Barrier Function-Based Safety Filter for Motion Planning with Arbitrary Road Boundary Constraints
We present a real-time safety filter for motion planners, including learning-based ones, using Control Barrier Functions (CBFs) to provide formal guarantees for collision avoidance with road boundaries. A key feature of our approach is its ability to directly incorporate road geometries of arbitrary shape that are represented as polylines without resorting to conservative overapproximations. We formulate the safety filter as a constrained optimization problem, specifically a Quadratic Program (QP), which achieves safety by making minimal, necessary adjustments to the control actions issued by the nominal motion planner. We validate our safety filter through extensive numerical experiments across a variety of traffic scenarios featuring complex road boundaries. The results confirm its reliable safety and high computational efficiency (execution frequency up to 40 Hz). Code reproducing our experimental results and a video demonstration are available at github.com/bassamlab/SigmaRL.
comment: Published version, see https://doi.org/10.1109/ITSC60802.2025.11423203
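The "minimal adjustment" QP behind a CBF safety filter has a closed-form solution when there is only one affine constraint, which makes the mechanism easy to see. The sketch below assumes the CBF condition has already been reduced to a . u >= b (in the usual CBF form, a would be the Lie derivative of the barrier along the input and b would collect the drift and class-K terms); the paper's filter handles polyline road boundaries and is not limited to this single-constraint case. The name safety_filter is hypothetical.

```python
def safety_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a . u >= b (one affine constraint)."""
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if slack >= 0.0:
        return list(u_nom)  # nominal action already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible: a is zero and b > 0")
    # Project u_nom onto the constraint boundary along the direction a.
    return [ui - slack / norm_sq * ai for ui, ai in zip(u_nom, a)]
```

For example, with u_nom = [1.0, 0.0], a = [0.0, 1.0], and b = 0.5, only the second input component is raised to 0.5 and the first is left untouched. With several simultaneous constraints, a general QP solver would be needed instead of this projection.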
Benchmarking State Space Models, Transformers, and Recurrent Networks for US Grid Forecasting
Selecting the right deep learning model for power grid forecasting is challenging, as performance heavily depends on the data available to the operator. This paper presents a comprehensive benchmark of five modern neural architectures: two state space models (PowerMamba, S-Mamba), two Transformers (iTransformer, PatchTST), and a traditional LSTM. We evaluate these models on hourly electricity demand across six diverse US power grids for forecast windows between 24 and 168 hours. To ensure a fair comparison, we adapt each model with specialized temporal processing and a modular layer that cleanly integrates weather covariates. Our results reveal that there is no single best model for all situations. When forecasting using only historical load, PatchTST and the state space models provide the highest accuracy. However, when explicit weather data is added to the inputs, the rankings reverse: iTransformer improves its accuracy three times more efficiently than PatchTST. By controlling for model size, we confirm that this advantage stems from the architecture's inherent ability to mix information across different variables. Extending our evaluation to solar generation, wind power, and wholesale prices further demonstrates that model rankings depend on the forecast task: PatchTST excels on highly rhythmic signals like solar, while state space models are better suited for the chaotic fluctuations of wind and price. Ultimately, this benchmark provides grid operators with actionable guidelines for selecting the optimal forecasting architecture based on their specific data environments.
comment: 11 pages, 2 figures, 8 tables
An Agentic Multi-Agent Architecture for Cybersecurity Risk Management
Getting a real cybersecurity risk assessment for a small organization is expensive -- a NIST CSF-aligned engagement runs $15,000 on the low end, takes weeks, and depends on practitioners who are genuinely scarce. Most small companies skip it entirely. We built a six-agent AI system where each agent handles one analytical stage: profiling the organization, mapping assets, analyzing threats, evaluating controls, scoring risks, and generating recommendations. Agents share a persistent context that grows as the assessment proceeds, so later agents build on what earlier ones concluded -- the mechanism that distinguishes this from standard sequential agent pipelines. We tested it on a 15-person HIPAA-covered healthcare company and compared outputs to independent assessments by three CISSP practitioners -- the system agreed with them 85% of the time on severity classifications, covered 92% of identified risks, and finished in under 15 minutes. We then ran 30 repeated single-agent assessments across five synthetic but sector-realistic organizational profiles in healthcare, fintech, manufacturing, retail, and SaaS, comparing a general-purpose Mistral-7B against a domain fine-tuned model. Both completed every run. The fine-tuned model flagged threats the baseline could not see at all: PHI exposure in healthcare, OT/IIoT vulnerabilities in manufacturing, platform-specific risks in retail. The full multi-agent pipeline, however, failed every one of 30 attempts on a Tesla T4 with its 4,096-token default context window -- context capacity, not model quality, turned out to be the binding constraint.
comment: 15 pages, 1 figure, 2 tables. Submitted to AICTC 2026 (Springer LNCS)
A Control-Theoretic Foundation for Agentic Systems
This paper develops a control-theoretic framework for analyzing agentic systems embedded within feedback control loops, where an AI agent may adapt controller parameters, select among control strategies, invoke external tools, reconfigure decision architectures, and modify control objectives during operation. These capabilities are formalized by interpreting agency as hierarchical runtime decision authority over elements of the control architecture, leading to an augmented closed-loop representation in which physical states, internal memory, tool outputs, interaction signals, and design variables evolve as a coupled dynamical system. A five-level hierarchy of agency is defined, ranging from fixed control laws to runtime synthesis of control architectures and objectives. The analysis shows that increasing agency introduces interacting dynamical mechanisms such as time-varying adaptation, endogenous switching, decision-induced delays, and structural reconfiguration. The framework is developed in both nonlinear and linear settings, providing explicit design constraints for AI-enabled control systems in safety-critical applications.
Ensemble Kalman Inversion for Constrained Nonlinear MPC: An ADMM-Splitting Approach
This work proposes a novel Alternating Direction Method of Multipliers (ADMM)-based Ensemble Kalman Inversion (EKI) algorithm for solving constrained nonlinear model predictive control (NMPC) problems. First, stage-wise nonlinear inequality constraints in the NMPC problem are embedded via an augmented Lagrangian with nonnegative slack variables. We then show that the resulting unconstrained augmented-Lagrangian primal subproblem admits a Bayesian interpretation: under independent Gaussian virtual observations, its minimizers coincide with MAP estimators, enabling solution via EKI. However, since the nonnegativity constraint on the slacks is a hard constraint not naturally encoded by a Gaussian model, our proposed algorithm yields a two-block ADMM scheme that alternates between (i) an inexact primal step that minimizes the augmented-Lagrangian objective (implemented via EKI rollouts), (ii) a nonnegativity projection for the slacks, and (iii) a dual ascent step. To balance exploration and convergence, an annealing schedule tempers sampling covariances while a penalty schedule increases constraint enforcement over outer iterations, encouraging global search early and precise constraint satisfaction later. We evaluate the proposed controller on a 6-DOF UR5e manipulation benchmark in MuJoCo, comparing it against DIAL-MPC (an iterative MPPI variant) as the arm traverses a cluttered tabletop environment.
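The three alternating steps named in the abstract above can be illustrated on a toy scalar problem. The sketch below is not the authors' algorithm: it assumes a hypothetical objective (x - 2)^2 with the single constraint x - 1 <= 0, uses plain gradient descent as a stand-in for the EKI rollouts, and keeps the penalty parameter fixed rather than scheduled.

```python
def g(x):
    """Toy inequality constraint g(x) <= 0 (hypothetical, for illustration)."""
    return x - 1.0


def primal_step(x, s, lam, rho, n_inner=20):
    """Inexactly minimize the augmented Lagrangian in x.

    Gradient descent stands in here for the EKI rollouts described in the abstract.
    """
    step = 0.5 / (2.0 + rho)  # half of the inverse curvature of this quadratic
    for _ in range(n_inner):
        # d/dx of (x - 2)^2 + lam * (g + s) + (rho / 2) * (g + s)^2
        grad = 2.0 * (x - 2.0) + lam + rho * (g(x) + s)
        x -= step * grad
    return x


x, s, lam, rho = 0.0, 0.0, 0.0, 1.0
for _ in range(100):
    x = primal_step(x, s, lam, rho)      # (i) inexact primal step
    s = max(0.0, -g(x) - lam / rho)      # (ii) projection of the slack onto s >= 0
    lam += rho * (g(x) + s)              # (iii) dual ascent

print(round(x, 4), round(s, 4), round(lam, 4))
```

On this toy problem the iterates approach the constrained minimizer x = 1 with multiplier 2, which matches the KKT conditions for the assumed objective.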
A Necessary and Sufficient Condition for Local Synchronization in Nonlinear Oscillator Networks
Determining conditions on the coupling strength for synchronization in networks of interconnected oscillators is a challenging problem in nonlinear dynamics. While sophisticated mathematical methods have been used to derive conditions, these conditions are usually only sufficient and/or based on numerical methods. We addressed the gap between the sufficient coupling strength and numerical observations using Lyapunov-Floquet theory and the Master Stability Function framework. We showed that a positive coupling strength is a necessary and sufficient condition for local synchronization in a network of identical oscillators coupled linearly and in full-state fashion. For partial-state coupling, we showed that a positive coupling constant results in an asymptotic contraction of the trajectories in the state space, which results in synchronization for two-dimensional oscillators. We extended the results to networks with non-identical coupling over directed graphs and showed that positive coupling constants are a sufficient condition for synchronization. These theoretical results are validated using numerical simulations and experimental implementations. Our results contribute to bridging the gap between the theoretically derived sufficient coupling strengths and the numerically observed ones.
comment: 6 pages, 7 figures, Journal
Robotics
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
comment: 10 pages, 5 figures
DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world platforms.
UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos CVPR 2026
Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset of over 50K trajectories across eight dexterous hands (6 to 24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure to align fingertip trajectories while preserving plausible hand-object contacts, and we operate on explicit 3D pointclouds with human hands masked to narrow kinematic and visual gaps. Second, we introduce the Function-Actuator-Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Leveraging FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories to enable human-robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.
comment: Accepted by CVPR 2026
DexDrummer: In-Hand, Contact-Rich, and Long-Horizon Dexterous Robot Drumming
Performing in-hand, contact-rich, and long-horizon dexterous manipulation remains an unsolved challenge in robotics. Prior hand dexterity works have considered each of these three challenges in isolation, yet do not combine these skills into a single, complex task. To further test the capabilities of dexterity, we propose drumming as a testbed for dexterous manipulation. Drumming naturally integrates all three challenges: it involves in-hand control for stabilizing and adjusting the drumstick with the fingers, contact-rich interaction through repeated striking of the drum surface, and long-horizon coordination when switching between drums and sustaining rhythmic play. We present DexDrummer, a hierarchical object-centric bimanual drumming policy trained in simulation with sim-to-real transfer. The framework reduces the exploration difficulty of pure reinforcement learning by combining trajectory planning with residual RL corrections for fast transitions between drums. A dexterous manipulation policy handles contact-rich dynamics, guided by rewards that explicitly model both finger-stick and stick-drum interactions. In simulation, we show our policy can play two styles of music: multi-drum, bimanual songs and challenging, technical exercises that require increased dexterity. Across simulated bimanual tasks, our dexterous, reactive policy outperforms a fixed grasp policy in F1 score by 1.87x on easy songs and 1.22x on hard songs. In real-world tasks, we show song performance across a multi-drum setup. DexDrummer is able to play our training song and its extended version with an F1 score of 1.0.
comment: Website: https://dexdrummer.github.io/
Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control
Humanoid robots require diverse motor skills to integrate into complex environments, but bridging the kinematic and dynamic embodiment gap from human data remains a major bottleneck. We demonstrate through Hessian analysis that traditional optimization-based retargeting is inherently non-convex and prone to local optima, leading to physical artifacts like joint jumps and self-penetration. To address this, we reformulate the retargeting problem as learning a data distribution rather than searching for optimal solutions, and propose NMR, a Neural Motion Retargeting framework that transforms static geometric mapping into a dynamics-aware learned process. We first propose Clustered-Expert Physics Refinement (CEPR), a hierarchical data pipeline that leverages VAE-based motion clustering to group heterogeneous movements into latent motifs. This strategy significantly reduces the computational overhead of massively parallel reinforcement learning experts, which project and repair noisy human demonstrations onto the robot's feasible motion manifold. The resulting high-fidelity data supervises a non-autoregressive CNN-Transformer architecture that reasons over global temporal context to suppress reconstruction noise and bypass geometric traps. Experiments on the Unitree G1 humanoid across diverse dynamic tasks (e.g., martial arts, dancing) show that NMR eliminates joint jumps and significantly reduces self-collisions compared to state-of-the-art baselines. Furthermore, NMR-generated references accelerate the convergence of downstream whole-body control policies, establishing a scalable path for bridging the human-robot embodiment gap.
comment: Report, 12 pages, 5 figures, 4 tables
Cross-Modal Reinforcement Learning for Navigation with Degraded Depth Measurements
This paper presents a cross-modal learning framework that exploits complementary information from depth and grayscale images for robust navigation. We introduce a Cross-Modal Wasserstein Autoencoder that learns shared latent representations by enforcing cross-modal consistency, enabling the system to infer depth-relevant features from grayscale observations when depth measurements are corrupted. The learned representations are integrated with a Reinforcement Learning-based policy for collision-free navigation in unstructured environments when depth sensors experience degradation due to adverse conditions such as poor lighting or reflective surfaces. Simulation and real-world experiments demonstrate that our approach maintains robust performance under significant depth degradation and successfully transfers to real environments.
comment: Accepted to the 24th European Control Conference (ECC) 2026
Feasibility of Augmented Reality-Guided Robotic Ultrasound with Cone-Beam CT Integration for Spine Procedures
Accurate needle placement in spine interventions is critical for effective pain management, yet it depends on reliable identification of anatomical landmarks and careful trajectory planning. Conventional imaging guidance often relies on both CT and X-ray fluoroscopy, exposing patients and staff to high doses of radiation while providing limited real-time 3D feedback. We present an optical see-through augmented reality (OST-AR)-guided robotic system for spine procedures that provides in situ visualization of spinal structures to support needle trajectory planning. We integrate a cone-beam CT (CBCT)-derived 3D spine model that is co-registered with live ultrasound, enabling users to combine global anatomical context with local, real-time imaging. We evaluated the system in a phantom user study involving two representative spine procedures: facet joint injection and lumbar puncture. Sixteen participants performed insertions under two visualization conditions: conventional screen vs. AR. Results show that AR significantly reduces execution time and across-task placement error, while also improving usability, trust, and spatial understanding and lowering cognitive workload. These findings demonstrate the feasibility of AR-guided robotic ultrasound for spine interventions, highlighting its potential to enhance accuracy, efficiency, and user experience in image-guided procedures.
comment: 8 pages, 7 figures
Closed-Loop Verbal Reinforcement Learning for Task-Level Robotic Planning
We propose a new Verbal Reinforcement Learning (VRL) framework for interpretable task-level planning in mobile robotic systems operating under execution uncertainty. The framework follows a closed-loop architecture that enables iterative policy improvement through interaction with the physical environment. In our framework, executable Behavior Trees are repeatedly refined by a Large Language Model actor using structured natural-language feedback produced by a Vision-Language Model critic that observes the physical robot and execution traces. Unlike conventional reinforcement learning, policy updates in VRL occur directly at the symbolic planning level, without gradient-based optimization. This enables transparent reasoning, explicit causal feedback, and human-interpretable policy evolution. We validate the proposed framework on a real mobile robot performing a multi-stage manipulation and navigation task under execution uncertainty. Experimental results show that the framework supports explainable policy improvements, closed-loop adaptation to execution failures, and reliable deployment on physical robotic systems.
From Singleton Obstacles to Clutter: Translation Invariant Compositional Avoid Sets
This paper studies obstacle avoidance under translation invariant dynamics using an avoid-side travel cost Hamilton Jacobi formulation. For running costs that are zero outside an obstacle and strictly negative inside it, we prove that the value function is non-positive everywhere, equals zero exactly outside the avoid set, and is strictly negative exactly on it. Under translation invariance, this yields a reuse principle: the value of any translated obstacle is obtained by translating a single template value function. We show that the pointwise minimum of translated template values exactly characterizes the union of the translated single-obstacle avoid sets and provides a conservative inner certificate of unavoidable collision in clutter. To reduce conservatism, we introduce a blockwise composition framework in which subsets of obstacles are merged and solved jointly. This yields a hierarchy of conservative certificates from singleton reuse to the exact clutter value, together with monotonicity under block merging and an exactness criterion based on the existence of a common clutter avoiding control. The framework is illustrated on a Dubins car example in a repeated clutter field.
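The reuse principle described above (translate one template value function, then take a pointwise minimum) can be sketched in a few lines. This is a minimal illustration under strong assumptions: `template_value` is a hypothetical stand-in that is zero outside a unit disc and negative inside it, whereas a real template would come from solving the Hamilton-Jacobi problem for the given dynamics. The function and parameter names are illustrative and not taken from the paper.

```python
import math


def template_value(x, y, radius=1.0):
    """Hypothetical template: zero outside a disc at the origin, negative inside.

    A real template would be the solved single-obstacle value function.
    """
    return min(0.0, math.hypot(x, y) - radius)


def clutter_value(x, y, centers, radius=1.0):
    """Pointwise minimum of the template translated to each obstacle center."""
    return min(template_value(x - cx, y - cy, radius) for cx, cy in centers)


centers = [(0.0, 0.0), (3.0, 0.0)]
inside_first = clutter_value(0.5, 0.0, centers)   # point inside the first disc
between = clutter_value(1.5, 0.0, centers)        # point between the two discs
print(inside_first, between)
```

Because each translated value is non-positive and negative only on its own set, the minimum is negative exactly on the union of those sets, which is the property the abstract states for the single-obstacle avoid sets.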
ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling
Deploying learned robot manipulation policies in industrial settings requires rigorous pre-deployment validation, yet exhaustive testing across high-dimensional parameter spaces is intractable. We present ROBOGATE, a deployment risk management framework that combines physics-based simulation with a two-stage adaptive sampling strategy to efficiently discover failure boundaries in the operational parameter space. Stage 1 employs Latin Hypercube Sampling (LHS) across an 8-dimensional parameter space to establish a coarse failure landscape from 20,000 uniformly distributed experiments. Stage 2 applies boundary-focused sampling that concentrates 10,000 additional experiments in the 30-70% success rate transition zone, enabling precise failure boundary mapping. Using NVIDIA Isaac Sim with Newton physics, we evaluate a scripted pick-and-place controller on two robot embodiments -- Franka Panda (7-DOF) and UR5e (6-DOF) -- across 30,000 total experiments. Our logistic regression risk model achieves an AUC of 0.780 on the combined dataset (vs. 0.754 for Stage 1 alone), identifies a closed-form failure boundary equation, and reveals four universal danger zones affecting both robot platforms. We further demonstrate the framework on VLA (Vision-Language-Action) model evaluation, where Octo-Small achieves 0.0% success rate on 68 adversarial scenarios versus 100% for the scripted baseline -- a 100-point gap that underscores the challenge of deploying foundation models in industrial settings. ROBOGATE is open-source and runs on a single GPU workstation.
comment: 12 pages, 5 figures, open-source code and 30K failure pattern dataset available at https://github.com/liveplex-cpu/robogate
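The two-stage sampling idea in the ROBOGATE abstract can be sketched with the standard library alone. The code below is an illustrative sketch, not the released implementation: `toy_success_rate` is a hypothetical stand-in for simulated success rates, the Gaussian perturbation in Stage 2 is an assumed way of concentrating samples, and the sample counts are far smaller than the 20,000 and 10,000 experiments reported.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in the unit cube, one per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        # One random offset inside each of the n_samples equal-width strata.
        col = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(col)  # decouple the strata across dimensions
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def toy_success_rate(x):
    """Hypothetical stand-in for a simulated success rate; not the paper's model."""
    return max(0.0, min(1.0, 1.5 - 2.0 * sum(x) / len(x)))


def boundary_focused(stage1, n_new, rng, lo=0.3, hi=0.7, sigma=0.05):
    """Stage 2 sketch: perturb Stage 1 points whose success rate lies in [lo, hi]."""
    seeds = [x for x in stage1 if lo <= toy_success_rate(x) <= hi]
    new_points = []
    for _ in range(n_new):
        base = rng.choice(seeds)
        # Clip each perturbed coordinate back into the unit interval.
        new_points.append(tuple(min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in base))
    return new_points


rng = random.Random(0)
stage1 = latin_hypercube(200, 8, rng)       # coarse coverage of the 8-D space
stage2 = boundary_focused(stage1, 100, rng)  # extra samples near the transition zone
print(len(stage1), len(stage2))
```

Stage 1 guarantees that every one of the 200 strata along each axis is hit exactly once, while Stage 2 spends its budget only around points whose toy success rate falls in the 30-70% band.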
Programming Manufacturing Robots with Imperfect AI: LLMs as Tuning Experts for FDM Print Configuration Selection
We use fused deposition modeling (FDM) 3D printing as a case study of how manufacturing robots can use imperfect AI to acquire process expertise. In FDM, print configuration strongly affects output quality. Yet, novice users typically rely on default configurations, trial-and-error, or recommendations from generic AI models (e.g., ChatGPT). These strategies can produce complete prints, but they do not reliably meet specific objectives. Experts iteratively tune print configurations using evidence from prior prints. We present a modular closed-loop approach that treats an LLM as a source of tuning expertise. We embed this source of expertise within a Bayesian optimization loop. An approximate evaluator scores each print configuration and returns structured diagnostics, which the LLM uses to propose natural-language adjustments that are compiled into machine-actionable guidance for optimization. On 100 Thingi10k parts, our LLM-guided loop achieves the best configuration on 78% of objects with 0% likely-to-fail cases, while single-shot AI model recommendations are rarely best and exhibit 15% likely-to-fail cases. These results suggest that LLMs provide more value as constrained decision modules in evidence-driven optimization loops than as end-to-end oracles for print configuration selection. We expect this result to extend to broader LLM-based robot programming.
FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario CVPR 2026
The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under a free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/
comment: Accepted to CVPR 2026
Do World Action Models Generalize Better than VLAs? A Robustness Study
Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models that are trained on large corpora of video data to predict future states. With minor adaptations, their latent representation can be decoded into robot actions. It has been suggested that their explicit dynamic prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state-of-the-art VLA policies and recently released WAMs. We evaluate their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot-VA reaching 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus. While VLAs such as $π_{0.5}$ can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamic learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.
MineRobot: A Unified Framework for Kinematics Modeling and Solving of Underground Mining Robots in Virtual Environments
Underground mining robots are increasingly operated in virtual environments (VEs) for training, planning, and digital-twin applications, where reliable kinematics is essential for avoiding hazardous in-situ trials. Unlike typical open-chain industrial manipulators, mining robots are often closed-chain mechanisms driven by linear actuators and involving planar four-bar linkages, which makes both kinematics modeling and real-time solving challenging. We present MineRobot, a unified framework for modeling and solving the kinematics of underground mining robots in VEs. First, we introduce the Mining Robot Description Format (MRDF), a domain-specific representation that parameterizes kinematics for mining robots with native semantics for actuators and loop closures. Second, we develop a topology-processing pipeline that contracts four-bar substructures into generalized joints and, for each actuator, extracts an Independent Topologically Equivalent Path (ITEP), which is classified into one of four canonical types. Third, leveraging ITEP independence, we compose per-type solvers into an actuator-centered sequential forward-kinematics (FK) pipeline. Building on the same decomposition, we formulate inverse kinematics (IK) as a bound-constrained optimization problem and solve it with a Gauss-Seidel-style procedure that alternates actuator-length updates. By converting coupled closed-loop kinematics into a sequence of small topology-aware solves, the framework avoids robot-specific hand derivations and supports efficient computation. Experiments demonstrate that MineRobot provides the real-time performance and robustness required by VE applications.
RAFL: Generalizable Sim-to-Real of Soft Robots with Residual Acceleration Field Learning
Differentiable simulators enable gradient-based optimization of soft robots over material parameters, control, and morphology, but accurately modeling real systems remains challenging due to the sim-to-real gap. This issue becomes more pronounced when geometry is itself a design variable. System identification reduces discrepancies by fitting global material parameters to data; however, when constitutive models are misspecified or observations are sparse, identified parameters often absorb geometry-dependent effects rather than reflect intrinsic material behavior. More expressive constitutive models can improve accuracy but substantially increase computational cost, limiting practicality. We propose a residual acceleration field learning (RAFL) framework that augments a base simulator with a transferable, element-level corrective dynamics field. Operating on shared local features, the model is agnostic to global mesh topology and discretization. Trained end-to-end through a differentiable simulator using sparse marker observations, the learned residual generalizes across shapes. In both sim-to-sim and sim-to-real experiments, our method achieves consistent zero-shot improvements on unseen morphologies, while system identification frequently exhibits negative transfer. The framework also supports continual refinement, enabling simulation accuracy to accumulate during morphology optimization.
MEVIUS2: Practical Open-Source Quadruped Robot with Sheet Metal Welding and Multimodal Perception
Various quadruped robots have been developed to date, and thanks to reinforcement learning, they are now capable of traversing diverse types of rough terrain. In parallel, there is a growing trend of releasing these robot designs as open-source, enabling researchers to freely build and modify robots themselves. However, most existing open-source quadruped robots have been designed with 3D printing in mind, resulting in structurally fragile systems that do not scale well in size, leading to the construction of relatively small robots. Although a few open-source quadruped robots constructed with metal components exist, they still tend to be small in size and lack multimodal sensors for perception, making them less practical. In this study, we developed MEVIUS2, an open-source quadruped robot with a size comparable to Boston Dynamics' Spot, whose structural components can all be ordered through e-commerce services. By leveraging sheet metal welding and metal machining, we achieved a large, highly durable body structure while reducing the number of individual parts. Furthermore, by integrating sensors such as LiDARs and a high dynamic range camera, the robot is capable of detailed perception of its surroundings, making it more practical than previous open-source quadruped robots. We experimentally validated that MEVIUS2 can traverse various types of rough terrain and demonstrated its environmental perception capabilities. All hardware, software, and training environments can be obtained from Supplementary Materials or https://github.com/haraduka/mevius2.
comment: Accepted to IEEE Robotics and Automation Practice, Website - https://haraduka.github.io/mevius2-hardware/
6D Robotic OCT Scanning of Curved Tissue Surfaces
Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, both stitching by registration and translational scanning are limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot-mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. Because the proposed approach does not rely on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.
comment: Accepted at IEEE ISBI 2026
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies relevant target objects and goal locations. These spatial anchors are then overlaid directly onto visual observations as structured visual prompts, such as crosshairs and bounding boxes. Guided by these prompts and enhanced by a novel auxiliary visual grounding objective during training, a "System 1 Controller" reliably generates precise low-level execution motions. Experiments on the Robocasa-GR1-Tabletop benchmark and SimplerEnv simulation demonstrate that VP-VLA improves success rates by 5% and 8.3%, surpassing competitive baselines including QwenOFT and GR00T-N1.6.
comment: Project page: https://visualprompt-vla.github.io/
Disengagement Analysis and Field Tests of a Prototypical Open-Source Level 4 Autonomous Driving System
Proprietary Autonomous Driving Systems are typically evaluated through disengagements, unplanned manual interventions to alter vehicle behavior, as annually reported by the California Department of Motor Vehicles. However, the real-world capabilities of prototypical open-source Level 4 vehicles over substantial distances remain largely unexplored. This study evaluates a research vehicle running an Autoware-based software stack across 236 km of mixed traffic. By classifying 30 disengagements across 26 rides with a novel five-level criticality framework, we observed a spatial disengagement rate of 0.127 per km. Interventions predominantly occurred at lower speeds near static objects and traffic lights. Perception and Planning failures accounted for 40% and 26.7% of disengagements, respectively, largely due to object-tracking losses and operational deadlocks caused by parked vehicles. Frequent, unnecessary interventions highlighted a lack of trust on the part of the safety driver. These results show that while open-source software enables extensive operations, disengagement analysis is vital for uncovering robustness issues missed by standard metrics.
comment: 8 pages, submitted to IEEE for possible publication
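The headline figures in the disengagement abstract above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated there (30 disengagements, 236 km, 40% and 26.7% shares). The per-category counts it derives are inferred from the rounded percentages and are an assumption, not values reported by the authors.

```python
# Figures quoted in the abstract.
disengagements = 30
distance_km = 236.0

# Spatial disengagement rate (events per km) and its inverse.
rate_per_km = disengagements / distance_km
km_per_disengagement = distance_km / disengagements

# Counts implied by the reported shares (assumption: the shares are
# rounded fractions of the same 30 events).
perception_count = round(0.40 * disengagements)
planning_count = round(0.267 * disengagements)

print(f"rate: {rate_per_km:.3f} per km")
print(f"mean distance between disengagements: {km_per_disengagement:.2f} km")
print(f"implied perception / planning counts: {perception_count} / {planning_count}")
```

Read the other way round, the rate corresponds to roughly one intervention every 7.9 km.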
Collision-Free Velocity Scheduling for Multi-Agent Systems on Predefined Routes via Inexact-Projection ADMM
In structured multi-agent transportation systems, agents often must follow predefined routes, making spatial rerouting undesirable or impossible. This paper addresses route-constrained multi-agent coordination by optimizing waypoint passage times while preserving each agent's assigned waypoint order and nominal route assignment. A differentiable surrogate trajectory model maps waypoint timings to smooth position profiles and captures first-order tracking lag, enabling pairwise safety to be encoded through distance-based penalties evaluated on a dense temporal grid spanning the mission horizon. The resulting nonlinear and nonconvex velocity-scheduling problem is solved using an inexact-projection Alternating Direction Method of Multipliers (ADMM) algorithm that combines structured timing updates with gradient-based collision-correction steps and avoids explicit integer sequencing variables. Numerical experiments on random-crossing, bottleneck, and graph-based network scenarios show that the proposed method computes feasible and time-efficient schedules across a range of congestion levels and yields shorter mission completion times than a representative hierarchical baseline in the tested bottleneck cases.
IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments
Object Goal Navigation (ObjectNav) in temporally changing indoor environments is challenging because object relocation can invalidate historical scene knowledge. To address this issue, we propose a probabilistic planning framework that combines uncertainty-aware scene priors with online target relevance estimates derived from a Vision Language Model (VLM). The framework contains a dual-layer semantic mapping module and a real-time planner. The mapping module includes an Information Gain Map (IGM) built from a 3D scene graph (3DSG) during prior exploration to model object co-occurrence relations and provide global guidance on likely target regions. It also maintains a VLM score map (VLM-SM) that fuses confidence-weighted semantic observations into the map for local validation of the current scene. Based on these two cues, we develop a planner that jointly exploits information gain and semantic evidence for online decision making. The planner biases tree expansion toward semantically salient regions with high prior likelihood and strong online relevance (IGV-RRT), while preserving kinematic feasibility through gradient-based analysis. Simulation and real-world experiments demonstrate that the proposed method effectively mitigates the impact of object rearrangement, achieving higher search efficiency and success rates than representative baselines in complex indoor environments.
Optimal Solutions for the Moving Target Vehicle Routing Problem with Obstacles via Lazy Branch and Price
The Moving Target Vehicle Routing Problem with Obstacles (MT-VRP-O) seeks trajectories for several agents that collectively intercept a set of moving targets. Each target has one or more time windows where it must be visited, and the agents must avoid static obstacles and satisfy speed and capacity constraints. We introduce Lazy Branch-and-Price with Relaxed Continuity (Lazy BPRC), which finds optimal solutions for the MT-VRP-O. Lazy BPRC applies the branch-and-price framework for VRPs, which alternates between a restricted master problem (RMP) and a pricing problem. The RMP aims to select a sequence of target-time window pairings (called a tour) for each agent to follow, from a limited subset of tours. The pricing problem adds tours to the limited subset. Conventionally, solving the RMP requires computing the cost for an agent to follow each tour in the limited subset. Computing these costs in the MT-VRP-O is computationally intensive, since it requires collision-free motion planning between moving targets. Lazy BPRC defers cost computations by solving the RMP using lower bounds on the costs of each tour, computed via motion planning with relaxed continuity constraints. We lazily evaluate the true costs of tours as-needed. We compute a tour's cost by searching for a shortest path on a Graph of Convex Sets (GCS), and we accelerate this search using our continuity relaxation method. We demonstrate that Lazy BPRC runs up to an order of magnitude faster than two ablations.
Sim-to-Real of Humanoid Locomotion Policies via Joint Torque Space Perturbation Injection
This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Unlike prior methods that typically rely on domain randomization over a fixed finite set of parameters, the proposed approach injects state-dependent perturbations into the input joint torque during forward simulation. These perturbations are designed to simulate a broader spectrum of reality gaps than standard parameter randomization without requiring additional training. By using neural networks as flexible perturbation generators, the proposed method can represent complex, state-dependent uncertainties, such as nonlinear actuator dynamics and contact compliance, that parametric randomization cannot capture. Experimental results demonstrate that the proposed approach enables humanoid locomotion policies to achieve superior robustness against complex, unseen reality gaps in both simulation and real-world deployment.
Directional Mollification for Controlled Smooth Path Generation
Path generation, the problem of producing smooth, executable paths from discrete planning outputs, such as waypoint sequences, is a fundamental step in the control of autonomous robots, industrial robots, and CNC machines, as path following and trajectory tracking controllers impose strict differentiability requirements on their reference inputs to guarantee stability and convergence, particularly for nonholonomic systems. Mollification has been recently proposed as a computationally efficient and analytically tractable tool for path generation, offering formal smoothness and curvature guarantees with advantages over spline interpolation and optimization-based methods. However, this mollification is subject to a fundamental geometric constraint: the smoothed path is confined within the convex hull of the original path, precluding exact waypoint interpolation, even when explicitly required by mission specifications or upstream planners. We introduce directional mollification, a novel operator that resolves this limitation while retaining the analytical tractability of classical mollification. The proposed operator generates infinitely differentiable paths that strictly interpolate prescribed waypoints, converge to the original non-differentiable input with arbitrary precision, and satisfy explicit curvature bounds given by a closed-form expression, addressing the core requirements of path generation for controlled autonomous systems. We further establish a parametric family of path generation operators that contains both classical and directional mollification as special cases, providing a unifying theoretical framework for the systematic generation of smooth, feasible paths from non-differentiable planning outputs.
Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control
Attention mechanisms excel at learning sequential patterns by discriminating data based on relevance and importance. This provides state-of-the-art performance in advanced generative artificial intelligence models. This paper applies this concept of an attention mechanism for multi-agent safe control. We specifically consider the design of a neural network to control autonomous vehicles in a highway merging scenario. The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Within a QMIX framework, we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles. Moreover, we propose a comprehensive reward signal that considers the global objectives of the environment (e.g., safety and vehicle flow) and the individual interests of each agent. Simulations are conducted in the Simulation of Urban Mobility (SUMO). The results show better performance compared to other driving algorithms in terms of safety, driving speed, and reward.
comment: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
Memory-Efficient Boundary Map for Large-Scale Occupancy Grid Mapping
Determining the occupancy status of locations in the environment is a fundamental task for safety-critical robotic applications. Traditional occupancy grid mapping methods subdivide the environment into a grid of voxels, each associated with one of three occupancy states: free, occupied, or unknown. These methods explicitly maintain all voxels within the mapped volume and determine the occupancy state of a location by directly querying the corresponding voxel that the location falls within. However, maintaining all grid voxels in high-resolution and large-scale scenarios requires substantial memory resources. In this paper, we introduce a novel representation that only maintains the boundary of the mapped volume. Specifically, we explicitly represent the boundary voxels, such as the occupied voxels and frontier voxels, while free and unknown voxels are automatically represented by volumes within or outside the boundary, respectively. As our representation maintains only a closed surface in two-dimensional (2D) space, instead of the entire volume in three-dimensional (3D) space, it significantly reduces memory consumption. Then, based on this 2D representation, we propose a method to determine the occupancy state of arbitrary locations in the 3D environment. We call this method the boundary map. In addition, we design a novel data structure for maintaining the boundary map, supporting efficient occupancy state queries. Theoretical analyses of the occupancy state query algorithm are also provided. Furthermore, to enable efficient construction and updates of the boundary map from real-time sensor measurements, we propose a global-local mapping framework and corresponding update algorithms. Finally, we will make our implementation of the boundary map open-source on GitHub to benefit the community: https://github.com/hku-mars/BDM.
Can a Robot Walk the Robotic Dog: Triple-Zero Collaborative Navigation for Heterogeneous Multi-Agent Systems
We present Triple Zero Path Planning (TZPP), a collaborative framework for heterogeneous multi-robot systems that requires zero training, zero prior knowledge, and zero simulation. TZPP employs a coordinator-explorer architecture: a humanoid robot handles task coordination, while a quadruped robot explores and identifies feasible paths using guidance from a multimodal large language model. We implement TZPP on Unitree G1 and Go2 robots and evaluate it across diverse indoor and outdoor environments, including obstacle-rich and landmark-sparse settings. Experiments show that TZPP achieves robust, human-comparable efficiency and strong adaptability to unseen scenarios. By eliminating reliance on training and simulation, TZPP offers a practical path toward real-world deployment of heterogeneous robot cooperation. Our code and video are provided at: https://github.com/triple-zeropp/Triple-zero-robot-agent
comment: 8 pages, 2 figures
BiPreManip: Learning Affordance-Based Bimanual Preparatory Manipulation through Anticipatory Collaboration CVPR 2026
Many everyday objects are difficult to directly grasp (e.g., a flat iPad) or manipulate functionally (e.g., opening the cap of a pen lying on a desk). Such tasks require sequential, asymmetric coordination between two arms, where one arm performs preparatory manipulation that enables the other's goal-directed action - for instance, pushing the iPad to the table's edge before picking it up, or lifting the pen body to allow the other hand to remove its cap. In this work, we introduce Collaborative Preparatory Manipulation, a class of bimanual manipulation tasks that demand understanding object semantics and geometry, anticipating spatial relationships, and planning long-horizon coordinated actions between the two arms. To tackle this challenge, we propose a visual affordance-based framework that first envisions the final goal-directed action and then guides one arm to perform a sequence of preparatory manipulations that facilitate the other arm's subsequent operation. This affordance-centric representation enables anticipatory inter-arm reasoning and coordination, generalizing effectively across various objects spanning diverse categories. Extensive experiments in both simulation and the real world demonstrate that our approach substantially improves task success rates and generalization compared to competitive baselines.
comment: Accepted to CVPR 2026
PRM-as-a-Judge: A Dense Evaluation Paradigm for Fine-Grained Robotic Auditing
Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
RTD-RAX: Fast, Safe Trajectory Planning for Systems under Unknown Disturbances
Reachability-based Trajectory Design (RTD) is a provably safe, real-time trajectory planning framework that combines offline reachable-set computation with online trajectory optimization. However, standard RTD implementations suffer from two key limitations: conservatism induced by worst-case reachable-set overapproximations, and an inability to account for real-time disturbances during execution. This paper presents RTD-RAX, a runtime-assurance extension of RTD that utilizes a non-conservative RTD formulation to rapidly generate goal-directed candidate trajectories, and utilizes mixed monotone reachability for fast, disturbance-aware online safety certification. When proposed trajectories fail safety certification under real-time uncertainty, a repair procedure finds nearby safe trajectories that preserve progress toward the goal while guaranteeing safety under real-time disturbances.
Conformal Koopman for Embedded Nonlinear Control with Statistical Robustness: Theory and Real-World Validation ICRA
We propose a fully data-driven, Koopman-based framework for statistically robust control of discrete-time nonlinear systems with linear embeddings. Establishing a connection between the Koopman operator and contraction theory, it offers distribution-free probabilistic bounds on the state tracking error under Koopman modeling uncertainty. Conformal prediction is employed here to rigorously derive a bound on the state-dependent modeling uncertainty throughout the trajectory, ensuring safety and robustness without assuming a specific error prediction structure or distribution. Unlike prior approaches that merely combine conformal prediction with Koopman-based control in an open-loop setting, our method establishes a closed-loop control architecture with formal guarantees that explicitly account for both forward and inverse modeling errors. Also, by expressing the tracking error bound in terms of the control parameters and the modeling errors, our framework offers a quantitative means to formally enhance the performance of arbitrary Koopman-based control. We validate our method both in numerical simulations with the Dubins car and in real-world experiments with a highly nonlinear flapping-wing drone. The results demonstrate that our method indeed provides formal safety guarantees while maintaining accurate tracking performance under Koopman modeling uncertainty.
comment: 8 pages, 6 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA). The final published version will be available via IEEE Xplore
CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation
We present CataractSAM-2, a domain-adapted extension of Meta's Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.
Auction-Based Task Allocation with Energy-Conscientious Trajectory Optimization for AMR Fleets
This paper presents a hierarchical two-stage framework for multi-robot task allocation and trajectory optimization in asymmetric task spaces: (1) a sequential auction allocates tasks using closed-form bid functions, and (2) each robot independently solves an optimal control problem for energy-minimal trajectories with a physics-based battery model, followed by a collision avoidance refinement step using pairwise proximity penalties. Event-triggered warm-start rescheduling with bounded trigger frequency handles robot faults, priority arrivals, and energy deviations. Across 505 scenarios with 2-20 robots and up to 100 tasks on three factory layouts, both energy- and distance-based auction variants achieve 11.8% average energy savings over nearest-task allocation, with rescheduling latency under 10 ms. The central finding is that bid-metric performance is regime-dependent: in uniform workspaces, distance bids outperform energy bids by 3.5% (p < 0.05, Wilcoxon) because a 15.7% closed-form approximation error degrades bid ranking accuracy to 87%; however, when workspace friction heterogeneity is sufficient (r < 0.85 energy-distance correlation), a zone-aware energy bid outperforms distance bids by 2-2.4%. These results provide practitioner guidance: use distance bids in near-uniform terrain and energy-aware bids when friction variation is significant.
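As a rough illustration of the first stage described above, the following is a minimal sketch of a sequential single-item auction with a distance-based bid. The abstract does not give the closed-form bid functions, so `distance_bid`, the 2D point representation, and the rule that a robot bids from its most recently won task are all assumptions made for illustration. The trajectory optimization, battery model, and rescheduling stages are not modeled.

```python
import math

def distance_bid(position, task):
    """Hypothetical bid: straight-line distance from the robot's
    current planning position to the task location."""
    return math.hypot(task[0] - position[0], task[1] - position[1])

def sequential_auction(robots, tasks, bid=distance_bid):
    """Allocate tasks one per round to the lowest bidder.

    robots: dict mapping robot name -> (x, y) start position
    tasks:  list of (x, y) task locations
    Returns a dict mapping robot name -> list of task indices in award order.
    """
    positions = dict(robots)
    assignment = {name: [] for name in robots}
    remaining = list(range(len(tasks)))
    while remaining:
        # Every robot bids on every unallocated task; the lowest bid wins.
        _, winner, task_id = min(
            (bid(positions[name], tasks[t]), name, t)
            for name in positions
            for t in remaining
        )
        assignment[winner].append(task_id)
        positions[winner] = tasks[task_id]  # next bids start from the won task
        remaining.remove(task_id)
    return assignment

robots = {"r1": (0.0, 0.0), "r2": (10.0, 0.0)}
tasks = [(1.0, 0.0), (8.5, 0.0), (3.0, 0.0)]
allocation = sequential_auction(robots, tasks)
print(allocation)
```

Replacing `distance_bid` with an energy estimate would give the energy-based variant in spirit. The abstract's finding is that which bid metric performs better depends on how strongly energy and distance are correlated in the workspace.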
SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems
Large Language Models (LLMs), deep learning architectures with typically over 10 billion parameters, have recently begun to be integrated into various cyber-physical systems (CPS) such as robotics, industrial automation, and autopilot systems. The abstract knowledge and reasoning capabilities of LLMs are employed for tasks like planning and navigation. However, a significant challenge arises from the tendency of LLMs to produce "hallucinations" - outputs that are coherent yet factually incorrect or contextually unsuitable. This characteristic can lead to undesirable or unsafe actions in the CPS. Therefore, our research focuses on assuring the LLM-enabled CPS by enhancing their critical properties. We propose SafePilot, a novel hierarchical neuro-symbolic framework that provides end-to-end assurance for LLM-enabled CPS according to attribute-based and temporal specifications. Given a task and its specification, SafePilot first invokes a hierarchical planner with a discriminator that assesses task complexity. If the task is deemed manageable, it is passed directly to an LLM-based task planner with built-in verification. Otherwise, the hierarchical planner applies a divide-and-conquer strategy, decomposing the task into sub-tasks, each of which is individually planned and later merged into a final solution. The LLM-based task planner translates natural language constraints into formal specifications and verifies the LLM's output against them. If violations are detected, it identifies the flaw, adjusts the prompt accordingly, and re-invokes the LLM. This iterative process continues until a valid plan is produced or a predefined limit is reached. Our framework supports LLM-enabled CPS with both attribute-based and temporal constraints. Its effectiveness and adaptability are demonstrated through two illustrative case studies.
comment: 12 pages, 8 figures
A Framework for Closed-Loop Robotic Assembly, Alignment and Self-Recovery of Precision Optical Systems
Robotic automation has transformed scientific workflows in domains such as chemistry and materials science, yet free-space optics, a high-precision domain, remains largely manual. Optical systems impose strict spatial and angular tolerances, and their performance is governed by tightly coupled physical parameters, making generalizable automation particularly challenging. In this work, we present a robotics framework for the autonomous construction, alignment, and maintenance of precision optical systems. Our approach integrates hierarchical computer vision systems, optimization routines, and custom-built tools to achieve this functionality. As a representative demonstration, we perform the fully autonomous construction of a tabletop laser cavity from randomly distributed components. The system performs several tasks such as laser beam centering, spatial alignment of multiple beams, resonator alignment, laser mode selection, and self-recovery from induced misalignment and disturbances. By achieving closed-loop autonomy for highly sensitive optical systems, this work establishes a foundation for autonomous optical experiments for applications across technical domains.
GaussianSSC: Triplane-Guided Directional Gaussian Fields for 3D Semantic Completion
We present GaussianSSC, a two-stage, grid-native and triplane-guided approach to semantic scene completion (SSC) that injects the benefits of Gaussians without replacing the voxel grid or maintaining a separate Gaussian set. We introduce Gaussian Anchoring, a sub-pixel, Gaussian-weighted image aggregation over fused FPN features that tightens voxel-image alignment and improves monocular occupancy estimation. We further convert point-like voxel features into a learned per-voxel Gaussian field and refine triplane features via a triplane-aligned Gaussian-Triplane Refinement module that combines local gathering (target-centric) and global aggregation (source-centric). This directional, anisotropic support captures surface tangency, scale, and occlusion-aware asymmetry while preserving the efficiency of triplane representations. On SemanticKITTI, GaussianSSC improves Stage 1 occupancy by +1.0% Recall, +2.0% Precision, and +1.8% IoU over state-of-the-art baselines, and improves Stage 2 semantic prediction by +1.8% IoU and +0.8% mIoU.
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping CVPR 2026
Active mapping aims to determine how an agent should move to efficiently reconstruct an unknown environment. Most existing approaches rely on greedy next-best-view prediction, resulting in inefficient exploration and incomplete scene reconstruction. To address this limitation, we introduce MAGICIAN, a novel long-term planning framework that maximizes accumulated surface coverage gain through Imagined Gaussians, a scene representation derived from a pre-trained occupancy network with strong structural priors. This representation enables efficient computation of coverage gain for any novel viewpoint via fast volumetric rendering, allowing its integration into a tree-search algorithm for long-horizon planning. We update Imagined Gaussians and refine the planned trajectory in a closed-loop manner. Our method achieves state-of-the-art performance across indoor and outdoor benchmarks with varying action spaces, demonstrating the critical advantage of long-term planning in active mapping.
comment: Accepted at CVPR 2026. Project webpage: https://shiyao-li.github.io/magician/
Trajectory Generation for Underactuated Soft Robot Manipulators using Discrete Elastic Rod Dynamics
Soft robots are well suited for contact-rich tasks due to their compliance, yet this property makes accurate and tractable modeling challenging. Planning motions with dynamically-feasible trajectories requires models that capture arbitrary deformations, remain computationally efficient, and are compatible with underactuation. However, existing approaches balance these properties unevenly: continuum rod models provide physical accuracy but are computationally demanding, while reduced-order approximations improve efficiency at the cost of modeling fidelity. To address this, our work introduces a control-oriented reformulation of Discrete Elastic Rod (DER) dynamics for soft robots, and a method to generate trajectories with these dynamics. The proposed formulation yields a control-affine representation while preserving certain first-principles force-deformation relationships. As a result, the generated trajectories are both dynamically feasible and consistent with the underlying actuation assumptions. We present our trajectory generation framework and validate it experimentally on a pneumatic soft robotic limb. Hardware results demonstrate consistently improved trajectory tracking performance over a constant-curvature-based baseline, particularly under complex actuation conditions.
A vision-language model and platform for temporally mapping surgery from video
Mapping surgery is fundamental to developing operative guidelines and enabling autonomous robotic surgery. Recent advances in artificial intelligence (AI) have shown promise in mapping the behaviour of surgeons from videos, yet current models remain narrow in scope, capturing limited behavioural components within single procedures, and offer limited translational value, as they remain inaccessible to practising surgeons. Here we introduce Halsted, a vision-language model trained on the Halsted Surgical Atlas (HSA), one of the most comprehensive annotated video libraries, grown through an iterative self-labelling framework and encompassing over 650,000 videos across eight surgical specialties. To facilitate benchmarking, we publicly release HSA-27k, a subset of the Halsted Surgical Atlas. Halsted surpasses previous state-of-the-art models in mapping surgical activity while offering greater comprehensiveness and computational efficiency. To bridge the longstanding translational gap of surgical AI, we develop the Halsted web platform (https://halstedhealth.ai/) to provide surgeons anywhere in the world with the previously unavailable capability of automatically mapping their own procedures within minutes. By standardizing unstructured surgical video data and making these capabilities directly accessible to surgeons, our work brings surgical AI closer to clinical deployment and helps pave the way toward autonomous robotic surgery.
Task-Agnostic Exoskeleton Control Supports Elderly Joint Energetics during Hip-Intensive Tasks
Age-related mobility decline is frequently accompanied by a redistribution of joint kinetics, where older adults compensate for reduced ankle function by increasing demand on the hip. Paradoxically, this compensatory shift typically coincides with age-related reductions in maximal hip power. Although robotic exoskeletons can provide immediate energetic benefits, conventional control strategies have limited previous studies in this population to specific tasks such as steady-state walking, which do not fully reflect mobility demands in the home and community. Here, we implement a task-agnostic hip exoskeleton controller that is inherently sensitive to joint power and validate its efficacy in eight older adults. Across a battery of hip-intensive activities that included level walking, ramp ascent, stair climbing, and sit-to-stand transitions, the exoskeleton matched biological power profiles with high accuracy (mean cosine similarity 0.89). Assistance significantly reduced sagittal plane biological positive work by 24.7% at the hip and by 9.3% for the lower limb, while simultaneously augmenting peak total (biological + exoskeleton) hip power and reducing peak biological hip power. These results suggest that hip exoskeletons can potentially enhance endurance through biological work reduction, and increase functional reserve through total power augmentation, serving as a promising biomechanical intervention to support older adults' mobility.
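The accuracy figure above is a cosine similarity, which compares the shape of two signals and ignores their overall magnitude. The sketch below shows the standard definition applied to two sampled power profiles. The sample values are made up for illustration, and how the authors windowed, normalized, or averaged their profiles is not stated in the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero sequences."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of samples")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hip power samples over one movement cycle (arbitrary units).
biological = [0.1, 0.6, 1.0, 0.4, -0.2]
exoskeleton = [0.0, 0.3, 0.5, 0.3, -0.1]

similarity = cosine_similarity(biological, exoskeleton)
print(f"cosine similarity: {similarity:.2f}")
```

Because the measure is scale-invariant, an assistance profile at a fraction of the biological amplitude can still score close to 1 when its timing and shape match.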
GIFT: Generalizing Intent for Flexible Test-Time Rewards ICRA '26
Robots learn reward functions from user demonstrations, but these rewards often fail to generalize to new environments. This failure occurs because learned rewards latch onto spurious correlations in training data rather than the underlying human intent that demonstrations represent. Existing methods leverage visual or semantic similarity to improve robustness, yet these surface-level cues often diverge from what humans actually care about. We present Generalizing Intent for Flexible Test-Time Rewards (GIFT), a framework that grounds reward generalization in human intent rather than surface cues. GIFT leverages language models to infer high-level intent from user demonstrations by contrasting preferred with non-preferred behaviors. At deployment, GIFT maps novel test states to behaviorally equivalent training states via intent-conditioned similarity, enabling learned rewards to generalize across distribution shifts without retraining. We evaluate GIFT on tabletop manipulation tasks with new objects and layouts. Across four simulated tasks with over 50 unseen objects, GIFT consistently outperforms visual and semantic similarity baselines in test-time pairwise win rate and state-alignment F1 score. Real-world experiments on a 7-DoF Franka Panda robot demonstrate that GIFT reliably transfers to physical settings. Further discussion can be found at https://mit-clear-lab.github.io/GIFT/
comment: To appear at IEEE ICRA '26
Allometric Scaling Laws for Bipedal Robots
Scaling the design of robots up or down remains a fundamental challenge. While biological systems follow well-established isometric and allometric scaling laws relating mass, stride frequency, velocity, and torque, it is unclear how these relationships translate to robotic systems. In this paper, we generate similar allometric scaling laws for bipedal robots across three orders of magnitude in leg length. First, we conduct a review of legged robots from the literature and extract empirical relationships between leg length (L), body length, mass, and speed. These data show that robot mass scales more closely to L^2, in contrast to the L^3 scaling predicted by isometric scaling. We then perform controlled simulation studies in Drake using three variants of real quasi-passive, hip-actuated walkers with different foot geometries and control strategies. We evaluate the performance of each design scaled with leg length, L. Across all robots, walking velocity follows the expected L^(1/2) trend from dynamic similarity. Minimum required torque scales more closely with m*L than the isometric model of m*L^2. Foot geometry scaled proportionally with L^1. These results provide new insight into how robot designs allometrically scale to different sizes, and how that scaling is different from isometric or biological scaling laws.
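A scaling exponent of the kind reported above is commonly estimated by a straight-line fit in log-log space. The sketch below is only an illustration under that assumption: the function name `fit_power_law` and the synthetic numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_power_law(lengths, values):
    """Fit values ~ c * lengths**k by least squares in log-log space; return (k, c)."""
    log_l = np.log(np.asarray(lengths, dtype=float))
    log_v = np.log(np.asarray(values, dtype=float))
    # polyfit with degree 1 returns [slope, intercept]
    k, log_c = np.polyfit(log_l, log_v, 1)
    return k, np.exp(log_c)

# Synthetic illustration only (NOT data from the paper):
# masses generated exactly as 3 * L^2 over three orders of magnitude in L.
leg_lengths = np.array([0.01, 0.1, 1.0, 10.0])
masses = 3.0 * leg_lengths**2
k, c = fit_power_law(leg_lengths, masses)
print(f"exponent = {k:.2f}, prefactor = {c:.2f}")
```

An exponent near 2 from such a fit would correspond to the L^2 mass trend described in the abstract, while isometric scaling would give an exponent near 3.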
Learning Sidewalk Autopilot from Multi-Scale Imitation with Corrective Behavior Expansion
Sidewalk micromobility is a promising solution for last-mile transportation, but current learning-based control methods struggle in complex urban environments. Imitation learning (IL) learns policies from human demonstrations, yet its reliance on fixed offline data often leads to compounding errors, limited robustness, and poor generalization. To address these challenges, we propose a framework that advances IL through corrective behavior expansion and multi-scale imitation learning. On the data side, we augment teleoperation datasets with diverse corrective behaviors and sensor augmentations to enable the policy to learn to recover from its own mistakes. On the model side, we introduce a multi-scale IL architecture that captures both short-horizon interactive behaviors and long-horizon goal-directed intentions via horizon-based trajectory clustering and hierarchical supervision. Real-world experiments show that our approach significantly improves robustness and generalization in diverse sidewalk scenarios.
Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation
Mapping is essential in robotics and autonomous systems because it provides the spatial foundation for path planning. Efficient mapping enables planning algorithms to generate reliable paths while ensuring safety and adapting in real time to complex environments. Fixed-resolution mapping methods often produce overly conservative obstacle representations that lead to suboptimal paths or planning failures in cluttered scenes. To address this issue, we introduce Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that maximizes available free space and supports multi-threaded computation. To the best of our knowledge, POMP is the first method that, at a fixed occupancy-grid resolution, refines the representation of free space while preserving map fidelity and compatibility with existing search-based planners. It can therefore be integrated into existing planning pipelines, yielding higher pathfinding success rates and shorter path lengths, especially in cluttered environments, while substantially improving computational efficiency.
Energy-Aware Collaborative Exploration for a UAV-UGV Team
We present an energy-aware collaborative exploration framework for a UAV-UGV team operating in unknown environments, where the UAV's energy constraint is modeled as a maximum flight-time limit. The UAV executes a sequence of energy-bounded exploration tours, while the UGV simultaneously explores on the ground and serves as a mobile charging station. Rendezvous is enforced under a shared time budget so that the vehicles meet at the end of each tour before the UAV reaches its flight-time limit. We construct a sparsely coupled air-ground roadmap using a density-aware layered probabilistic roadmap (PRM) and formulate tour selection over the roadmap as coupled orienteering problems (OPs) to maximize information gain subject to the rendezvous constraint. The resulting tours are constructed over collision-validated roadmap edges. We validate our method through simulation studies, benchmark comparisons, and real-world experiments.
MapForest: A Modular Field Robotics System for Forest Mapping and Invasive Species Localization
Monitoring and controlling invasive tree species across large forests, parks, and trail networks is challenging due to limited accessibility, reliance on manual scouting, and degraded under-canopy GNSS. We present MapForest, a modular field robotics system that transforms multi-modal sensor data into GIS-ready invasive-species maps. Our system features: (i) a compact, platform-agnostic sensing payload that can be rapidly mounted on UAV, bicycle, or backpack platforms, and (ii) a software pipeline comprising LiDAR-inertial mapping, image-based invasive-species detection, and georeferenced map generation. To ensure reliable operation in GNSS-intermittent environments, we enhance a LiDAR-inertial mapping backbone with covariance-aware GNSS factors and robust loss kernels. We train an object detector to detect the Tree-of-Heaven (Ailanthus altissima) from onboard RGB imagery and fuse detections with the reconstructed map to produce geospatial outputs suitable for downstream decision making. We collected a dataset spanning six sites across urban environments, parks, trails, and forests to evaluate individual system modules, and report end-to-end results on two sites containing Tree-of-Heaven. The enhanced mapping module achieved a trajectory deviation error of 1.95 m over a 1.2 km forest traversal, and the Tree-of-Heaven detector achieved an F1 score of 0.653. The datasets and associated tooling are released to support reproducible research in forest mapping and invasive-species monitoring.
comment: 8 pages, 9 figures. Under review
Wake Up to the Past: Using Memory to Model Fluid Wake Effects on Robots IROS 2026
Autonomous aerial and aquatic robots that attain mobility by perturbing their medium, such as multicopters and torpedoes, produce wake effects that act as disturbances for adjacent robots. Wake effects are hard to model and predict due to the chaotic spatio-temporal dynamics of the fluid, entangled with the physical geometry of the robots and their complex motion patterns. Data-driven approaches using neural networks typically learn a memory-less function that maps the current states of the two robots to a force observed by the "sufferer" robot. Such models often perform poorly in agile scenarios: since the wake effect has a finite propagation time, the disturbance observed by a sufferer robot is some function of relative states in the past. In this work, we present an empirical study of the properties a wake-effect predictor must satisfy to accurately model the interactions between two robots mediated by a fluid. We explore seven data-driven models designed to capture the spatio-temporal evolution of fluid wake effects in four different media. This allows us to introspect the models and analyze the reasons why certain features enable improved accuracy in prediction across predictors and fluids. As experimental validation, we develop a planar rectilinear gantry for two spinning monocopters to test on real-world data with feedback control. We conclude that providing a history of previous states as input and predicting the transport delay substantially help in learning an accurate wake-effect predictor.
comment: 8 pages, 7 figures. Submitted to IROS 2026. Project website: https://sites.google.com/view/wake-up-to-the-past
CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation
"Code-as-Policy" considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet the effectiveness of Code-as-Policy agents as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe that this gap can be mitigated: scaling agentic test-time computation through multi-turn interaction, structured execution feedback, visual differencing, automatic skill synthesis, and ensembled reasoning substantially improves robustness even when agents operate over low-level primitives. These findings allow us to derive CaP-Agent0, a training-free framework that recovers human-level reliability on several manipulation tasks in simulation and on real embodiments. We further introduce CaP-RL, showing that reinforcement learning with verifiable rewards improves success rates and transfers from simulation to the real world with minimal gap. Together, CaP-X provides a principled, open-access platform for advancing embodied coding agents.
Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling
Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, where the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. For evaluation, Video2Act surpasses previous state-of-the-art VLA methods by 7.7% in simulation and 21.7% in real-world tasks in terms of average success rate, further exhibiting strong generalization capabilities.
VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
Navigating unseen, large-scale environments based on complex and abstract human instructions remains a formidable challenge for autonomous mobile robots. Addressing this requires robots to infer implicit semantics and efficiently explore large-scale task spaces. However, existing methods, ranging from end-to-end learning to foundation model-based modular architectures, often lack the capability to decompose complex tasks or employ efficient exploration strategies, leading to aimless wandering or target recognition failures. To address these limitations, we propose VL-Nav, a neuro-symbolic (NeSy) vision-language navigation system. The proposed system intertwines neural reasoning with symbolic guidance through two core components: (1) a NeSy task planner that leverages a symbolic 3D scene graph and an image memory system to enhance the neural reasoning capabilities of vision-language models (VLMs) for task decomposition and replanning; and (2) a NeSy exploration system that couples neural semantic cues with a symbolic heuristic function to efficiently gather task-related information while minimizing unnecessary repeat travel during exploration. Validated on the DARPA TIAMAT Challenge navigation tasks, our system achieved an 83.4% success rate (SR) in indoor environments and 75% in outdoor scenarios. VL-Nav achieved an 86.3% SR in real-world experiments, including a challenging 483-meter run. Finally, we validate the system with complex instructions in a 3D multi-floor scenario.
Semi-Infinite Programming for Collision-Avoidance in Optimal and Model Predictive Control
This paper presents a novel approach for collision avoidance in optimal and model predictive control, in which the environment is represented by a large number of points and the robot as a union of padded polygons. The conditions that none of the points shall collide with the robot can be written in terms of an infinite number of constraints per obstacle point. We show that the resulting semi-infinite programming (SIP) optimal control problem (OCP) can be efficiently tackled through a combination of two methods: local reduction and an external active-set method. Specifically, this involves iteratively identifying the closest point obstacles, determining the lower-level distance minimizer among all feasible robot shape parameters, and solving the upper-level finitely-constrained subproblems. In addition, this paper addresses robust collision avoidance in the presence of ellipsoidal state uncertainties. Enforcing constraint satisfaction over all possible uncertainty realizations extends the dimension of constraint infiniteness. The infinitely many constraints arising from translational uncertainty are handled by local reduction together with the robot shape parameterization, while rotational uncertainty is addressed via a backoff reformulation. A controller implemented based on the proposed method is demonstrated on a real-world robot running at 20Hz, enabling fast and collision-free navigation in tight spaces. An application to 3D collision avoidance is also demonstrated in simulation.
comment: 20 pages, 17 figures
Foundation Models for Trajectory Planning in Autonomous Driving: A Review of Progress and Open Challenges
The emergence of multi-modal foundation models has markedly transformed the technology for autonomous driving, shifting away from conventional and mostly hand-crafted design choices towards unified, foundation-model-based approaches, capable of directly inferring motion trajectories from raw sensory inputs. This new class of methods can also incorporate natural language as an additional modality, with Vision-Language-Action (VLA) models serving as a representative example. In this review, we provide a comprehensive examination of such methods through a unifying taxonomy to critically evaluate their architectural design choices, methodological strengths, and their inherent capabilities and limitations. Our survey covers 37 recently proposed approaches that span the landscape of trajectory planning with foundation models. Furthermore, we assess these approaches with respect to the openness of their source code and datasets, offering valuable information to practitioners and researchers. We provide an accompanying webpage that catalogues the methods based on our taxonomy, available at: https://github.com/fiveai/FMs-for-driving-trajectories
comment: Accepted to TMLR (Survey Certification)
Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Training resource-constrained autonomous agents on multiple tasks simultaneously is crucial for adapting to diverse real-world environments. Recent works employ reinforcement learning (RL) approaches, but they still suffer from sub-optimal multi-task performance due to task interference. State-of-the-art works employ Spiking Neural Networks (SNNs) to improve RL-based multi-task learning and enable low-power/energy operations through network enhancements and spike-driven data stream processing. However, they rely on fixed task-switching intervals during training, which limits their performance and scalability. To address this, we propose SwitchMT, a novel methodology that employs adaptive task-switching for effective, scalable, and simultaneous multi-task learning. SwitchMT employs the following key ideas: (1) leveraging a Deep Spiking Q-Network with active dendrites and a dueling structure that utilizes task-specific context signals to create specialized sub-networks; and (2) devising an adaptive task-switching policy that leverages both rewards and internal dynamics of the network parameters. Experimental results demonstrate that SwitchMT achieves competitive scores in multiple Atari games (i.e., Pong: -8.8, Breakout: 5.6, and Enduro: 355.2) and longer game episodes as compared to the state-of-the-art. These results also highlight the effectiveness of the SwitchMT methodology in addressing task interference without increasing the network complexity, enabling intelligent autonomous agents with scalable multi-task learning capabilities.
comment: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC), July 26-29, 2026 in Long Beach, CA, USA. [Codes: https://github.com/rachmadvwp/SwitchMT]
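For readers unfamiliar with the "dueling structure" mentioned in the abstract, the standard dueling aggregation combines a scalar state value with per-action advantages. The sketch below shows only that generic aggregation in NumPy; it is an assumption that SwitchMT uses this common form, and the spiking neurons, active dendrites, and context signals are not modeled here. The name `dueling_q_values` is hypothetical.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=float)
    # Subtracting the mean advantage keeps V and A identifiable.
    return state_value + advantages - advantages.mean()

# Illustrative numbers only: one state value and three action advantages.
q = dueling_q_values(1.0, [1.0, 2.0, 3.0])
greedy_action = int(np.argmax(q))
print(q, greedy_action)
```

The greedy action depends only on the advantages, while the state value shifts all Q-values equally.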
OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation
Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to model contact dynamics or enable closed-loop control explicitly. In this paper, we present OmniViTac, a large-scale visuo-tactile-action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.
comment: TARS Robotics Project Page: https://mrsecant.github.io/OmniVTA
KeySG: Hierarchical Keyframe-Based 3D Scene Graphs
In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM's context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLMs to extract scene information, alleviating the need to explicitly model relationship edges between objects, enabling more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical multi-modal retrieval-augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across three distinct benchmarks, 3D object semantic segmentation, functional element segmentation, and complex query retrieval, KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.
comment: Code and video are available at https://keysg-lab.github.io/
Data Scaling for Navigation in Unknown Environments
Generalization of imitation-learned navigation policies to environments unseen in training remains a major challenge. We address this by conducting the first large-scale study of how data quantity and data diversity affect real-world generalization in end-to-end, map-free visual navigation. Using a curated 4,565-hour crowd-sourced dataset collected across 161 locations in 35 countries, we train policies for point goal navigation and evaluate their closed-loop control performance on sidewalk robots operating in four countries, covering 125 km of autonomous driving. Our results show that large-scale training data enables zero-shot navigation in unknown environments, approaching the performance of policies trained with environment-specific demonstrations. Critically, we find that data diversity is far more important than data quantity. Doubling the number of geographical locations in a training set decreases navigation errors by ~15%, while performance benefit from adding data from existing locations saturates with very little data. We also observe that, with noisy crowd-sourced data, simple regression-based models outperform generative and sequence-based architectures. We release our policies, evaluation setup and example videos at https://lasuomela.github.io/navigation_scaling/.
comment: Robotics and Automation Letters (RA-L) 2026
Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals CVPR 2026
Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominos, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.
comment: Camera ready version (CVPR 2026). Code and interactive demos at https://goal-force.github.io/
Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts as a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.
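The low-pass intuition can be illustrated numerically. The sketch below assumes that temporal abstraction can be modeled as replacing the one-step transition matrix P with its k-step power P^k, which may differ from the paper's exact construction. The ring random walk, the threshold `tol`, and the helper `effective_rank` are all hypothetical choices for illustration. For a symmetric P, the eigenvalues of P^k are the k-th powers of those of P, so eigenvalues of small magnitude shrink fastest.

```python
import numpy as np

def effective_rank(eigvals, tol=1e-2):
    """Count eigenvalues with magnitude above tol (a simple, assumed rank proxy)."""
    return int(np.sum(np.abs(eigvals) > tol))

# Hypothetical symmetric lazy random walk on a ring of n states.
n = 20
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

k = 8  # assumed abstraction horizon
eig_1 = np.linalg.eigvalsh(P)                              # one-step spectrum
eig_k = np.linalg.eigvalsh(np.linalg.matrix_power(P, k))   # k-step spectrum

print("effective rank, 1-step:", effective_rank(eig_1))
print("effective rank, k-step:", effective_rank(eig_k))
```

In this toy setting, fewer eigenvalues of the k-step operator remain above the threshold, which is the rank-reduction effect the abstract attributes to temporal abstraction.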
From 2D to 3D terrain-following area coverage path planning SC 2026
An algorithm for 3D terrain-following area coverage path planning is presented. Multiple adjacent paths are generated that are (i) locally apart from each other by a distance equal to the working width of a machinery, while (ii) simultaneously floating at a projection distance equal to a specific working height above the terrain. The complexities of the algorithm in comparison to its 2D equivalent are highlighted. These include uniformly spaced elevation data generation using an Inverse Distance Weighting-approach and a local search. Area coverage path planning results for real-world 3D data within an agricultural context are presented to validate the algorithm.
comment: 6 pages, 10 figures, 1 table, IEEE ICARSC 2026
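Inverse Distance Weighting, mentioned above for generating uniformly spaced elevation data, estimates a value at a query point as a weighted average of nearby samples, with weights that decay with distance. The sketch below is a generic textbook version. The function name `idw_elevation`, the power of 2, and the sample points are assumptions, not details from the paper.

```python
import numpy as np

def idw_elevation(query_xy, sample_xy, sample_z, power=2.0, eps=1e-12):
    """Estimate elevation at query_xy from scattered samples via inverse distance weighting."""
    sample_xy = np.asarray(sample_xy, dtype=float)
    sample_z = np.asarray(sample_z, dtype=float)
    dists = np.linalg.norm(sample_xy - np.asarray(query_xy, dtype=float), axis=1)
    # If the query coincides with a sample, return that sample's elevation directly.
    if np.any(dists < eps):
        return float(sample_z[np.argmin(dists)])
    weights = 1.0 / dists**power
    return float(np.sum(weights * sample_z) / np.sum(weights))

# Illustrative samples only: two survey points with elevations 5 m and 7 m.
points = [[0.0, 0.0], [1.0, 0.0]]
heights = [5.0, 7.0]
print(idw_elevation([0.5, 0.0], points, heights))
```

Evaluating this function on a regular grid of query points would yield uniformly spaced elevation data from irregular measurements.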
Towards a Practical Understanding of Lagrangian Methods in Safe Reinforcement Learning
Safe reinforcement learning addresses constrained optimization problems where maximizing performance must be balanced against safety constraints, and Lagrangian methods are a widely used approach for this purpose. However, the effectiveness of Lagrangian methods depends crucially on the choice of the Lagrange multiplier $λ$, which governs the multi-objective trade-off between return and cost. A common practice is to update the multiplier automatically during training. Although this approach is standard in practice, there remains limited empirical evidence on the optimally achievable trade-off between return and cost as a function of $λ$, and there is currently no systematic benchmark comparing automated update mechanisms to this empirical optimum. Therefore, we study (i) the constraint geometry for eight widely used safety tasks and (ii) the previously overlooked constraint-regime sensitivity of different Lagrange multiplier update mechanisms in safe reinforcement learning. Through the lens of multi-objective analysis, we present empirical Pareto frontiers that offer a complete visualization of the trade-off between return and cost in the underlying optimization problem. Our results reveal the highly sensitive nature of $λ$ and further show that the restrictiveness of the constraint cost can vary across different cost limits within the same task. This highlights the importance of careful cost limit selection across different regions of cost restrictiveness when evaluating safe reinforcement learning methods. We provide a recommended set of cost limits for each evaluated task and offer an open-source code base: https://github.com/lindsayspoor/Lagrangian_SafeRL.
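The automatic multiplier update referred to above is often implemented as projected gradient ascent on the dual variable. The sketch below shows that common form only; the paper compares several update mechanisms, and this is not claimed to be any specific one of them. The function names, learning rate, and numbers are hypothetical.

```python
def update_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """Projected gradient-ascent step: lam <- max(0, lam + lr * (cost - limit))."""
    return max(0.0, lam + lr * (episode_cost - cost_limit))

def lagrangian_objective(ret, cost, lam, cost_limit):
    """Scalarized objective the policy maximizes for a fixed multiplier lam."""
    return ret - lam * (cost - cost_limit)

# Illustrative numbers only: the cost exceeds the limit, so lam grows.
lam = update_multiplier(0.0, episode_cost=30.0, cost_limit=25.0, lr=0.1)
print(lam, lagrangian_objective(ret=10.0, cost=30.0, lam=lam, cost_limit=25.0))
```

When the measured cost is above the limit the multiplier increases and penalizes cost more strongly. When the cost is below the limit the multiplier decays toward zero.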
Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models
Vision Language Action (VLA) models close the perception-action loop by translating multimodal instructions into executable behaviors, but this very capability magnifies safety risks: jailbreaks that merely yield toxic text in LLMs can trigger unsafe physical actions in embodied systems. Existing defenses (alignment, filtering, or prompt hardening) intervene too late or at the wrong modality, leaving fused representations exploitable. We introduce a concept-based dictionary learning framework for inference-time safety control. By learning sparse, interpretable dictionaries from hidden activations, our method identifies harmful concept directions and attenuates risky components when the estimated risk exceeds a threshold. Experiments on Libero-Harm, BadRobot, RoboPair, and IS-Bench show that our approach achieves state-of-the-art defense performance, cutting attack success rates by over 70% while maintaining task success. Crucially, the framework is plug-in and model-agnostic, requiring no retraining and integrating seamlessly with diverse VLAs. To our knowledge, this is the first inference-time concept-based safety method for embodied systems, advancing both interpretability and safe deployment of VLA models.
HortiMulti: A Multi-Sensor Dataset for Localisation and Mapping in Horticultural Polytunnels
Agricultural robotics is gaining increasing relevance in both research and real-world deployment. As these systems are expected to operate autonomously in more complex tasks, the availability of representative real-world datasets becomes essential. While domains such as urban and forestry robotics benefit from large and established benchmarks, horticultural environments remain comparatively under-explored despite the economic significance of this sector. To address this gap, we present HortiMulti, a multimodal, cross-season dataset collected in commercial strawberry and raspberry polytunnels across an entire growing season, capturing substantial appearance variation, dynamic foliage, specular reflections from plastic covers, severe perceptual aliasing, and GNSS-unreliable conditions, all of which directly degrade existing localisation and perception algorithms. The sensor suite includes two 3D LiDARs, four RGB cameras, an IMU, GNSS, and wheel odometry. Ground truth trajectories are derived from a combination of Total Station surveying, AprilTag fiducial markers, and LiDAR-inertial odometry, spanning dense, sparse, and marker-free coverage to support evaluation under both controlled and realistic conditions. We release time-synchronised raw measurements, calibration files, reference trajectories, and baseline benchmarks for visual, LiDAR, and multi-sensor SLAM, with results confirming that current state-of-the-art methods remain inadequate for reliable polytunnel deployment, establishing HortiMulti as a one-stop resource for developing and testing robotic perception systems in horticulture environments.
Differentiable Simulation of Hard Contacts with Soft Gradients for Learning and Control
Contact forces introduce discontinuities into robot dynamics that severely limit the use of simulators for gradient-based optimization. Penalty-based simulators such as MuJoCo soften contact resolution to enable gradient computation. However, realistically simulating hard contacts requires stiff solver settings, which leads to incorrect simulator gradients when using automatic differentiation. Conversely, using non-stiff settings strongly increases the sim-to-real gap. We analyze penalty-based simulators to pinpoint why gradients degrade under hard contacts. Building on these insights, we propose DiffMJX, which couples adaptive time integration with penalty-based simulation to substantially improve gradient accuracy. A second challenge is that contact gradients vanish when bodies separate. To address this, we introduce contacts from distance (CFD), which combines penalty-based simulation with straight-through estimation. By applying CFD exclusively in the backward pass, we obtain informative pre-contact gradients while retaining physical realism.
Mixed-Integer vs. Continuous Model Predictive Control for Binary Thrusters: A Comparative Study
Binary on/off thrusters are commonly used for spacecraft attitude and position control during proximity operations. However, their discrete nature poses challenges for conventional continuous control methods. The control of these discrete actuators is either explicitly formulated as a mixed-integer optimization problem or handled in a two-layer approach, where a continuous controller's output is converted to binary commands using analog-to-digital modulation techniques such as Delta-Sigma modulation. This paper provides the first systematic comparison between these two paradigms for binary thruster control, contrasting continuous Model Predictive Control (MPC) with Delta-Sigma modulation against direct Mixed-Integer MPC (MIMPC) approaches. Furthermore, we propose a new variant of MPC for binary actuated systems, which is informed by the state of the Delta-Sigma modulator. The two continuous MPC variants and the MIMPC are evaluated through extensive simulations using ESA's REACSA platform. Results demonstrate that while all approaches perform similarly in high-thrust regimes, MIMPC achieves superior fuel efficiency in low-thrust conditions. Continuous MPC with modulation shows instabilities at higher thrust levels, while binary informed MPC, which incorporates modulator dynamics, improves robustness and reduces the efficiency gap to the MIMPC. The simulated and real-system experiments show that MIMPC offers complete stability and fuel efficiency benefits, particularly for resource-constrained missions, while continuous control methods remain attractive for computationally limited applications.
comment: Accepted to CEAS EuroGNC 2026
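The two-layer approach in the binary-thruster abstract above converts a continuous controller output into on/off commands. A minimal sketch of a first-order Delta-Sigma modulator in its error-accumulating form is given below. The function name, the `full_scale` parameter, and the normalisation of commands to the range 0 to `full_scale` are assumptions made for illustration and are not taken from the paper.

```python
def delta_sigma_modulate(commands, full_scale=1.0):
    """Convert continuous thrust requests in [0, full_scale] to binary on/off commands.

    The integrator accumulates the requested thrust. Whenever it reaches one
    full thruster pulse, the thruster fires and that pulse is subtracted, so the
    remaining quantisation error is carried into later steps.
    """
    integrator = 0.0
    bits = []
    for u in commands:
        integrator += u
        if integrator >= full_scale:
            bits.append(1)
            integrator -= full_scale
        else:
            bits.append(0)
    return bits


# A constant request of 25% thrust over eight control steps.
pulses = delta_sigma_modulate([0.25] * 8)
```

With a constant 25% request, the thruster fires on every fourth step, so the average delivered thrust matches the request. The integrator value is the modulator state that the binary informed MPC variant described in the abstract takes into account.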
A Real-Time System for Scheduling and Managing UAV Delivery in Urban Areas
As urban logistics demand continues to grow, UAV delivery has become a key solution to improve delivery efficiency, reduce traffic congestion, and lower logistics costs. However, to fully leverage the potential of UAV delivery networks, efficient swarm scheduling and management are crucial. In this paper, we propose a real-time scheduling and management system based on the "Airport-Unloading Station" model, aiming to bridge the gap between high-level scheduling algorithms and low-level execution systems. This system, acting as middleware, accurately translates the requirements from the scheduling layer into specific execution instructions, ensuring that the scheduling algorithms perform effectively in real-world environments. Additionally, we implement three collaborative scheduling schemes involving autonomous ground vehicles (AGVs), unmanned aerial vehicles (UAVs), and ground staff to further optimize overall delivery efficiency. Through extensive experiments, this study demonstrates the rationality and feasibility of the proposed management system, providing a practical solution for the commercial application of UAV delivery in urban areas. Code: https://github.com/chengji253/UAVDeliverySystem
comment: ROBIO 2025
Learning to Sample: Reinforcement Learning-Guided Sampling for Autonomous Vehicle Motion Planning
Sampling-based motion planning is a well-established approach in autonomous driving, valued for its modularity and analytical tractability. In complex urban scenarios, however, uniform or heuristic sampling often produces many infeasible or irrelevant trajectories. We address this limitation with a hybrid framework that learns where to sample while keeping trajectory generation and evaluation fully analytical and verifiable. A reinforcement learning (RL) agent guides the sampling process toward regions of the action space likely to yield feasible trajectories, while evaluation and final selection remain governed by deterministic feasibility checks and cost functions. We couple the RL sampler with a world model (WM) based on a decodable deep set encoder, enabling both variable numbers of traffic participants and reconstructable latent representations. The approach is evaluated in the CommonRoad (CR) simulation environment and compared against uniform-sampling baselines, showing up to 99% fewer required samples and a runtime reduction of up to 84% while maintaining planning quality in terms of success and collision-free rates. These improvements lead to faster, more reliable decision-making for autonomous vehicles in urban environments.
comment: 8 pages, submitted to the IEEE for possible publication
A Tactile-based Interactive Motion Planner for Robots in Unknown Cluttered Environments
In unknown cluttered environments with densely stacked objects, the free-motion space is extremely limited, posing significant challenges to motion planners. Collision-free planning methods often suffer from catastrophic failures due to unexpected collisions and motion obstructions. To address this issue, this paper proposes an interactive motion planning framework (I-MP), based on a perception-motion loop. This framework empowers robots to autonomously model and reason about contact models, which in turn enables safe expansion of the free-motion space. Specifically, the robot utilizes multimodal tactile perception to acquire stimulus-response signal pairs. This enables real-time identification of objects' mechanical properties and the subsequent construction of contact models. These models are integrated as computational constraints into a reactive planner. Based on fixed-point theorems, the planner computes the spatial state toward the target in real time, thus avoiding the computational burden associated with extrapolating on high-dimensional interaction models. Furthermore, high-dimensional interaction features are linearly superposed in Cartesian space in the form of energy, and the controller achieves trajectory tracking by solving the energy gradient from the current state to the planned state. The experimental results showed that at cruising speeds ranging from 0.01 to 0.07 m/s, the robot's initial contact force with objects remained stable at 1.0 ± 0.7 N. In the cabinet scenario test, where collision-free trajectories were unavailable, I-MP expanded the free-motion space by 37.5% through active interaction, successfully completing the environmental exploration task.
A User-driven Design Framework for Robotaxi
Robotaxis are emerging as a promising form of urban mobility, but removing human drivers fundamentally reshapes passenger-vehicle interaction and raises new design challenges. To inform robotaxi design based on real-world experience, we conducted 18 semi-structured interviews and autoethnographic ride experiences to examine users' perceptions, experiences, and expectations for robotaxi design. We found that users valued benefits such as increased agency and consistent driving. However, they also encountered challenges such as limited flexibility, insufficient transparency, and emergency handling concerns. Notably, users perceived robotaxis not merely as a mode of transportation, but as autonomous, semi-private transitional spaces in which they felt less socially intrusive when engaging in personal activities. Safety perceptions were polarized: some felt anxiety about reduced control, while others viewed robotaxis as safer than human drivers due to their cautious, law-abiding nature. Based on the findings, we propose a user-driven design framework spanning hailing, pick-up, traveling, and drop-off phases to support trustworthy, transparent, and accountable robotaxi design.
Inverse-dynamics observer design for a linear single-track vehicle model with distributed tire dynamics
Accurate estimation of the vehicle's sideslip angle and tire forces is essential for enhancing safety and handling performance in unknown driving scenarios. To this end, the present paper proposes an innovative observer that combines a linear single-track model with a distributed representation of the tires and information collected from standard sensors. In particular, by adopting a comprehensive representation of the tires in terms of hyperbolic partial differential equations (PDEs), the proposed estimation strategy exploits dynamical inversion to reconstruct the lumped and distributed vehicle states solely from yaw rate and lateral acceleration measurements. Simulation results demonstrate the effectiveness of the observer in estimating the sideslip angle and tire forces even in the presence of noise and model uncertainties.
comment: 6 pages, 5 figures. Accepted at ECC 2026
Efficient View Planning Guided by Previous-Session Reconstruction for Repeated Plant Monitoring
Repeated plant monitoring is essential for tracking crop growth, and 3D reconstruction enables consistent comparison across monitoring sessions. However, rebuilding a 3D model from scratch in every session is costly and overlooks informative geometry already observed previously. We propose efficient view planning guided by a previous-session reconstruction, which reuses a 3D model from the previous session to improve active perception in the current session. Based on this previous-session reconstruction, our method replaces iterative next-best-view planning with one-shot view planning that selects an informative set of views and computes the globally shortest execution path connecting them. Experiments on real multi-session datasets, including public single-plant scans and a newly collected greenhouse crop-row dataset, show that our method achieves comparable or higher surface coverage with fewer executed views and shorter robot paths than iterative and one-shot baselines.
comment: Submitted for review
Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning
Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.
PhysMem: Self-Evolving Physical Memory for Robot Manipulation
Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.
SVBRD-LLM: Self-Verifying Behavioral Rule Discovery for Autonomous Vehicle Identification
As autonomous vehicles (AVs) are increasingly deployed on public roads, understanding their real-world behaviors is critical for traffic safety analysis and regulatory oversight. However, many data-driven methods lack interpretability and cannot provide verifiable explanations of AV behavior in mixed traffic. This paper proposes SVBRD-LLM, a self-verifying behavioral rule discovery framework that automatically extracts interpretable behavioral rules from real-world traffic videos through zero-shot large language model (LLM) reasoning. The framework first derives vehicle trajectories using YOLOv26-based detection and ByteTrack-based tracking, then computes kinematic features and contextual information. It then employs GPT-5 zero-shot prompting to perform comparative behavioral analysis between AVs and human-driven vehicles (HDVs) across lane-changing and normal driving behaviors, generating 26 structured rule hypotheses that comprise both numerical thresholds and statistical behavioral patterns. These rules are subsequently evaluated through the AV identification task using an independent validation dataset, and iteratively refined through failure case analysis to filter spurious correlations and improve robustness. The resulting rule library contains 20 high-confidence behavioral rules, each including a semantic description, quantitative thresholds or behavioral patterns, applicable context, and validation confidence. Experiments conducted on over 1,500 hours of real-world traffic videos from Waymo's commercial operating area demonstrate that the proposed framework achieves 90.0% accuracy and 93.3% F1-score in AV identification, with 98.0% recall. The discovered rules capture key AV traits in smoothness, conservatism, and lane discipline, informing safety assessment, regulatory compliance, and traffic management in mixed traffic. The dataset is available at: svbrd-llm-roadside-video-av.
Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion
Imitation learning is promising for robotic manipulation, but \emph{precise insertion} in the real world remains difficult due to contact-rich dynamics, tight clearances, and limited demonstrations. Many existing visuomotor policies depend on high-dimensional RGB/point-cloud observations, which can be data-inefficient and generalize poorly under pose variations. In this paper, we study pose-guided imitation learning by using object poses in $\mathrm{SE}(3)$ as compact, object-centric observations for precise insertion tasks. First, we propose a diffusion policy for precise insertion that observes the \emph{relative} $\mathrm{SE}(3)$ pose of the source object with respect to the target object and predicts a future relative pose trajectory as its action. Second, to improve robustness to pose estimation noise, we augment the pose-guided policy with RGBD cues. Specifically, we introduce a goal-conditioned RGBD encoder to capture the discrepancy between current and goal observations. We further propose a pose-guided residual gated fusion module, where pose features provide the primary control signal and RGBD features adaptively compensate when pose estimates are unreliable. We evaluate our methods on six real-robot precise insertion tasks and achieve high performance with only $7$--$10$ demonstrations per task. In our setup, the proposed policies succeed on tasks with clearances down to $0.01$~mm and demonstrate improved data efficiency and generalization over existing baselines. Code will be available at https://github.com/sunhan1997/PoseInsert.
Multiagent Systems
Human-Inspired Pavlovian and Instrumental Learning for Autonomous Agent Navigation
Autonomous agents operating in uncertain environments must balance fast responses with goal-directed planning. Classical model-free (MF) reinforcement learning (RL) often converges slowly and may induce unsafe exploration, whereas model-based (MB) methods are computationally expensive and sensitive to model mismatch. This paper presents a human-inspired hybrid RL architecture integrating Pavlovian, Instrumental MF, and Instrumental MB components. Inspired by Pavlovian and Instrumental learning from neuroscience, the framework considers contextual radio cues, here intended as georeferenced environmental features acting as conditioned stimuli (CS), to shape intrinsic value signals and bias decision-making. Learning is further modulated by internal motivational drives through a dedicated motivational signal. A Bayesian arbitration mechanism adaptively blends MF and MB estimates based on predicted reliability. Simulation results show that the hybrid approach accelerates learning, improves operational safety, and reduces navigation in high-uncertainty regions compared to standard RL baselines. Pavlovian conditioning promotes safer exploration and faster convergence, while arbitration enables a smooth transition from exploration to efficient, plan-driven exploitation. Overall, the results highlight the benefits of biologically inspired modularity for robust and adaptive autonomous systems under uncertainty.
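One simple way to picture a reliability-based blend of MF and MB value estimates (read here as model-free and model-based) is precision weighting, the standard rule for fusing two Gaussian estimates. The sketch below is an assumption made for illustration: the abstract does not specify the arbitration rule, and the function name and the use of a single variance per controller are hypothetical.

```python
def arbitrate(q_mf, var_mf, q_mb, var_mb):
    """Blend two action-value vectors by inverse-variance (precision) weighting.

    var_mf and var_mb are positive uncertainty estimates for each controller.
    The less uncertain controller receives the larger weight.
    """
    w_mb = var_mf / (var_mf + var_mb)  # equals (1/var_mb) / (1/var_mf + 1/var_mb)
    blended = [(1.0 - w_mb) * a + w_mb * b for a, b in zip(q_mf, q_mb)]
    return blended, w_mb


# The MF estimate is three times as uncertain as the MB estimate,
# so the MB values dominate the blend.
values, weight_mb = arbitrate([1.0, 0.0], 3.0, [0.0, 1.0], 1.0)
```

As the uncertainty of one controller shrinks during learning, its weight rises smoothly, which is one way to obtain a gradual hand-over between the two controllers rather than a hard switch.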
Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control
Attention mechanisms excel at learning sequential patterns by discriminating data based on relevance and importance. This provides state-of-the-art performance in advanced generative artificial intelligence models. This paper applies the attention mechanism to safe multi-agent control. We specifically consider the design of a neural network to control autonomous vehicles in a highway merging scenario. The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Within a QMIX framework, we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles. Moreover, we propose a comprehensive reward signal that considers the global objectives of the environment (e.g., safety and vehicle flow) and the individual interests of each agent. Simulations are conducted in the Simulation of Urban Mobility (SUMO). The results show better performance compared to other driving algorithms in terms of safety, driving speed, and reward.
comment: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
Modal Logic for Distributed Trust
We propose a method for reasoning about trust in multi-agent systems, specifying a language for describing communication protocols and making trust assumptions and derivations. This is given an interpretation in a modal logic for describing the beliefs and communications of agents in a network. We define how information in the network can be shared via forwarding, and how trust between agents can be generalized to trust across networks. We give specifications for the modal logic which can be readily adapted into a lambda calculus of proofs. We show that by nesting modalities, we can describe chains of communication between agents, and establish suitable notions of trust for such chains. We see how this can be applied to trust models in public key infrastructures, as well as other interaction protocols in distributed systems.
comment: 32 pages
Can a Robot Walk the Robotic Dog: Triple-Zero Collaborative Navigation for Heterogeneous Multi-Agent Systems
We present Triple Zero Path Planning (TZPP), a collaborative framework for heterogeneous multi-robot systems that requires zero training, zero prior knowledge, and zero simulation. TZPP employs a coordinator--explorer architecture: a humanoid robot handles task coordination, while a quadruped robot explores and identifies feasible paths using guidance from a multimodal large language model. We implement TZPP on Unitree G1 and Go2 robots and evaluate it across diverse indoor and outdoor environments, including obstacle-rich and landmark-sparse settings. Experiments show that TZPP achieves robust, human-comparable efficiency and strong adaptability to unseen scenarios. By eliminating reliance on training and simulation, TZPP offers a practical path toward real-world deployment of heterogeneous robot cooperation. Our code and video are provided at: https://github.com/triple-zeropp/Triple-zero-robot-agent
comment: 8 pages, 2 figures
A Game-Theoretic Framework for Intelligent EV Charging Network Optimisation in Smart Cities SC 2025
The transition to Electric Vehicles (EVs) demands intelligent, congestion-aware infrastructure planning to balance user convenience, economic viability, and traffic efficiency. We present a joint optimisation framework for EV Charging Station (CS) placement and pricing, explicitly capturing strategic driver behaviour through coupled non-atomic congestion games over road networks and charging facilities. From a Public Authority (PA) perspective, the model minimises social cost, travel times, queuing delays and charging expenses, while ensuring infrastructure profitability. To solve the resulting Mixed-Integer Nonlinear Programme, we propose a scalable two-level approximation method, Joint Placement and Pricing Optimisation under Driver Equilibrium (JPPO-DE), combining driver behaviour decomposition with integer relaxation. Experiments on the benchmark Sioux Falls Transportation Network (TN) demonstrate that our method consistently outperforms single-parameter baselines, effectively adapting to varying budgets, EV penetration levels, and station capacities. It achieves performance improvements of at least 16% over state-of-the-art approaches. A generalisation procedure further extends scalability to larger networks. By accurately modelling traffic equilibria and enabling adaptive, efficient infrastructure design, our framework advances key intelligent transportation system goals for sustainable urban mobility.
comment: This paper has been accepted for publication in the Proceedings of the IEEE 28th International Conference on Intelligent Transportation Systems (ITSC 2025)
Strategic Infrastructure Design via Multi-Agent Congestion Games with Joint Placement and Pricing
Real-world infrastructure planning increasingly involves strategic interactions among autonomous agents competing over congestible, limited resources. Applications such as Electric Vehicle (EV) charging, emergency response, and intelligent transportation require coordinated resource placement and pricing decisions, while anticipating the adaptive behaviour of decentralised, self-interested agents. We propose a novel multi-agent framework for joint placement and pricing under such interactions, formalised as a bi-level optimisation model. The upper level represents a central planner, while the lower level captures agent responses via coupled non-atomic congestion games. Motivated by the EV charging domain, we study a setting where a central planner provisions chargers and road capacity under budget and profitability constraints. The agent population includes both EV drivers and non-charging drivers (NCDs), who respond to congestion, delays, and costs. To solve the resulting NP-hard problem, we introduce ABO-MPN, a double-layer approximation framework that decouples agent types, applies integer adjustment and rounding, and targets high-impact placement and pricing decisions. Experiments on benchmark networks show that our model reduces social cost by up to 40% compared to placement- or pricing-only baselines, and generalises to other MAS-relevant domains.
comment: This paper has been accepted for publication in the Proceedings of the 22nd European Conference on Multi-Agent Systems (EUMAS 2025)
Is AI Ready for Multimodal Hate Speech Detection? A Comprehensive Dataset and Benchmark Evaluation
Hate speech online targets individuals or groups based on identity attributes and spreads rapidly, posing serious social risks. Memes, which combine images and text, have emerged as a nuanced vehicle for disseminating hate speech, often relying on cultural knowledge for interpretation. However, existing multimodal hate speech datasets suffer from coarse-grained labeling and a lack of integration with surrounding discourse, leading to imprecise and incomplete assessments. To bridge this gap, we propose an agentic annotation framework that coordinates seven specialized agents to generate hierarchical labels and rationales. Based on this framework, we construct M^3 (Multi-platform, Multi-lingual, and Multimodal Meme), a dataset of 2,455 memes collected from X, 4chan, and Weibo, featuring fine-grained hate labels and human-verified rationales. Benchmarking state-of-the-art Multimodal Large Language Models reveals that these models struggle to effectively utilize surrounding post context, which often fails to improve or even degrades detection performance. Our findings highlight the challenges these models face in reasoning over memes embedded in real-world discourse and underscore the need for a context-aware multimodal architecture. Our dataset and code are available at https://github.com/mira-ai-lab/M3.
Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment
The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
comment: 17 pages, 5 figures, 4 tables, 2 supplementary figures, 3 supplementary tables
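The accuracy figures in the BT-RADS abstract above are reported with 95% confidence intervals. The abstract does not say which interval method was used, so the sketch below assumes a Wilson score interval with z = 1.96. The function name is hypothetical, and the counts 374/492 and 283/492 are taken from the abstract.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


system_lo, system_hi = wilson_interval(374, 492)      # automated system
clinical_lo, clinical_hi = wilson_interval(283, 492)  # initial clinical assessments
difference_pp = (374 - 283) / 492 * 100               # gap in percentage points
```

Under this assumption, the computed bounds come out close to the intervals reported in the abstract, and the difference between the two accuracies is about 18.5 percentage points.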
STRIATUM-CTF: A Protocol-Driven Agentic Framework for General-Purpose CTF Solving
Large Language Models (LLMs) have demonstrated potential in code generation, yet they struggle with the multi-step, stateful reasoning required for offensive cybersecurity operations. Existing research often relies on static benchmarks that fail to capture the dynamic nature of real-world vulnerabilities. In this work, we introduce STRIATUM-CTF (A Search-based Test-time Reasoning Inference Agent for Tactical Utility Maximization in Cybersecurity), a modular agentic framework built upon the Model Context Protocol (MCP). By standardizing tool interfaces for system introspection, decompilation, and runtime debugging, STRIATUM-CTF enables the agent to maintain a coherent context window across extended exploit trajectories. We validate this approach not merely on synthetic datasets, but in a live competitive environment. Our system participated in a university-hosted Capture-the-Flag (CTF) competition in late 2025, where it operated autonomously to identify and exploit vulnerabilities in real-time. STRIATUM-CTF secured First Place, outperforming 21 human teams and demonstrating strong adaptability in a dynamic problem-solving setting. We analyze the agent's decision-making logs to show how MCP-based tool abstraction significantly reduces hallucination compared to naive prompting strategies. These results suggest that standardized context protocols are a critical path toward robust autonomous cyber-reasoning systems.
comment: 8 pages, 7 pages
TrustTrade: Human-Inspired Selective Consensus Reduces Decision Uncertainty in LLM Trading Agents
Large language models (LLMs) are increasingly deployed as autonomous agents in financial trading. However, they often exhibit a hazardous behavioral bias that we term uniform trust, whereby retrieved information is implicitly assumed to be factual and heterogeneous sources are treated as equally informative. This assumption stands in sharp contrast to human decision-making, which relies on selective filtering, cross-validation, and experience-driven weighting of information sources. As a result, LLM-based trading systems are particularly vulnerable to multi-source noise and misinformation, amplifying factual hallucinations and leading to unstable risk-return performance. To bridge this behavioral gap, we introduce TrustTrade (Trust-Rectified Unified Selective Trader), a multi-agent selective consensus framework inspired by human epistemic heuristics. TrustTrade replaces uniform trust with cross-agent consistency by aggregating information from multiple independent LLM agents and dynamically weighting signals based on their semantic and numerical agreement. Consistent signals are prioritized, while divergent, weakly grounded, or temporally inconsistent inputs are selectively discounted. To further stabilize decision-making, TrustTrade incorporates deterministic temporal signals as reproducible anchors and a reflective memory mechanism that adapts risk preferences at test time without additional training. Together, these components suppress noise amplification and hallucination-driven volatility, yielding more stable and risk-aware trading behavior. Across controlled backtesting in high-noise market environments (2024 Q1 and 2026 Q1), the proposed TrustTrade calibrates LLM trading behavior from extreme risk-return regimes toward a human-aligned, mid-risk and mid-return profile.
comment: 24 pages, 7 figures
Energy-Aware Collaborative Exploration for a UAV-UGV Team
We present an energy-aware collaborative exploration framework for a UAV-UGV team operating in unknown environments, where the UAV's energy constraint is modeled as a maximum flight-time limit. The UAV executes a sequence of energy-bounded exploration tours, while the UGV simultaneously explores on the ground and serves as a mobile charging station. Rendezvous is enforced under a shared time budget so that the vehicles meet at the end of each tour before the UAV reaches its flight-time limit. We construct a sparsely coupled air-ground roadmap using a density-aware layered probabilistic roadmap (PRM) and formulate tour selection over the roadmap as coupled orienteering problems (OPs) to maximize information gain subject to the rendezvous constraint. The resulting tours are constructed over collision-validated roadmap edges. We validate our method through simulation studies, benchmark comparisons, and real-world experiments.
Wake Up to the Past: Using Memory to Model Fluid Wake Effects on Robots IROS 2026
Autonomous aerial and aquatic robots that attain mobility by perturbing their medium, such as multicopters and torpedoes, produce wake effects that act as disturbances for adjacent robots. Wake effects are hard to model and predict due to the chaotic spatio-temporal dynamics of the fluid, entangled with the physical geometry of the robots and their complex motion patterns. Data-driven approaches using neural networks typically learn a memory-less function that maps the current states of the two robots to a force observed by the "sufferer" robot. Such models often perform poorly in agile scenarios: since the wake effect has a finite propagation time, the disturbance observed by a sufferer robot is some function of relative states in the past. In this work, we present an empirical study of the properties a wake-effect predictor must satisfy to accurately model the interactions between two robots mediated by a fluid. We explore seven data-driven models designed to capture the spatio-temporal evolution of fluid wake effects in four different media. This allows us to introspect the models and analyze why certain features improve prediction accuracy across predictors and fluids. As experimental validation, we develop a planar rectilinear gantry for two spinning monocopters to test on real-world data with feedback control. We conclude that accepting a history of previous states as input and predicting the transport delay both substantially help in learning an accurate wake-effect predictor.
comment: 8 pages, 7 figures. Submitted to IROS 2026. Project website: https://sites.google.com/view/wake-up-to-the-past
AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents
The rise of Large Language Models (LLMs) as coding agents promises to accelerate software development, but their impact on generated code reproducibility remains largely unexplored. This paper presents an empirical study investigating whether LLM-generated code can be executed successfully in a clean environment with only OS packages and using only the dependencies that the model specifies. We evaluate three state-of-the-art LLM coding agents (Claude Code, OpenAI Codex, and Gemini) across 300 projects generated from 100 standardized prompts in Python, JavaScript, and Java. We introduce a three-layer dependency framework (distinguishing between claimed, working, and runtime dependencies) to quantify execution reproducibility. Our results show that only 68.3% of projects execute out-of-the-box, with substantial variation across languages (Python 89.2%, Java 44.0%). We also find a 13.5 times average expansion from declared to actual runtime dependencies, revealing significant hidden dependencies.
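A minimal sketch of how an expansion factor between dependency layers could be computed. The abstract names the three layers but does not define them here, so the interpretation in the comments, the package sets, and the helper `expansion_ratio` are all illustrative assumptions, not the study's data or tooling.

```python
# Hypothetical dependency sets for one generated project (illustrative only).
claimed = {"requests", "flask"}                    # assumed: what the model declares
working = {"requests", "flask", "python-dotenv"}   # assumed: what must be installed to run
runtime = {                                        # assumed: everything loaded at run time
    "requests", "flask", "python-dotenv", "urllib3", "certifi", "idna",
    "charset-normalizer", "werkzeug", "jinja2", "markupsafe", "click",
    "itsdangerous", "blinker",
}


def expansion_ratio(declared, actual):
    """Return how many times larger the actual set is than the declared one."""
    if not declared:
        raise ValueError("declared dependency set must be non-empty")
    return len(actual) / len(declared)


undeclared = working - claimed              # packages the model forgot to declare
ratio = expansion_ratio(claimed, runtime)   # declared-to-runtime expansion
print(sorted(undeclared), ratio)
```

Averaging such per-project ratios over a corpus would give a figure comparable in kind to the reported 13.5 times expansion, under the assumption that the ratio is defined this way.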
Systems and Control (EESS)
A Portfolio-Level Optimization Framework for Coordinated Market Participation and Operational Scheduling of Hydrogen-Centric Companies
The vision of electrolytic hydrogen as a clean energy vector prompts the emergence of hydrogen-centric companies that must simultaneously engage in electricity, hydrogen, and green certificate markets while operating complex, geographically distributed asset portfolios. This paper proposes a portfolio-level optimization framework tailored for the integrated operational scheduling and market participation of such companies. The model co-optimizes asset scheduling and market decisions across multiple sites, incorporating spatial distribution, technical constraints, and company-level policy requirements. It supports participation in the electricity market, physical and virtual Power Purchase Agreements (PPAs), bundled and unbundled hydrogen markets, and green certificate transactions. The model is applied to three operational scenarios to evaluate the economic and operational impacts of different compliance strategies. Results show that centralized, portfolio-level control unlocks the full flexibility of geographically distributed assets, enabling a 2.42-fold increase in hydrogen production and a 9.4% reduction in daily operational costs, while satisfying all company policy constraints.
Route-Phasing-Split-Encoded Genetic Algorithm for Multi-Satellite On-Orbit Servicing Mission Planning
This article addresses multi-servicer on-orbit servicing mission planning in geosynchronous Earth orbit, where routing decisions are tightly coupled with time-dependent orbital phasing and strict propellant and mission-duration constraints. We propose a Route-Phasing-Split Genetic Algorithm (RPS-GA) that simultaneously optimizes target sequencing, discrete phasing rotation decisions (i.e., the number of phasing revolutions/waiting cycles), and route partitioning across multiple servicing spacecraft (SSCs). An RPS triplet chromosome encodes route order, phasing rotations, and route splits in a unified structure, enabling split-aware recombination without disrupting feasible multi-servicer route blocks. Feasibility is enforced through a constraint-aware fitness function that ranks feasible solutions based on total $ΔV$, while penalizing propellant and mission-duration violations using aggregate and imbalance penalties. This formulation discourages the concentration of violations on a single SSC. Once a feasible best solution is identified, it is preserved as feasible in subsequent generations, thereby enhancing convergence stability. The framework incorporates split-aware crossover, mutation, and a regret-based Large Neighborhood Search for local intensification. Experiments on representative GEO servicing scenarios demonstrate that RPS-GA produces feasible multi-servicer plans with substantially improved fuel efficiency, reducing total $ΔV$ by $24.5\%$ (from $1956.36\ \mathrm{m/s}$ to $1476.32\ \mathrm{m/s}$) compared with a state-of-the-art LNS-AGA baseline.
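The reported fuel saving can be checked directly from the two totals quoted in the abstract. The short sketch below recomputes the relative reduction; the function name `relative_reduction` is an illustrative choice, not something taken from the paper.

```python
baseline_dv = 1956.36  # m/s, total delta-V of the LNS-AGA baseline (from the abstract)
rps_ga_dv = 1476.32    # m/s, total delta-V of RPS-GA (from the abstract)


def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


reduction = relative_reduction(baseline_dv, rps_ga_dv)
print(f"{reduction:.1%}")  # prints 24.5%
```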
From Singleton Obstacles to Clutter: Translation Invariant Compositional Avoid Sets
This paper studies obstacle avoidance under translation-invariant dynamics using an avoid-side travel-cost Hamilton-Jacobi formulation. For running costs that are zero outside an obstacle and strictly negative inside it, we prove that the value function is non-positive everywhere, equals zero exactly outside the avoid set, and is strictly negative exactly on it. Under translation invariance, this yields a reuse principle: the value of any translated obstacle is obtained by translating a single template value function. We show that the pointwise minimum of translated template values exactly characterizes the union of the translated single-obstacle avoid sets and provides a conservative inner certificate of unavoidable collision in clutter. To reduce conservatism, we introduce a blockwise composition framework in which subsets of obstacles are merged and solved jointly. This yields a hierarchy of conservative certificates from singleton reuse to the exact clutter value, together with monotonicity under block merging and an exactness criterion based on the existence of a common clutter-avoiding control. The framework is illustrated on a Dubins car example in a repeated clutter field.
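The reuse principle can be sketched as a pointwise minimum over translated copies of one template. In the sketch below the template is a hypothetical stand-in (zero outside a unit disc, negative inside) rather than a solved Hamilton-Jacobi value function, and the names `template_value` and `clutter_certificate` are assumptions made for illustration.

```python
import math


def template_value(x, y, radius=1.0):
    """Hypothetical template: zero outside a disc-shaped avoid set, negative inside."""
    return min(0.0, math.hypot(x, y) - radius)


def clutter_certificate(x, y, centers, radius=1.0):
    """Pointwise minimum of the template translated to each obstacle center."""
    return min(template_value(x - cx, y - cy, radius) for cx, cy in centers)


centers = [(0.0, 0.0), (3.0, 0.0)]                      # two translated obstacles
inside_second = clutter_certificate(2.5, 0.0, centers)  # lies in the second avoid set
free_space = clutter_certificate(1.5, 2.0, centers)     # lies in neither avoid set
print(inside_second, free_space)
```

A strictly negative value marks a state inside the union of the single-obstacle avoid sets; a zero value only says that no single obstacle forces a collision, which is why the abstract calls the certificate conservative for clutter.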
DQN Based Joint UAV Trajectory and Association Planning in NTN Assisted Networks
Advanced Air Mobility (AAM) has emerged as a key pillar of next-generation transportation systems, encompassing a wide range of uncrewed aerial vehicle (UAV) applications. To enable AAM, maintaining reliable and efficient communication links between UAVs and control centers is essential. At the same time, the highly dynamic nature of wireless networks, combined with the limited onboard energy of UAVs, makes efficient trajectory planning and network association crucial. Existing terrestrial networks often fail to provide ubiquitous coverage due to frequent handovers and coverage gaps. To address these challenges, geostationary Earth orbit (GEO) satellites offer a promising complementary solution for extending UAV connectivity beyond terrestrial boundaries. This work proposes an integrated GEO-terrestrial network architecture to ensure seamless UAV connectivity. Leveraging artificial intelligence (AI), a deep Q-network (DQN)-based algorithm is developed for joint UAV trajectory and association planning (JUTAP), aiming to minimize energy consumption, handover frequency, and disconnectivity. Simulation results validate the effectiveness of the proposed algorithm within the integrated GEO-terrestrial framework.
Sample-based detectability and moving horizon state estimation of continuous-time systems
In this paper we propose a detectability condition for nonlinear continuous-time systems with irregular/infrequent output measurements, namely a sample-based version of incremental integral input/output-to-state stability (i-iIOSS). We provide a sufficient condition for an i-iIOSS system to be sample-based i-iIOSS. This condition is also exploited to analyze the relationship between sample-based i-iIOSS and sample-based observability for linear systems, such that previously established sampling strategies for linear systems can be used to guarantee sample-based i-iIOSS. Furthermore, we present a sample-based moving horizon estimation scheme, for which robust stability can be shown. Finally, we illustrate the applicability of the proposed estimation scheme through a biomedical simulation example.
End-to-End Differentiable Predictive Control with Guaranteed Constraint Satisfaction and Feasibility for Building Demand Response
The high energy consumption of buildings presents a critical need for advanced control strategies like Demand Response (DR). Differentiable Predictive Control (DPC) has emerged as a promising method for learning explicit control policies, yet conventional DPC frameworks are hindered by three key limitations: the use of simplistic dynamics models with limited expressiveness, a decoupled training paradigm that fails to optimize for closed-loop performance, and a lack of practical safety guarantees under realistic assumptions. To address these shortcomings, this paper proposes a novel End-to-End Differentiable Predictive Control (E2E-DPC) framework. Our approach utilizes an Encoder-Only Transformer to model the complex system dynamics and employs a unified, performance-oriented loss to jointly train the model and the control policy. Crucially, we introduce an online tube-based constraint tightening method that provides theoretical guarantees for recursive feasibility and constraint satisfaction without requiring complex offline computation of terminal sets. The framework is validated in a high-fidelity EnergyPlus simulation, controlling a multi-zone building for a DR task. The results demonstrate that the proposed method with guarantees achieves near-perfect constraint satisfaction - a reduction of over 99% in violations compared to the baseline - at the cost of only a minor increase in electricity expenditure. This work provides a deployable, performance-driven control solution for building energy management and establishes a new pathway for developing verifiable learning-based control systems under milder assumptions.
comment: 15 pages, 4 figures
Input Convex Encoder-Only Transformer for Fast and Gradient-Stable MPC in Building Demand Response
Learning-based Model Predictive Control (MPC) has emerged as a powerful strategy for building demand response. However, its practical deployment is often hindered by the non-convex optimization problems induced by standard neural network models. These problems lead to long solver times and suboptimal solutions, making real-time control over long horizons challenging. While Input Convex Neural Networks (ICNNs), such as Input-Convex Long Short-Term Memory networks (IC-LSTMs), have been developed to address the convexity issue, their recurrent architectures suffer from high computational cost and gradient instability as the prediction horizon increases. To overcome these limitations, this paper introduces the Input-Convex Encoder-only Transformer (IC-EoT), a novel architecture that combines the parallel processing capabilities of the Transformer with the guaranteed tractability of input convexity. The IC-EoT was developed and evaluated in a high-fidelity co-simulation framework using the Energym Python library to interface with the EnergyPlus building simulator, and compared against its recurrent convex counterpart (IC-LSTM) and standard non-convex models. The results demonstrate that the IC-EoT is structurally immune to the gradient instability that affects recurrent ICNNs while maintaining comparable predictive accuracy. More critically, it substantially reduces MPC solver times; this speed advantage grows with the prediction horizon, with the IC-EoT proving 2.7 to 8.3 times faster than the IC-LSTM across horizons spanning from one to eight hours. This gain in computational efficiency makes the IC-EoT a robust and practical solution, enabling effective, real-time MPC for building energy management over realistic decision-making horizons.
comment: 15 pages, 11 figures
BOOST-RPF: Boosted Sequential Trees for Radial Power Flow
Accurate power flow analysis is critical for modern distribution systems, yet classical solvers face scalability issues, and current machine learning models often struggle with generalization. We introduce BOOST-RPF, a novel method that reformulates voltage prediction from a global graph regression task into a sequential path-based learning problem. By decomposing radial networks into root-to-leaf paths, we leverage gradient-boosted decision trees (XGBoost) to model local voltage-drop regularities. We evaluate three architectural variants: Absolute Voltage, Parent Residual, and Physics-Informed Residual. This approach aligns the model architecture with the recursive physics of power flow, ensuring size-agnostic application and superior out-of-distribution robustness. Benchmarked against the Kerber Dorfnetz grid and the ENGAGE suite, BOOST-RPF achieves state-of-the-art results with its Parent Residual variant, which consistently outperforms both analytical and neural baselines in standard accuracy and generalization tasks. While global Multi-Layer Perceptrons (MLPs) and Graph Neural Networks (GNNs) often suffer from performance degradation under topological shifts, BOOST-RPF maintains high precision across unseen feeders. Furthermore, the framework displays linear $O(N)$ computational scaling and significantly increased sample efficiency through per-edge supervision, offering a scalable and generalizable alternative for real-time distribution system operator (DSO) applications.
Interaction-Aware Predictive Environmental Control Barrier Function for Emergency Lane Change
Safety-critical motion planning in mixed traffic remains challenging for autonomous vehicles, especially when it involves interactions between the ego vehicle (EV) and surrounding vehicles (SVs). In dense traffic, the feasibility of a lane change depends strongly on how SVs respond to the EV motion. This paper presents an interaction-aware safety framework that incorporates such interactions into a control barrier function (CBF)-based safety assessment. The proposed method predicts near-future vehicle positions over a finite horizon, thereby capturing reactive SV behavior and embedding it into the CBF-based safety constraint. To address uncertainty in the SV response model, a robust extension is developed by treating the model mismatch as a bounded disturbance and incorporating an online uncertainty estimate into the barrier condition. Compared with classical environmental CBF methods that neglect SV reactions, the proposed approach provides a less conservative and more informative safety representation for interactive traffic scenarios, while improving robustness to uncertainty in the modeled SV behavior.
comment: 7 pages, 3 figures, submitted to 2026 CDC-L-CSS combined submission
Performance Analysis of Tri-Sector Reflector Antennas for HAPS-Based Cellular Networks
The increasing demand for ubiquitous, high-capacity mobile connectivity has driven cellular systems to explore beyond-terrestrial deployments. In this paper, we present a system-level performance evaluation of a fifth-generation (5G) non-terrestrial network (NTN) enabled by high-altitude platform station (HAPS)-based base stations (BSs) equipped with tri-sectoral reflector antennas, compared against fourth-generation (4G) terrestrial network (TN) and 5G TN deployments in a multicell dense urban environment. Using simulation results comprising the average effective downlink signal-to-interference-plus-noise ratio (SINR) and the average user throughput, along with the subsequent interference analysis, we demonstrate that the reflector-based HAPS architecture is primarily constrained by inter-cell interference, while the combination of reflector configuration and deployment altitude represents a key design parameter.
Collision-Free Velocity Scheduling for Multi-Agent Systems on Predefined Routes via Inexact-Projection ADMM
In structured multi-agent transportation systems, agents often must follow predefined routes, making spatial rerouting undesirable or impossible. This paper addresses route-constrained multi-agent coordination by optimizing waypoint passage times while preserving each agent's assigned waypoint order and nominal route assignment. A differentiable surrogate trajectory model maps waypoint timings to smooth position profiles and captures first-order tracking lag, enabling pairwise safety to be encoded through distance-based penalties evaluated on a dense temporal grid spanning the mission horizon. The resulting nonlinear and nonconvex velocity-scheduling problem is solved using an inexact-projection Alternating Direction Method of Multipliers (ADMM) algorithm that combines structured timing updates with gradient-based collision-correction steps and avoids explicit integer sequencing variables. Numerical experiments on random-crossing, bottleneck, and graph-based network scenarios show that the proposed method computes feasible and time-efficient schedules across a range of congestion levels and yields shorter mission completion times than a representative hierarchical baseline in the tested bottleneck cases.
Ctrl-A: Control-Driven Online Data Augmentation
We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.
comment: 17 pages (11 pages main manuscript), 8 figures (5 in main manuscript)
Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control
Attention mechanisms excel at learning sequential patterns by discriminating data based on relevance and importance. This provides state-of-the-art performance in advanced generative artificial intelligence models. This paper applies this concept of an attention mechanism for multi-agent safe control. We specifically consider the design of a neural network to control autonomous vehicles in a highway merging scenario. The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Within a QMIX framework, we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles. Moreover, we propose a comprehensive reward signal that considers the global objectives of the environment (e.g., safety and vehicle flow) and the individual interests of each agent. Simulations are conducted in the Simulation of Urban Mobility (SUMO). The results show better performance compared to other driving algorithms in terms of safety, driving speed, and reward.
comment: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
LSAI: A Large Small AI Model Codesign Framework for Agentic Robot Scenarios
The development of Artificial Intelligence (AI) has made agentic robots an appealing paradigm for various applications, such as search and rescue in complex environments. In this context, the next wireless communication technology facilitates robot cooperation for efficient environment sensing and exploration. However, traditional AI solutions cannot always provide reasonable resource utilization decisions, which makes it challenging to achieve both accurate and low-latency search and rescue. To address this issue, we propose LSAI, a large small AI model codesign framework that achieves highly accurate and real-time robot cooperation through deep interaction between a large AI (LAI) model and a small AI model. We first propose an attention-based model aggregation for LAI construction, which assists agentic robots in accurately sensing physical environments. Next, we design an adaptive model splitting and update algorithm to enable the robots to perform accurate path planning for high-efficiency environment sensing with low energy consumption. Finally, we demonstrate the effectiveness of our proposed LSAI framework. The simulation results indicate that our solution achieves sensing accuracy of up to 20.4% while reducing sensing cooperation latency by an average of 17.9% compared to traditional AI solutions.
comment: 7 pages
Simple Trajectory Smoothing for UAV Reference Path Planning Based on Decoupling, Spatial Modeling and Linear Programming
A method for trajectory smoothing for UAV reference path planning is presented. It is derived from the dynamics of a Dubins airplane model and involves a decoupling step, spatial modeling, and linear programming. The decoupling step enables algebraic control laws for flight-path angle and speed control. An optimization step, involving the solution of a small linear program, is applied only for roll angle control. Two variations are discussed, which differ in reference centerline tracking and in the introduction of a path shaping constraint. The benefit of natural dimensionality reduction for spatial modeling is discussed, and the simplicity of the overall method is highlighted. An extension to acrobatic flight is outlined, which comes at the cost of a model approximation but maintains the general model structure. An extension of the method to tractor path planning along 3D terrain is also discussed. The method is validated in simulations.
comment: 7 pages, 6 figures
Full Timescale Hierarchical MPC-MTIP Framework for Hybrid Energy Storage Management in Low-Carbon Industrial Microgrid
Uncertainties in balancing generation and load in low-carbon industrial microgrids (IMGs) make hybrid energy storage systems (HESS) crucial for their stable and economic operation. Existing model predictive control (MPC) techniques typically enforce periodic state of charge (SOC) constraints to maintain long-term stability. However, these hard constraints compromise dispatch flexibility near the end of the prediction horizon, preventing sufficient energy release during critical peaks and leading to optimization infeasibility. This paper eliminates the periodic SOC constraints of individual storage units and proposes a novel full-timescale hierarchical MPC scheduling framework. Specifically, comprehensive physical and cost models are established for the HESS composed of flywheel, battery, compressed-air, and hydrogen-methanol energy storage. The control problem is decoupled into a hierarchical MPC architecture. Furthermore, a novel adaptive feedback mechanism based on micro trajectory inverse projection (MTIP) is embedded into the scheduling process, accurately mapping the high-frequency dynamic buffering capabilities of lower-tier storages into the upper decision space to generate dynamic boundaries. Experiments using 14 consecutive months of second-level data from a real-world IMG validate the effectiveness of the proposed method, demonstrating its significant superiority over existing approaches. By effectively preventing limit violations and deadlocks in lower-tier storages under extreme fluctuations, it achieves a 97.4% net load smoothing rate and a 62.2% comprehensive cycle efficiency.
comment: 10 pages, 12 figures, Journal
RTD-RAX: Fast, Safe Trajectory Planning for Systems under Unknown Disturbances
Reachability-based Trajectory Design (RTD) is a provably safe, real-time trajectory planning framework that combines offline reachable-set computation with online trajectory optimization. However, standard RTD implementations suffer from two key limitations: conservatism induced by worst-case reachable-set overapproximations, and an inability to account for real-time disturbances during execution. This paper presents RTD-RAX, a runtime-assurance extension of RTD that utilizes a non-conservative RTD formulation to rapidly generate goal-directed candidate trajectories, and utilizes mixed monotone reachability for fast, disturbance-aware online safety certification. When proposed trajectories fail safety certification under real-time uncertainty, a repair procedure finds nearby safe trajectories that preserve progress toward the goal while guaranteeing safety under real-time disturbances.
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs' relay communications. The UAVs' intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs' trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio-temporal attention-based prediction approach to recover the lost information and enhance each UAV's awareness of the network state. These two designs are envisioned to enhance the network capacity in UAV-assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs' information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs' information exchange, thus fostering practical deployment of MADRL in UAV-assisted wireless networks.
Conformal Koopman for Embedded Nonlinear Control with Statistical Robustness: Theory and Real-World Validation ICRA
We propose a fully data-driven, Koopman-based framework for statistically robust control of discrete-time nonlinear systems with linear embeddings. Establishing a connection between the Koopman operator and contraction theory, it offers distribution-free probabilistic bounds on the state tracking error under Koopman modeling uncertainty. Conformal prediction is employed here to rigorously derive a bound on the state-dependent modeling uncertainty throughout the trajectory, ensuring safety and robustness without assuming a specific error prediction structure or distribution. Unlike prior approaches that merely combine conformal prediction with Koopman-based control in an open-loop setting, our method establishes a closed-loop control architecture with formal guarantees that explicitly account for both forward and inverse modeling errors. Also, by expressing the tracking error bound in terms of the control parameters and the modeling errors, our framework offers a quantitative means to formally enhance the performance of arbitrary Koopman-based control. We validate our method both in numerical simulations with the Dubins car and in real-world experiments with a highly nonlinear flapping-wing drone. The results demonstrate that our method indeed provides formal safety guarantees while maintaining accurate tracking performance under Koopman modeling uncertainty.
comment: 8 pages, 6 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA). The final published version will be available via IEEE Xplore
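As a rough illustration of how conformal prediction yields a distribution-free bound on modeling error, the sketch below implements the standard split-conformal quantile. It is not the paper's state-dependent construction: the calibration errors are made-up numbers, the function name `conformal_bound` is hypothetical, and exchangeability of calibration and test errors is assumed.

```python
import math


def conformal_bound(calibration_errors, alpha):
    """Split-conformal bound: a new exchangeable error falls at or below the
    returned value with probability at least 1 - alpha."""
    n = len(calibration_errors)
    if n == 0:
        raise ValueError("need at least one calibration error")
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf                   # too few samples for this alpha
    return sorted(calibration_errors)[k - 1]


# Hypothetical one-step Koopman prediction errors on held-out calibration data.
errors = [0.05, 0.12, 0.08, 0.20, 0.03, 0.15, 0.10, 0.07, 0.09]
bound = conformal_bound(errors, alpha=0.2)
print(bound)
```

A bound of this kind could then enter a tracking-error analysis as the modeling-uncertainty term, which is the role the abstract assigns to conformal prediction.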
Auction-Based Task Allocation with Energy-Conscientious Trajectory Optimization for AMR Fleets
This paper presents a hierarchical two-stage framework for multi-robot task allocation and trajectory optimization in asymmetric task spaces: (1) a sequential auction allocates tasks using closed-form bid functions, and (2) each robot independently solves an optimal control problem for energy-minimal trajectories with a physics-based battery model, followed by a collision avoidance refinement step using pairwise proximity penalties. Event-triggered warm-start rescheduling with bounded trigger frequency handles robot faults, priority arrivals, and energy deviations. Across 505 scenarios with 2-20 robots and up to 100 tasks on three factory layouts, both energy- and distance-based auction variants achieve 11.8% average energy savings over nearest-task allocation, with rescheduling latency under 10 ms. The central finding is that bid-metric performance is regime-dependent: in uniform workspaces, distance bids outperform energy bids by 3.5% (p < 0.05, Wilcoxon) because a 15.7% closed-form approximation error degrades bid ranking accuracy to 87%; however, when workspace friction heterogeneity is sufficient (r < 0.85 energy-distance correlation), a zone-aware energy bid outperforms distance bids by 2-2.4%. These results provide practitioner guidance: use distance bids in near-uniform terrain and energy-aware bids when friction variation is significant.
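The first stage can be sketched as a sequential single-item auction: in each round every robot bids on every unallocated task, and the lowest bid wins. The paper's closed-form energy and distance bid functions are not given in the abstract, so the sketch below substitutes plain Euclidean distance from the robot's last assigned position; the robot and task data are hypothetical.

```python
import math


def distance_bid(robot_pos, task_pos):
    """Stand-in bid: straight-line distance to the task (an assumption)."""
    return math.dist(robot_pos, task_pos)


def sequential_auction(robots, tasks, bid=distance_bid):
    """Allocate one task per round to the robot with the lowest bid."""
    positions = dict(robots)                # each robot's current end position
    allocation = {name: [] for name in robots}
    remaining = dict(tasks)
    while remaining:
        # Ties are broken by robot name, then task name (tuple ordering).
        _, winner, task = min(
            (bid(positions[r], p), r, t)
            for r in positions
            for t, p in remaining.items()
        )
        allocation[winner].append(task)
        positions[winner] = remaining.pop(task)  # winner ends at the task
    return allocation


robots = {"r1": (0.0, 0.0), "r2": (10.0, 0.0)}
tasks = {"a": (1.0, 0.0), "b": (9.0, 0.0), "c": (2.0, 0.0)}
allocation = sequential_auction(robots, tasks)
print(allocation)
```

Swapping `distance_bid` for an energy-based bid only changes the `bid` argument, which mirrors the abstract's comparison between distance- and energy-based auction variants.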
IF-CPS: Influence Functions for Cyber-Physical Systems -- A Unified Framework for Diagnosis, Curation, and Safety Attribution
Neural network controllers trained via behavior cloning are increasingly deployed in cyber-physical systems (CPS), yet practitioners lack tools to trace controller failures back to training data. Existing data attribution methods assume i.i.d. data and standard loss targets, ignoring CPS-specific properties: closed-loop dynamics, safety constraints, and temporal trajectory structure. We propose IF-CPS, a modular influence function framework with three CPS-adapted variants: safety influence (attributing constraint violations), trajectory influence (temporal discounting over trajectories), and propagated influence (tracing effects through plant dynamics). We evaluate IF-CPS on six benchmarks across diagnosis, curation, and safety attribution tasks. IF-CPS improves over standard influence functions in the majority of settings, achieving AUROC $1.00$ in Pendulum (5-10% poisoning), $0.92$ vs. $0.50$ in HVAC (10%), and the strongest constraint-boundary correlation (Spearman $ρ = 0.55$ in Pendulum).
Stochastic Trajectory Influence Functions for LQR: Joint Sensitivity Through Dynamics and Noise Covariance
Model-based controllers learned from data have the biases and noise of their training trajectories, making it important to know which trajectories help or hurt closed-loop performance. Influence functions, widely used in machine learning for data attribution, approximate this effect through first-order parameter-shift surrogates, avoiding costly retraining. Applying them to stochastic LQR, however, is nontrivial because the cost depends on the learned dynamics through the Riccati equation, and the process-noise covariance is estimated from the same residuals. We develop a three-level influence hierarchy that accounts for both channels.
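To make the two channels concrete, the sketch below uses a scalar, discrete-time, infinite-horizon LQR as an assumed setting (the abstract does not fix one). The certainty-equivalent average cost is $P \cdot w$, where $P$ solves the Riccati equation for the learned dynamics $(a, b)$ and $w$ is the estimated process-noise variance. All numbers and function names are illustrative.

```python
def riccati_fixed_point(a, b, q, r, iters=500):
    """Iterate the scalar discrete-time Riccati recursion to its fixed point."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return p


def average_cost(a_hat, b_hat, w_hat, q=1.0, r=1.0):
    """Certainty-equivalent average LQR cost: Riccati solution times noise variance."""
    return riccati_fixed_point(a_hat, b_hat, q, r) * w_hat


base = average_cost(0.9, 1.0, 1.0)
via_dynamics = average_cost(0.95, 1.0, 1.0) - base  # effect through the Riccati equation
via_noise = average_cost(0.9, 1.0, 1.1) - base      # effect through the noise estimate
print(base, via_dynamics, via_noise)
```

A training trajectory that shifts both the dynamics estimate and the residual-based noise estimate moves the cost through both terms at once, which is why the abstract treats the sensitivity as joint.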
Evaluating Power Flow Manifold from Local Data around a Single Operating Point via Geodesics
The widespread adoption of renewable energy poses a challenge in maintaining a feasible operating point in highly variable scenarios. This paper demonstrates that, within a feasible region of a power system that meets practical stability requirements, the power flow equations define a smooth bijection between nodal voltage phasors (angle and magnitude) and nodal active/reactive power injections. Based on this theoretical foundation, this paper proposes a data-based power flow evaluation method that can imply the associated power flow manifold from a limited number of data points around a single operating point. Using techniques from differential geometry and analytic functions, we represent geodesic curves in the associated power flow manifold as analytic functions at the initial point. Then, a special algebraic structure of the power flow problem is revealed and applied to reduce the computation of all higher-order partial derivatives to that of the first-order ones. Integrating these techniques yields the proposed data-based evaluation method, suggesting that a small number of local measurements around a single operating point is sufficient to imply the entire associated power flow manifold. Numerical cases with arbitrary directional variations are tested, certifying the efficacy of the proposed method.
comment: 10 pages, 11 figures, submitted to IEEE Transactions on Power Systems
Emission reduction potential of freeway stop-and-go wave smoothing
Real-world potential of stop-and-go wave smoothing at scale remains largely unquantified. Smoothing freeway traffic waves requires creating a gap so the wave can dissipate, but the gap suggested is often too large and impractical. We propose a counterfactual wave smoothing benchmark that reconstructs a smooth and feasible trajectory from each empirical trajectory by solving a quadratic program with fixed boundary conditions and a maximum allowable gap constraint. We estimate the emission reduction potential from trajectories using the MOVES model. Applying the framework to nine weeks of weekday peak traffic data, featuring rich day-to-day stop-and-go wave dynamics, from the I-24 MOTION testbed, we find meaningful reduction potential under a 0.1-mile maximum gap: average CO2 reductions of 7.92% to 12.04% across lanes, with concurrent reductions of 14.30% to 28.91% CO, 23.15% to 29.42% HC, and 24.37% to 30.98% NOx. Our analysis also quantifies the trade-off between maximum allowable gap opening and emissions benefits.
A Model Predictive Control Approach to Dual-Axis Agrivoltaic Panel Tracking
Agrivoltaic systems--photovoltaic (PV) panels installed above agricultural land--have emerged as a promising dual-use solution to address competing land demands for food and energy production. In this paper, we propose a model predictive control (MPC) approach to dual-axis agrivoltaic panel tracking control that dynamically adjusts panel positions in real time to maximize power production and crop yield given solar irradiance and ambient temperature measurements. We apply convex relaxations and shading factor approximations to reformulate the MPC optimization problem as a convex second-order cone program that determines the PV panel position adjustments away from the sun-tracking trajectory. Through case studies, we demonstrate our approach, exploring the Pareto front between i) an approach that maximizes power production without considering crop needs and ii) crop yield with no agrivoltaics. We also conduct a case study exploring the impact of forecast error on MPC performance. We find that dynamically adjusting agrivoltaic panel position helps us actively manage the trade-offs between power production and crop yield, and that active panel control enables the agrivoltaic system to achieve land equivalent ratio values of up to 1.897.
comment: 10 pages
L2O-CCG: Adversarial Learning with Set Generalization for Adaptive Robust Optimization
The adversarial subproblem in two-stage adaptive robust optimization (ARO), which identifies the worst-case uncertainty realization, is a major computational bottleneck. This difficulty is exacerbated when the recourse value function is non-concave and the uncertainty set shifts across applications. Existing approaches typically exploit specific structural assumptions on the value function or the uncertainty set geometry to reformulate this subproblem, but degrade when these assumptions are violated or the geometry changes at deployment. To address this challenge, we propose L2O-CCG, a bi-level framework that enables the integration of structure-aware adversarial solvers within the constraint-and-column generation (CCG) algorithm. As one instantiation, we develop a generalizable adversarial learning method, which replaces solver-based adversarial search with a learned proximal gradient optimizer that can generalize across uncertainty set geometries without retraining. Here, an inner-level neural network approximates the recourse value function from offline data, while an outer-level pre-trained mapping generates iteration-dependent step sizes for a proximal gradient scheme. We also establish out-of-distribution convergence bounds under uncertainty set parameter shifts, showing how the trajectory deviation of the learned optimizer is bounded by the uncertainty set shift. We illustrate performance of the L2O-CCG method on a building HVAC management task.
Parallel OctoMapping: A Scalable Framework for Enhanced Path Planning in Autonomous Navigation
Mapping is essential in robotics and autonomous systems because it provides the spatial foundation for path planning. Efficient mapping enables planning algorithms to generate reliable paths while ensuring safety and adapting in real time to complex environments. Fixed-resolution mapping methods often produce overly conservative obstacle representations that lead to suboptimal paths or planning failures in cluttered scenes. To address this issue, we introduce Parallel OctoMapping (POMP), an efficient OctoMap-based mapping technique that maximizes available free space and supports multi-threaded computation. To the best of our knowledge, POMP is the first method that, at a fixed occupancy-grid resolution, refines the representation of free space while preserving map fidelity and compatibility with existing search-based planners. It can therefore be integrated into existing planning pipelines, yielding higher pathfinding success rates and shorter path lengths, especially in cluttered environments, while substantially improving computational efficiency.
Stability-Preserving Online Adaptation of Neural Closed-loop Maps
The growing complexity of modern control tasks calls for controllers that can react online as objectives and disturbances change, while preserving closed-loop stability. Recent approaches for improving the performance of nonlinear systems while preserving closed-loop stability rely on time-invariant recurrent neural-network controllers, but offer no principled way to update the controller during operation. Most importantly, switching from one stabilizing policy to another can itself destabilize the closed-loop. We address this problem by introducing a stability-preserving update mechanism for nonlinear, neural-network-based controllers. Each controller is modeled as a causal operator with bounded $\ell_p$-gain, and we derive gain-based conditions under which the controller may be updated online. These conditions yield two practical update schemes, time-scheduled and state-triggered, that guarantee the closed-loop remains $\ell_p$-stable after any number of updates. Our analysis further shows that stability is decoupled from controller optimality, allowing approximate or early-stopped controller synthesis. We demonstrate the approach on nonlinear systems with time-varying objectives and disturbances, and show consistent performance improvements over static and naive online baselines while guaranteeing stability.
Data-Driven Synthesis of Robust Positively Invariant Sets from Noisy Data
This paper develops a method to construct robust positively invariant (RPI) tube sets from finite noisy input-state data of an unknown linear time-invariant (LTI) system, yielding tubes that can be directly embedded in tube-based robust data-driven predictive control. Data-consistency uncertainty sets are constructed under process/measurement noise with polytopic/ellipsoidal bounds. In the measurement-noise case, we provide a deterministic and data-consistent procedure to certify the induced residual bound from data. Based on these sets, a robustly stabilizing state-feedback gain is certified via a common quadratic contraction, which in turn enables constructive polyhedral/ellipsoidal RPI tube computation. Numerical examples quantify the conservatism induced by noisy data and the employed certification step.
comment: 8 pages, 2 figures
Finite-time Convergent Control Barrier Functions with Feasibility Guarantees
This paper studies the problem of finite-time convergence to a prescribed safe set for nonlinear systems whose initial states violate the safety constraints. Existing Control Lyapunov-Barrier Functions (CLBFs) can enforce recovery to the safe set but may suffer from the issue of chattering and they do not explicitly consider control bounds. To address these limitations, we propose a new Control Barrier Function (CBF) formulation that guarantees finite-time convergence to the safe set while ensuring feasibility under control constraints. Specifically, we strengthen the initially violated safety constraint by introducing a parameter which enables the exploitation of the asymptotic property of a CBF to converge to the safe set in finite time. Furthermore, the conditions for the existence of such a CBF under control bounds to achieve finite-time convergence are derived via reachability analysis and constraint comparison, providing a systematic approach for parameter design. A case study on 2D obstacle avoidance is presented to demonstrate the effectiveness and advantages of the proposed method.
Semi-Infinite Programming for Collision-Avoidance in Optimal and Model Predictive Control
This paper presents a novel approach for collision avoidance in optimal and model predictive control, in which the environment is represented by a large number of points and the robot as a union of padded polygons. The conditions that none of the points shall collide with the robot can be written in terms of an infinite number of constraints per obstacle point. We show that the resulting semi-infinite programming (SIP) optimal control problem (OCP) can be efficiently tackled through a combination of two methods: local reduction and an external active-set method. Specifically, this involves iteratively identifying the closest point obstacles, determining the lower-level distance minimizer among all feasible robot shape parameters, and solving the upper-level finitely-constrained subproblems. In addition, this paper addresses robust collision avoidance in the presence of ellipsoidal state uncertainties. Enforcing constraint satisfaction over all possible uncertainty realizations extends the dimension of constraint infiniteness. The infinitely many constraints arising from translational uncertainty are handled by local reduction together with the robot shape parameterization, while rotational uncertainty is addressed via a backoff reformulation. A controller implemented based on the proposed method is demonstrated on a real-world robot running at 20Hz, enabling fast and collision-free navigation in tight spaces. An application to 3D collision avoidance is also demonstrated in simulation.
comment: 20 pages, 17 figures
Robust Dynamic Pricing and Admission Control with Fairness Guarantees
Dynamic pricing is commonly used to regulate congestion in shared service systems. This paper is motivated by the fact that in the presence of users with varying price sensitivity (responsiveness), conventional monotonic pricing can lead to unfair outcomes by disproportionately excluding price-elastic users, particularly under high or uncertain demand. We therefore develop a fairness-oriented mechanism under demand uncertainty. The paper's contributions are twofold. First, we show that when fairness is imposed as a hard state constraint, the optimal (revenue maximizing) pricing policy is generally non-monotonic in demand. This structural result departs fundamentally from standard surge pricing rules and reveals that price reduction under heavy load may be necessary to maintain equitable access. Second, we address the problem that price elasticity among heterogeneous users is unobservable. To solve it, we develop a robust dynamic pricing and admission control framework that enforces capacity and fairness constraints for all user type distributions consistent with aggregate measurements. By integrating integral High Order Control Barrier Functions (iHOCBFs) with a robust optimization framework under uncertain user-type distribution, we obtain a controller that guarantees forward invariance of safety and fairness constraints while optimizing revenue. Numerical experiments demonstrate improved fairness and revenue performance relative to monotonic surge pricing policies.
The Battle of the Water Futures
The highly anticipated 'Battle of the Water Networks' is back with a new challenge for the water community. This competition will be hosted at the 4th International Joint Conference on Water Distribution Systems Analysis and Computing and Control in the Water Industry (WDSA/CCWI 2026), taking place in Paphos, Cyprus, from May 18-21, 2026. This competition embodies the core mission of Water-Futures and the theme for WDSA/CCWI 2026: "Designing the next generation of urban water (and wastewater) systems." The objective is to design and operate a water distribution system over a long-term horizon under deep uncertainty, with interventions applied in stages. For the first time, this challenge features a staged-design approach, unobservable and unknown uncertainties, and incorporates elements of policymaking and artificial intelligence. The solutions will be assessed using a transparent and inspectable open-source evaluation framework.
On the Impact of Voltage Unbalance on Distribution Locational Marginal Prices
Finding clear economic signals for distribution-network operation and expansion is increasingly important as single-phase loads and distributed energy resources escalate. These devices create phase-to-phase imbalances that manifest as voltage unbalance, a power quality issue that accelerates insulation aging in machines and increases network losses, thereby raising costs for operators and consumers. Traditional grid codes address unbalance via disparate hard limits on various indices, with thresholds that differ across standards, offer no dynamic economic incentive, and undermine optimality. This paper proposes instead to treat voltage unbalance as a 'soft limit' by adding penalty terms to grid operation costs within a three-phase optimal power flow to reflect the cost of the reduced asset lifetime caused by voltage unbalance. This unified approach yields dynamic economic signals, namely unbalance-aware Distribution Locational Marginal Prices (DLMP), that reflect the cost of power quality deviations. A novel mathematical decomposition of DLMP is developed, isolating the energy, loss, congestion, and unbalance components. Case studies conducted on two benchmark networks demonstrate the effectiveness and practical value of the proposed method. The results indicate that unbalance penalties reshape nodal prices, produce unexpected phase-level effects, and even allow scenarios where added load reduces unbalance and lowers costs, while providing planners and market designers with actionable insights to balance investment, operation, and power quality in modern distribution systems.
From 2D to 3D terrain-following area coverage path planning
An algorithm for 3D terrain-following area coverage path planning is presented. Multiple adjacent paths are generated that are (i) locally apart from each other by a distance equal to the working width of a machinery, while (ii) simultaneously floating at a projection distance equal to a specific working height above the terrain. The complexities of the algorithm in comparison to its 2D equivalent are highlighted. These include uniformly spaced elevation data generation using an Inverse Distance Weighting-approach and a local search. Area coverage path planning results for real-world 3D data within an agricultural context are presented to validate the algorithm.
comment: 6 pages, 10 figures, 1 table, IEEE ICARSC 2026
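The Inverse Distance Weighting step mentioned in the abstract above can be illustrated with the textbook form of the method. This is a sketch under assumptions, not the paper's implementation: the exponent `power=2.0`, the use of all samples rather than a local neighbourhood, and the names `idw_elevation` and `elevation_grid` are all illustrative choices.

```python
import math


def idw_elevation(query, samples, power=2.0):
    """Inverse Distance Weighting estimate of elevation at `query`.

    query:   (x, y) location where an elevation value is wanted
    samples: list of (x, y, z) scattered elevation measurements
    power:   distance exponent (2.0 is a common default, assumed here)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, z in samples:
        d = math.hypot(query[0] - x, query[1] - y)
        if d == 0.0:
            return z              # query coincides with a sample point
        w = 1.0 / d ** power      # closer samples get larger weights
        weighted_sum += w * z
        weight_total += w
    return weighted_sum / weight_total


def elevation_grid(samples, nx, ny, step):
    """Resample scattered data onto a uniformly spaced nx-by-ny grid."""
    return [
        [idw_elevation((i * step, j * step), samples) for i in range(nx)]
        for j in range(ny)
    ]


if __name__ == "__main__":
    samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
    print(idw_elevation((1.0, 0.0), samples))   # midpoint between the samples
    print(elevation_grid(samples, 3, 1, 1.0))
```

A uniformly spaced grid of this kind gives each planned path point a terrain height to float above, which is the role the abstract assigns to the generated elevation data.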
A Systematic Comparison and Evaluation of Building Ontologies for Deploying Data-Driven Analytics in Smart Buildings
Ontologies play a critical role in data exchange, information integration, and knowledge sharing across diverse smart building applications. Yet, semantic differences between the prevailing building ontologies hamper their purpose of bringing data interoperability and restrict the ability to reuse building ontologies in real-world applications. In this paper, we propose and adopt a framework to conduct a systematic comparison and evaluation of four popular building ontologies (Brick Schema, RealEstateCore, Project Haystack and Google's Digital Buildings) from both axiomatic design and assertions in a use case, namely the Terminological Box (TBox) evaluation and the Assertion Box (ABox) evaluation. In the TBox evaluation, we use the SQuaRE-based Ontology Quality Evaluation (OQuaRE) Framework and conclude that Project Haystack and Brick Schema are more compact with respect to the ontology axiomatic design. In the ABox evaluation, we apply an empirical study with sample building data that suggests that Brick Schema and RealEstateCore have greater completeness and expressiveness in capturing the main concepts and relations within the building domain. The results implicitly indicate that there is no universal building ontology for integrating Linked Building Data (LBD). We also discuss ontology compatibility and investigate building ontology design patterns (ODPs) to support ontology matching, alignment, and harmonisation.
comment: 32 pages
Discontinuous integro-differential equations and sliding mode control
The paper deals with analysis and design of sliding mode control systems modeled by finite-dimensional integro-differential equations. Filippov method and equivalent control approach are extended to a class of nonlinear discontinuous integro-differential equations and to a class of control systems modeled by infinite-dimensional differential equations in Banach spaces. Sliding mode control algorithms are designed for distributed input delay systems and for a heat control system.
Data-Driven Resilience Assessment against Sparse Sensor Attacks
We develop a data-driven framework for assessing the resilience of linear time-invariant systems against malicious false-data-injection sensor attacks. Leveraging sparse observability, we propose data-driven resilience metrics and derive necessary and sufficient conditions for two data-availability scenarios. For attack-free data, we show that when a rank condition holds, the resilience level can be computed exactly from the data alone, without prior knowledge of the system parameters. We then extend the analysis to the case where only poisoned data are available and show that the resulting assessment is necessarily conservative. For both scenarios, we provide algorithms for computing the proposed metrics and show that they can be computed in polynomial time under an additional spectral condition. A numerical example illustrates the efficacy and limitations of the proposed framework.
comment: Accepted to ACC 2026
Sample-based Moving Horizon Estimation
In this paper, we propose a sample-based moving horizon estimation (MHE) scheme for general nonlinear systems to estimate the current system state using irregularly and/or infrequently available measurements. The cost function of the MHE optimization problem is suitably designed to accommodate these irregular output sequences. We also establish that, under a suitable sample-based detectability condition known as sample-based incremental input/output-to-state stability (i-IOSS), the proposed sample-based MHE achieves robust global exponential stability (RGES). Additionally, for the case of linear systems, we draw connections between sample-based observability and sample-based i-IOSS. This demonstrates that previously established conditions for linear systems to be sample-based observable can be utilized to verify or design sampling strategies that satisfy the conditions to guarantee RGES of the sample-based MHE. Finally, the effectiveness of the proposed sample-based MHE is illustrated through a simulation example.
comment: accepted for presentation at the 24th European Control Conference (ECC), extended online version
Robust reduced-order model predictive control using peak-to-peak analysis of filtered signals
We address the design of a model predictive control (MPC) scheme for large-scale linear systems using reduced-order models (ROMs). Our approach uses a ROM, leverages tools from robust control, and integrates them into an MPC framework to achieve computational tractability with robust constraint satisfaction. Our key contribution is a method to obtain guaranteed bounds on the predicted outputs of the full-order system by predicting a (scalar) error-bounding system alongside the ROM. This bound is then used to formulate a robust ROM-based MPC that guarantees constraint satisfaction and robust performance. Our method is developed step-by-step by (i) analysing the error, (ii) bounding the peak-to-peak gain, and (iii) using filtered signals. We demonstrate our method on a 100-dimensional mass-spring-damper system, achieving over four orders of magnitude reduction in conservatism relative to existing approaches.
comment: Accepted to the European Control Conference 2026
Towards a Practical Understanding of Lagrangian Methods in Safe Reinforcement Learning
Safe reinforcement learning addresses constrained optimization problems where maximizing performance must be balanced against safety constraints, and Lagrangian methods are a widely used approach for this purpose. However, the effectiveness of Lagrangian methods depends crucially on the choice of the Lagrange multiplier $λ$, which governs the multi-objective trade-off between return and cost. A common practice is to update the multiplier automatically during training. Although this approach is standard in practice, there remains limited empirical evidence on the optimally achievable trade-off between return and cost as a function of $λ$, and there is currently no systematic benchmark comparing automated update mechanisms to this empirical optimum. Therefore, we study (i) the constraint geometry for eight widely used safety tasks and (ii) the previously overlooked constraint-regime sensitivity of different Lagrange multiplier update mechanisms in safe reinforcement learning. Through the lens of multi-objective analysis, we present empirical Pareto frontiers that offer a complete visualization of the trade-off between return and cost in the underlying optimization problem. Our results reveal the highly sensitive nature of $λ$ and further show that the restrictiveness of the constraint cost can vary across different cost limits within the same task. This highlights the importance of careful cost limit selection across different regions of cost restrictiveness when evaluating safe reinforcement learning methods. We provide a recommended set of cost limits for each evaluated task and offer an open-source code base: https://github.com/lindsayspoor/Lagrangian_SafeRL.
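The automatic multiplier update referred to in the abstract above is commonly implemented as projected gradient ascent on $λ$. The sketch below shows that standard form only; it is not taken from the linked code base, and the names `update_multiplier`, `lagrangian_objective`, and the learning rate `lr` are assumptions made for illustration.

```python
def lagrangian_objective(ret, cost, lam, cost_limit):
    """Scalarised objective the policy maximises for a fixed multiplier."""
    return ret - lam * (cost - cost_limit)


def update_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """One projected gradient-ascent step on the Lagrange multiplier.

    The multiplier grows while the measured cost exceeds the limit and
    shrinks otherwise; the max() keeps it non-negative.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical per-episode costs, for illustration only.
    for cost in [40.0, 35.0, 28.0, 24.0, 20.0]:
        lam = update_multiplier(lam, cost, cost_limit=25.0)
        print(f"cost={cost:5.1f}  lambda={lam:.3f}")
```

Fixing `lam` by hand instead of updating it and sweeping over a range of values is one way to trace an empirical return-cost frontier of the kind the abstract describes.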
Observer Design over Hypercomplex Quaternions
We develop observer design over hypercomplex quaternions in a characteristic-polynomial-free framework. Using the standard right-module convention, we derive a right observable companion form and companion polynomial that encode error dynamics through right-eigenvalue similarity classes. We also give an Ackermann-type formula for real-coefficient target polynomials, where polynomial evaluation is similarity-equivariant. The resulting recipes place observer poles directly over quaternions and clarify when companion-coordinate updates and one-shot Ackermann formulas remain valid.
comment: Accepted for presentation at the 24th European Control Conference (ECC 2026), Reykjavik, Iceland. This work was co-funded by the European Union under the project ROBOPROX (reg. no. CZ.02.01.01/00/22 008/0004590)
Differentiable Simulation of Hard Contacts with Soft Gradients for Learning and Control
Contact forces introduce discontinuities into robot dynamics that severely limit the use of simulators for gradient-based optimization. Penalty-based simulators, such as MuJoCo, soften contact resolution to enable gradient computation. However, realistically simulating hard contacts requires stiff solver settings, which leads to incorrect simulator gradients when using automatic differentiation. Conversely, using non-stiff settings strongly increases the sim-to-real gap. We analyze penalty-based simulators to pinpoint why gradients degrade under hard contacts. Building on these insights, we propose DiffMJX, which couples adaptive time integration with penalty-based simulation to substantially improve gradient accuracy. A second challenge is that contact gradients vanish when bodies separate. To address this, we introduce contacts from distance (CFD), which combines penalty-based simulation with straight-through estimation. By applying CFD exclusively in the backward pass, we obtain informative pre-contact gradients while retaining physical realism.
A control-theoretic simplification of adaptive bitrate (ABR) video streaming
Adaptive bitrate streaming (ABR) over the HyperText Transfer Protocol (HTTP), which raises numerous delicate questions, is nowadays almost the only approach to video streaming. This paper presents elementary solutions to three key issues: 1) A straightforward feedforward control strategy for the bitrate and the buffer level via flatness-based control. 2) Closing the loop permits mitigating unavoidable mismatches and disturbances, such as Internet fluctuations. This is adapted from the new HEOL setting, which mixes model-free and flatness-based controls. 3) An easily implementable closed-form estimate of the bandwidth via algebraic identification techniques is derived, perhaps for the first time. It permits handling severe variations in channel capacity. Several computer experiments and metrics for evaluating the Quality of Experience (QoE) are displayed and discussed.
comment: European Control Conference 2026 (ECC26), July 7-10, 2026, Reykjavík, Iceland
Towards Fair and Efficient allocation of Mobility-on-Demand resources through a Karma Economy
Mobility-on-demand systems like ride-hailing have transformed urban transportation, but they have also exacerbated socio-economic inequalities in access to these services, partly due to surge pricing strategies. Although several fairness-aware frameworks have been proposed in smart mobility, they often overlook the temporal and situational variability of user urgency that shapes real-world transportation demands. This paper introduces a non-monetary, Karma-based mechanism that models endogenous urgency, allowing user time-sensitivity to evolve in response to system conditions as well as external factors. We develop a theoretical framework maintaining the efficiency and fairness guarantees of classical Karma economies, while accommodating this realistic user behavior modeling. Applied to a simplified simulated mobility-on-demand scenario, we provide a proof-of-concept illustration of the proposed framework, showing that it exhibits promising behavior in terms of system efficiency and equitable resource allocation, while acknowledging that a full treatment of realistic MoD complexity remains an important direction for future work.
comment: 6 pages, 3 figures. ACCEPTED at the 2026 European Control Conference (ECC)
Mixed-Integer vs. Continuous Model Predictive Control for Binary Thrusters: A Comparative Study
Binary on/off thrusters are commonly used for spacecraft attitude and position control during proximity operations. However, their discrete nature poses challenges for conventional continuous control methods. The control of these discrete actuators is either explicitly formulated as a mixed-integer optimization problem or handled in a two-layer approach, where a continuous controller's output is converted to binary commands using analog-to-digital modulation techniques such as Delta-Sigma modulation. This paper provides the first systematic comparison between these two paradigms for binary thruster control, contrasting continuous Model Predictive Control (MPC) with Delta-Sigma modulation against direct Mixed-Integer MPC (MIMPC) approaches. Furthermore, we propose a new variant of MPC for binary actuated systems, which is informed using the state of the Delta-Sigma modulator. The two variations of the continuous MPC along with the MIMPC are evaluated through extensive simulations using ESA's REACSA platform. Results demonstrate that while all approaches perform similarly in high-thrust regimes, MIMPC achieves superior fuel efficiency in low-thrust conditions. Continuous MPC with modulation shows instabilities at higher thrust levels, while binary informed MPC, which incorporates modulator dynamics, improves robustness and reduces the efficiency gap to the MIMPC. It can be seen from the simulated and real-system experiments that MIMPC offers complete stability and fuel efficiency benefits, particularly for resource-constrained missions, while continuous control methods remain attractive for computationally limited applications.
comment: Accepted to CEAS EuroGNC 2026
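The two-layer approach described in the abstract above relies on Delta-Sigma modulation to turn a continuous thrust command into on/off firings. The sketch below shows a textbook first-order modulator for a single thruster. The abstract does not specify the modulator used in the paper, so the half-scale threshold, the constant thrust level `u_max`, and the name `delta_sigma` are assumptions.

```python
def delta_sigma(commands, u_max=1.0):
    """First-order Delta-Sigma modulation of continuous thrust commands.

    commands: iterable of desired thrust values in [0, u_max]
    u_max:    thrust delivered when the thruster is on (assumed constant)
    Returns a list of 0/1 firing decisions, one per time step.
    """
    integrator = 0.0   # accumulated difference between requested and delivered thrust
    firings = []
    for u in commands:
        integrator += u                          # add the requested thrust
        fire = 1 if integrator >= 0.5 * u_max else 0
        integrator -= fire * u_max               # subtract the thrust actually delivered
        firings.append(fire)
    return firings


if __name__ == "__main__":
    # A constant half-thrust request becomes an alternating firing pattern.
    print(delta_sigma([0.5, 0.5, 0.5, 0.5]))
```

The integrator value is the modulator state; a controller that reads this state knows how much requested thrust is still undelivered, which is the kind of information the proposed binary informed MPC is described as using.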
A Real-Time System for Scheduling and Managing UAV Delivery in Urban Areas
As urban logistics demand continues to grow, UAV delivery has become a key solution to improve delivery efficiency, reduce traffic congestion, and lower logistics costs. However, to fully leverage the potential of UAV delivery networks, efficient swarm scheduling and management are crucial. In this paper, we propose a real-time scheduling and management system based on the "Airport-Unloading Station" model, aiming to bridge the gap between high-level scheduling algorithms and low-level execution systems. This system, acting as middleware, accurately translates the requirements from the scheduling layer into specific execution instructions, ensuring that the scheduling algorithms perform effectively in real-world environments. Additionally, we implement three collaborative scheduling schemes involving autonomous ground vehicles (AGVs), unmanned aerial vehicles (UAVs), and ground staff to further optimize overall delivery efficiency. Through extensive experiments, this study demonstrates the rationality and feasibility of the proposed management system, providing a practical solution for the commercial application of UAV delivery in urban areas. Code: https://github.com/chengji253/UAVDeliverySystem
comment: ROBIO 2025
A Tactile-based Interactive Motion Planner for Robots in Unknown Cluttered Environments
In unknown cluttered environments with densely stacked objects, the free-motion space is extremely barren, posing significant challenges to motion planners. Collision-free planning methods often suffer from catastrophic failures due to unexpected collisions and motion obstructions. To address this issue, this paper proposes an interactive motion planning framework (I-MP), based on a perception-motion loop. This framework empowers robots to autonomously model and reason about contact models, which in turn enables safe expansion of the free-motion space. Specifically, the robot utilizes multimodal tactile perception to acquire stimulus-response signal pairs. This enables real-time identification of objects' mechanical properties and the subsequent construction of contact models. These models are integrated as computational constraints into a reactive planner. Based on fixed-point theorems, the planner computes the spatial state toward the target in real time, thus avoiding the computational burden associated with extrapolating on high-dimensional interaction models. Furthermore, high-dimensional interaction features are linearly superposed in Cartesian space in the form of energy, and the controller achieves trajectory tracking by solving the energy gradient from the current state to the planned state. The experimental results showed that at cruising speeds ranging from 0.01 to 0.07 m/s, the robot's initial contact force with objects remained stable at 1.0 ± 0.7 N. In the cabinet scenario test where collision-free trajectories were unavailable, I-MP expanded the free-motion space by 37.5% through active interaction, successfully completing the environmental exploration task.
Inverse-dynamics observer design for a linear single-track vehicle model with distributed tire dynamics
Accurate estimation of the vehicle's sideslip angle and tire forces is essential for enhancing safety and handling performances in unknown driving scenarios. To this end, the present paper proposes an innovative observer that combines a linear single-track model with a distributed representation of the tires and information collected from standard sensors. In particular, by adopting a comprehensive representation of the tires in terms of hyperbolic partial differential equations (PDEs), the proposed estimation strategy exploits dynamical inversion to reconstruct the lumped and distributed vehicle states solely from yaw rate and lateral acceleration measurements. Simulation results demonstrate the effectiveness of the observer in estimating the sideslip angle and tire forces even in the presence of noise and model uncertainties.
comment: 6 pages, 5 figures. Accepted at ECC 2026
A Goal-Oriented Approach for Active Object Detection with Exploration-Exploitation Balance
Active object detection, which aims to identify objects of interest through controlled camera movements, plays a pivotal role in real-world visual perception for autonomous robotic applications, such as manufacturing tasks (e.g., assembly operations) performed in unknown environments. A dual control for exploration and exploitation (DCEE) algorithm is presented within goal-oriented control systems to achieve efficient active object detection, leveraging active learning by incorporating variance-based uncertainty estimation in the cost function. This novel method employs an exploration-exploitation balanced cost function to actively guide the selection of the next viewpoint. Specifically, active object detection is achieved through the development of a reward function that encodes knowledge about the confidence variation of objects as a function of viewpoint position within a given domain. By identifying the unknown parameters of this function, the system generates an optimal viewpoint planning strategy. DCEE integrates parameter estimation of the reward function and view planning, ensuring a balanced trade-off between the exploitation of learned knowledge and active exploration during the planning process. Moreover, it demonstrates remarkable adaptability across diverse scenarios, effectively handling LEGO brick detection at varying locations. Importantly, the algorithm maintains consistent configuration settings and a fixed number of parameters across various scenarios, underscoring its efficiency and robustness. To validate the proposed approach, extensive numerical studies, high-fidelity virtual simulations, and real-world experiments under various scenarios were conducted. The results confirm the effectiveness of DCEE in active object detection, showcasing superior performance compared to existing methods, including model predictive control (MPC) and entropy approaches.
comment: 12 pages, 14 figures
Tilt-based Aberration Estimation in Transmission Electron Microscopy
Transmission electron microscopes (TEMs) enable atomic-scale imaging but suffer from aberrations caused by lens imperfections and environmental conditions, reducing image quality. These aberrations can be compensated by adjusting electromagnetic lenses, but this requires accurate estimates of the aberration coefficients, which can drift over time. This paper introduces a method for the estimation of aberrations in TEM by leveraging the relationship between an induced tilt of the electron beam and the resulting image shift. The method uses a Kalman filter (KF) to estimate the aberration coefficients from a sequence of image shifts, while accounting for the drift of the aberrations over time. The applied tilt sequence is optimized by minimizing the trace of the predicted error covariance in the KF, which corresponds to the A-optimality criterion in experimental design. We show that this optimization can be performed offline, as the cost criterion is independent of the actual measurements. The resulting non-convex optimization problem is solved using a gradient-based, receding-horizon approach with multi-starts. Additionally, we develop an approach to estimate specimen-dependent noise properties using expectation maximization (EM), which are then used to tailor the tilt pattern optimization to the specific specimen being imaged. The proposed method is validated on a real TEM set-up with several optimized tilt patterns. The results show that optimized patterns significantly outperform naive approaches and that the aberration and drift model accurately captures the underlying physical phenomena. A direct comparison with the widely used Zemlin tableau shows that the proposed method achieves comparable or higher image quality on amorphous specimens, while additionally extending to non-amorphous specimens where the Zemlin tableau cannot operate.
comment: Preprint (revised version). This manuscript is under peer review. Please cite the published version when available
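The tilt-based aberration abstract above notes that the tilt sequence can be optimized offline because the Kalman filter's error covariance does not depend on the measured values. A minimal sketch of that property follows, assuming a scalar random-walk state and a scalar measurement gain standing in for the applied tilt. The function name, noise variances, and gain values are hypothetical and are not taken from the paper.

```python
def predicted_variance(gains, p0=1.0, q=0.01, r=0.25):
    """Scalar Kalman covariance recursion for x[k+1] = x[k] + w, y[k] = h[k] * x[k] + v.

    gains: measurement gains h[k] (hypothetical stand-ins for applied tilts).
    p0, q, r: initial, drift (process), and measurement variances (assumed values).
    Returns the predicted variance for the step after the last update; in one
    dimension this equals the trace used by the A-optimality criterion.
    """
    p = p0
    for h in gains:
        p_pred = p + q                              # predict: drift inflates uncertainty
        k = p_pred * h / (h * h * p_pred + r)       # Kalman gain
        p = (1.0 - k * h) * p_pred                  # update: no measurement value appears
    return p + q


# Two candidate gain sequences can be ranked before any image is recorded.
weak = predicted_variance([0.5, 0.5, 0.5])
strong = predicted_variance([2.0, 2.0, 2.0])
print(weak, strong)
```

Because no measurement value enters the recursion, candidate sequences can be compared purely from the model. The real problem is multivariate and non-convex, which this toy version does not capture.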
Joint Price and Power MPC for Peak Power Reduction at Workplace EV Charging Stations
Demand charge, a utility fee based on an electricity customer's peak power consumption, often constitutes a significant portion of costs for commercial electric vehicle (EV) charging station operators. This paper explores control methods to reduce peak power consumption at workplace EV charging stations in a joint price and power optimization framework. We optimize a menu of price options to incentivize users to select controllable charging service. Using this framework, we propose a model predictive control approach to reduce both demand charge and overall operator costs. Through a Monte Carlo simulation, we find that our algorithm outperforms a state-of-the-art benchmark optimization strategy and can significantly reduce station operator costs.
comment: 2026 American Control Conference
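As a worked illustration of the demand charge defined in the abstract above (a fee based on peak power), the sketch below compares two charging schedules that deliver the same energy. The tariff rate, power profiles, and function names are hypothetical and are not drawn from the paper.

```python
def demand_charge(power_kw, rate_per_kw):
    """Demand charge: tariff rate times the peak power in the billing window."""
    return rate_per_kw * max(power_kw)


def energy_kwh(power_kw, step_hours):
    """Energy delivered by a piecewise-constant power profile."""
    return sum(power_kw) * step_hours


# Hypothetical hourly station load in kW over a four-hour window.
uncontrolled = [40.0, 40.0, 0.0, 0.0]   # charge at full power on arrival
controlled = [20.0, 20.0, 20.0, 20.0]   # same energy, spread over the window

rate = 15.0  # assumed tariff in currency units per kW of peak
print(energy_kwh(uncontrolled, 1.0), demand_charge(uncontrolled, rate))
print(energy_kwh(controlled, 1.0), demand_charge(controlled, rate))
```

Both schedules deliver 80 kWh, but flattening the profile halves the peak and therefore the demand charge. This is the kind of saving that controllable charging is meant to unlock.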
Robotics
CounterScene: Counterfactual Causal Reasoning in Generative World Models for Safety-Critical Closed-Loop Evaluation
Generating safety-critical driving scenarios requires understanding why dangerous interactions arise, rather than merely forcing collisions. However, existing methods rely on heuristic adversarial agent selection and unstructured perturbations, lacking explicit modeling of interaction dependencies and thus exhibiting a realism-adversarial trade-off. We present CounterScene, a framework that endows closed-loop generative BEV world models with structured counterfactual reasoning for safety-critical scenario generation. Given a safe scene, CounterScene asks: what if the causally critical agent had behaved differently? To answer this, we introduce causal adversarial agent identification to identify the critical agent and classify conflict types, and develop a conflict-aware interactive world model in which a causal interaction graph is used to explicitly model dynamic inter-agent dependencies. Building on this structure, stage-adaptive counterfactual guidance performs minimal interventions on the identified agent, removing its spatial and temporal safety margins while allowing risk to emerge through natural interaction propagation. Extensive experiments on nuScenes demonstrate that CounterScene achieves the strongest adversarial effectiveness while maintaining superior trajectory realism across all horizons, improving long-horizon collision rate from 12.3% to 22.7% over the strongest baseline with better realism (ADE 1.88 vs. 2.09). Notably, this advantage further widens over longer rollouts, and CounterScene generalizes zero-shot to nuPlan with state-of-the-art realism.
comment: 28 pages, 7 figures
Cortical Policy: A Dual-Stream View Transformer for Robotic Manipulation ICLR 2026
View transformers process multi-view observations to predict actions and have shown impressive performance in robotic manipulation. Existing methods typically extract static visual representations in a view-specific manner, leading to inadequate 3D spatial reasoning ability and a lack of dynamic adaptation. Taking inspiration from how the human brain integrates static and dynamic views to address these challenges, we propose Cortical Policy, a novel dual-stream view transformer for robotic manipulation that jointly reasons from static-view and dynamic-view streams. The static-view stream enhances spatial understanding by aligning features of geometrically consistent keypoints extracted from a pretrained 3D foundation model. The dynamic-view stream achieves adaptive adjustment through position-aware pretraining of an egocentric gaze estimation model, computationally replicating the human cortical dorsal pathway. Subsequently, the complementary view representations of both streams are integrated to determine the final actions, enabling the model to handle spatially-complex and dynamically-changing tasks under language conditions. Empirical evaluations on RLBench, the challenging COLOSSEUM benchmark, and real-world tasks demonstrate that Cortical Policy outperforms state-of-the-art baselines substantially, validating the superiority of dual-stream design for visuomotor control. Our cortex-inspired framework offers a fresh perspective for robotic manipulation and holds potential for broader application in vision-based robot control.
comment: Published as a conference paper at ICLR 2026. 10 pages, 4 figures. Appendix included
Dreaming the Unseen: World Model-regularized Diffusion Policy for Out-of-Distribution Robustness
Diffusion policies excel at visuomotor control but often fail catastrophically under severe out-of-distribution (OOD) disturbances, such as unexpected object displacements or visual corruptions. To address this vulnerability, we introduce the Dream Diffusion Policy (DDP), a framework that deeply integrates a diffusion world model into the policy's training objective via a shared 3D visual encoder. This co-optimization endows the policy with robust state-prediction capabilities. When encountering sudden OOD anomalies during inference, DDP detects the real-imagination discrepancy and actively abandons the corrupted visual stream. Instead, it relies on its internal "imagination" (autoregressively forecasted latent dynamics) to safely bypass the disruption, generating imagined trajectories before smoothly realigning with physical reality. Extensive evaluations demonstrate DDP's exceptional resilience. Notably, DDP achieves a 73.8% OOD success rate on MetaWorld (vs. 23.9% without predictive imagination) and an 83.3% success rate under severe real-world spatial shifts (vs. 3.3% without predictive imagination). Furthermore, as a stress test, DDP maintains a 76.7% real-world success rate even when relying entirely on open-loop imagination post-initialization.
comment: Under review
OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields
Adaptive 360° video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their "black-box" nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines (~98.5%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360° streaming in teleoperation.
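The saturation-based PD controller for buffer regulation mentioned in the OrbitStream abstract can be sketched generically as below. This is an illustrative sketch only: the gains, saturation limits, sign convention, and the way the output would map to a bitrate decision are all assumptions, not the paper's design.

```python
def saturated_pd(buffer_s, prev_buffer_s, target_s, dt,
                 kp=0.5, kd=0.1, u_min=-1.0, u_max=1.0):
    """Proportional-derivative control on buffer error, clipped to [u_min, u_max].

    buffer_s, prev_buffer_s: current and previous buffer levels in seconds.
    target_s: desired buffer level; dt: time between the two samples.
    kp, kd, u_min, u_max: hypothetical gains and saturation limits.
    """
    error = target_s - buffer_s
    d_error = -(buffer_s - prev_buffer_s) / dt   # target assumed constant
    u = kp * error + kd * d_error
    return max(u_min, min(u_max, u))             # saturation bounds the command


print(saturated_pd(0.0, 0.0, 10.0, 1.0))    # large deficit, command saturates
print(saturated_pd(10.0, 10.0, 10.0, 1.0))  # on target, no correction
```

Clipping keeps the command bounded however large the buffer error becomes, which is one reason a saturated controller is easy to reason about in latency-sensitive loops.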
Geometrically Plausible Object Pose Refinement using Differentiable Simulation
State-of-the-art object pose estimation methods are prone to generating geometrically infeasible pose hypotheses. This problem is prevalent in dexterous manipulation, where estimated poses often intersect with the robotic hand or are not lying on a support surface. We propose a multi-modal pose refinement approach that combines differentiable physics simulation, differentiable rendering and visuo-tactile sensing to optimize object poses for both spatial accuracy and physical consistency. Simulated experiments show that our approach reduces the intersection volume error between the object and robotic hand by 73% when the initial estimate is accurate and by over 87% under high initial uncertainty, significantly outperforming standard ICP-based baselines. Furthermore, the improvement in geometric plausibility is accompanied by a concurrent reduction in translation and orientation errors. Achieving pose estimation that is grounded in physical reality while remaining faithful to multi-modal sensor inputs is a critical step toward robust in-hand manipulation.
HyReach: Vision-Guided Hybrid Manipulator Reaching in Unseen Cluttered Environments
As robotic systems increasingly operate in unstructured, cluttered, and previously unseen environments, there is a growing need for manipulators that combine compliance, adaptability, and precise control. This work presents a real-time hybrid rigid-soft continuum manipulator system designed for robust open-world object reaching in such challenging environments. The system integrates vision-based perception and 3D scene reconstruction with shape-aware motion planning to generate safe trajectories. A learning-based controller drives the hybrid arm to arbitrary target poses, leveraging the flexibility of the soft segment while maintaining the precision of the rigid segment. The system operates without environment-specific retraining, enabling direct generalization to new scenes. Extensive real-world experiments demonstrate consistent reaching performance with errors below 2 cm across diverse cluttered setups, highlighting the potential of hybrid manipulators for adaptive and reliable operation in unstructured environments.
comment: 8 pages, 5 figures, 5 tables
Bayesian Active Object Recognition and 6D Pose Estimation from Multimodal Contact Sensing
We present an active tactile exploration framework for joint object recognition and 6D pose estimation. The proposed method integrates wrist force/torque sensing, GelSight tactile sensing, and free-space constraints within a Bayesian inference framework that maintains a belief over object class and pose during active tactile exploration. By combining contact and non-contact evidence, the framework reduces ambiguity and improves robustness in the joint class-pose estimation problem. To enable efficient inference in the large hypothesis space, we employ a customized particle filter that progressively samples particles based on new observations. The inferred belief is further used to guide active exploration by selecting informative next touches under reachability constraints. For effective data collection, a motion planning and control framework is developed to plan and execute feasible paths for tactile exploration, handle unexpected contacts and GelSight-surface alignment with tactile servoing. We evaluate the framework in simulation and on a Franka Panda robot using 11 YCB objects. Results show that incorporating tactile and free-space information substantially improves recognition and pose estimation accuracy and stability, while reducing the number of action cycles compared with force/torque-only baselines. Code, dataset, and supplementary material will be made available online.
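The belief maintenance described in the abstract above rests on Bayes' rule. A minimal sketch for the object-class part alone is given below, assuming a discrete class set and per-observation likelihoods supplied by the caller. The class names and likelihood values are hypothetical, and the paper's particle filter over joint class and pose is not reproduced here.

```python
def bayes_update(belief, likelihood):
    """Posterior over classes: prior times likelihood, renormalised.

    belief: dict mapping class name to prior probability.
    likelihood: dict mapping class name to p(observation | class).
    """
    unnorm = {c: belief[c] * likelihood[c] for c in belief}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


belief = {"mug": 0.5, "can": 0.5}
# A contact observation that fits a mug better (assumed likelihoods).
belief = bayes_update(belief, {"mug": 0.9, "can": 0.1})
# A free-space (non-contact) observation can be fused the same way.
belief = bayes_update(belief, {"mug": 0.6, "can": 0.4})
print(belief)
```

Contact and non-contact evidence enter through the same update, which is how combining the two can shrink ambiguity faster than contact alone.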
DyGeoVLN: Infusing Dynamic Geometry Foundation Model into Vision-Language Navigation
Vision-language Navigation (VLN) requires an agent to understand visual observations and language instructions to navigate in unseen environments. Most existing approaches rely on static scene assumptions and struggle to generalize in dynamic, real-world scenarios. To address this challenge, we propose DyGeoVLN, a dynamic geometry-aware VLN framework. Our method infuses a dynamic geometry foundation model into the VLN framework through cross-branch feature fusion to enable explicit 3D spatial representation and visual-semantic reasoning. To efficiently compress historical token information in long-horizon, dynamic navigation, we further introduce a novel pose-free and adaptive-resolution token-pruning strategy. This strategy can remove spatio-temporal redundant tokens to reduce inference cost. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on multiple benchmarks and exhibits strong robustness in real-world environments.
Evaluating Factor-Wise Auxiliary Dynamics Supervision for Latent Structure and Robustness in Simulated Humanoid Locomotion
We evaluate whether factor-wise auxiliary dynamics supervision produces useful latent structure or improved robustness in simulated humanoid locomotion. DynaMITE -- a transformer encoder with a factored 24-d latent trained by per-factor auxiliary losses during proximal policy optimization (PPO) -- is compared against Long Short-Term Memory (LSTM), plain Transformer, and Multilayer Perceptron (MLP) baselines on a Unitree G1 humanoid across four Isaac Lab tasks. The supervised latent shows no evidence of decodable or functionally separable factor structure: probe R^2 ~ 0 for all five dynamics factors, clamping any subspace changes reward by < 0.05, and standard disentanglement metrics (MIG, DCI, SAP) are near zero. An unsupervised LSTM hidden state achieves higher probe R^2 (up to 0.10). A 2x2 factorial ablation (n = 10 seeds) isolates the contributions of the tanh bottleneck and auxiliary losses: the auxiliary losses show no measurable effect on either in-distribution (ID) reward (+0.03, p = 0.732) or severe out-of-distribution (OOD) reward (+0.03, p = 0.669), while the bottleneck shows a small, consistent advantage in both regimes (ID: +0.16, p = 0.207; OOD: +0.10, p = 0.208). The bottleneck advantage persists under severe combined perturbation but does not amplify, indicating a training-time representation benefit rather than a robustness mechanism. LSTM achieves the best nominal reward on all four tasks (p < 0.03); DynaMITE degrades less under combined-shift stress (2.3% vs. 16.7%), but this difference is attributable to the bottleneck compression, not the auxiliary supervision. For locomotion practitioners: auxiliary dynamics supervision does not produce an interpretable estimator and does not measurably improve reward or robustness beyond what the bottleneck alone provides; recurrent baselines remain the stronger choice for nominal performance.
comment: 17 pages, 9 figures, 25 tables
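The probe R^2 values quoted in the abstract above measure how well a dynamics factor can be linearly decoded from a latent. A minimal sketch of that metric is shown below, assuming a single latent coordinate and an in-sample least-squares fit. The study's probes operate on a 24-d latent and its exact protocol is not given here, so the function and data are illustrative only.

```python
def probe_r2(latent, factor):
    """R^2 of a one-feature linear least-squares probe from latent to factor."""
    n = len(latent)
    mx = sum(latent) / n
    my = sum(factor) / n
    sxx = sum((x - mx) ** 2 for x in latent)
    sxy = sum((x - mx) * (y - my) for x, y in zip(latent, factor))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(latent, factor))
    ss_tot = sum((y - my) ** 2 for y in factor)
    return 1.0 - ss_res / ss_tot


# A latent that encodes the factor linearly, and one that carries no linear signal.
print(probe_r2([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
print(probe_r2([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]))
```

An R^2 near zero, as reported for all five factors, means the probe does no better than predicting the factor's mean.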
GAPG: Geometry Aware Push-Grasping Synergy for Goal-Oriented Manipulation in Clutter ICRA 2026
Grasping target objects is a fundamental skill for robotic manipulation, but in cluttered environments with stacked or occluded objects, a single-step grasp is often insufficient. To address this, previous work has introduced pushing as an auxiliary action to create graspable space. However, these methods often struggle with both stability and efficiency because they neglect the scene's geometric information, which is essential for evaluating grasp robustness and ensuring that pushing actions are safe and effective. To this end, we propose a geometry-aware push-grasp synergy framework that leverages point cloud data to integrate grasp and push evaluation. Specifically, the grasp evaluation module analyzes the geometric relationship between the gripper's point cloud and the points enclosed within its closing region to determine grasp feasibility and stability. Guided by this, the push evaluation module predicts how pushing actions influence future graspable space, enabling the robot to select actions that reliably transform non-graspable states into graspable ones. By jointly reasoning about geometry in both grasping and pushing, our framework achieves safer, more efficient, and more reliable manipulation in cluttered settings. Our method is extensively tested in simulation and real-world environments in various scenarios. Experimental results demonstrate that our model generalizes well to real-world scenes and unseen objects.
comment: Accepted to ICRA 2026
Architecture for Multi-Unmanned Aerial Vehicles based Autonomous Precision Agriculture Systems
The use of unmanned aerial vehicles (UAVs) in precision agriculture has increased sharply in recent years. As such, systems that aim to apply various algorithms in the field need a structured framework of abstractions. This paper defines the various tasks of UAVs in precision agriculture and models them in an architectural framework. The presented architecture assumes that the defined tasks are carried out by multiple coordinated and cooperative UAVs with minimal physical intervention. Tasks such as image processing, path planning, communication, data acquisition, and field mapping are incorporated in the architecture to provide an efficient system. In addition, the limitations of applying multi-UAV systems in precision agriculture have been considered in designing the architecture. The architecture provides an autonomous end-to-end solution, spanning mission planning, data acquisition, and image processing, that is highly efficient and can enable farmers to comprehensively deploy UAVs on their land. Simulation and field tests show that the architecture offers a number of advantages, including fault tolerance, robustness, and developer- and user-friendliness.
Affordance-Guided Enveloping Grasp Demonstration Toward Non-destructive Disassembly of Pinch-Infeasible Mating Parts
Robotic disassembly of complex mating components often renders pinch grasping infeasible, necessitating multi-fingered enveloping grasps. However, visual occlusions and geometric constraints complicate teaching appropriate grasp motions when relying solely on 2D camera feeds. To address this, we propose an affordance-guided teleoperation method that pre-generates enveloping grasp candidates via physics simulation. These Affordance Templates (ATs) are visualized with a color gradient reflecting grasp quality to augment operator perception. Simulations demonstrate the method's generality across various components. Real-robot experiments validate that AT-based visual augmentation enables operators to effectively select and teach enveloping grasp strategies for real-world disassembly, even under severe visual and geometric constraints.
comment: 6 pages, 7 figures
Dynamic Control Barrier Function Regulation with Vision-Language Models for Safe, Adaptive, and Realtime Visual Navigation
Robots operating in dynamic, unstructured environments must balance safety and efficiency under potentially limited sensing. While control barrier functions (CBFs) provide principled collision avoidance via safety filtering, their behavior is often governed by fixed parameters that can be overly conservative in benign scenes or overly permissive near hazards. We present AlphaAdj, a vision-to-control navigation framework that uses egocentric RGB input to adapt the conservativeness of a CBF safety filter in real time. A vision-language model (VLM) produces a bounded scalar risk estimate from the current camera view, which we map to dynamically update a CBF parameter that modulates how strongly safety constraints are enforced. To address asynchronous inference and non-trivial VLM latency in practice, we combine a geometric, speed-aware dynamic cap and a staleness-gated fusion policy with lightweight implementation choices that reduce end-to-end inference overhead. We evaluate AlphaAdj across multiple static and dynamic obstacle scenarios in a variety of environments, comparing against fixed-parameter and uncapped ablations. Results show that AlphaAdj maintains collision-free navigation while improving efficiency (in terms of path length and time to goal) by up to 18.5% relative to fixed settings and improving robustness and success rate relative to an uncapped baseline.
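The idea of mapping a bounded risk estimate to a CBF parameter can be illustrated in one dimension. The sketch below assumes a robot approaching an obstacle head-on with barrier h = distance - d_safe, so the CBF condition dh/dt + alpha * h >= 0 reduces to v <= alpha * h. The linear risk-to-alpha map, the fallback to the most conservative alpha for stale estimates, and all numeric values are assumptions for illustration; the paper's speed-aware cap and fusion policy are not modeled.

```python
def alpha_from_risk(risk, age_s, alpha_min=0.2, alpha_max=2.0, max_age_s=1.0):
    """Map a risk estimate in [0, 1] to a CBF gain; higher risk gives lower alpha.

    age_s: time since the estimate was produced. Estimates older than max_age_s
    are ignored and the most conservative gain is used (assumed policy).
    """
    if age_s > max_age_s:
        return alpha_min
    risk = min(1.0, max(0.0, risk))
    return alpha_min + (1.0 - risk) * (alpha_max - alpha_min)


def filter_speed(v_nominal, distance, d_safe, alpha):
    """1-D CBF safety filter: enforce v <= alpha * h with h = distance - d_safe.

    The result is clamped at zero, i.e. the robot stops rather than reverses.
    """
    h = distance - d_safe
    return min(v_nominal, max(0.0, alpha * h))


benign = alpha_from_risk(0.0, age_s=0.1)
hazard = alpha_from_risk(1.0, age_s=0.1)
print(filter_speed(1.0, 1.5, 0.5, benign), filter_speed(1.0, 1.5, 0.5, hazard))
```

With the same geometry, a benign scene lets the nominal speed through while a high-risk scene throttles it. This is the trade-off a fixed alpha cannot make.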
Anatomical Prior-Driven Framework for Autonomous Robotic Cardiac Ultrasound Standard View Acquisition ICRA 2026
Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, this study proposes an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted, and their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
comment: Accepted for publication at the IEEE ICRA 2026. 8 pages, 5 figures, 3 tables
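The abstract above fits anatomical feature priors to Gaussian distributions and uses the degree of prior matching as the RL reward. One plausible form is sketched below, assuming an unnormalised Gaussian score averaged over features. The feature name, sample values, and the averaging rule are hypothetical and are not the paper's reward definition.

```python
import math
import statistics


def fit_gaussian(samples):
    """Fit a Gaussian prior (mean, sample standard deviation) to feature samples."""
    return statistics.mean(samples), statistics.stdev(samples)


def prior_match(value, mean, std):
    """Unnormalised Gaussian score in (0, 1]; equals 1 when value is at the mean."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z)


def reward(features, priors):
    """Average prior-match score over all features that have a prior."""
    return sum(prior_match(features[k], *priors[k]) for k in priors) / len(priors)


# Hypothetical feature measured on standard-view images.
priors = {"area_ratio": fit_gaussian([1.0, 2.0, 3.0])}
print(reward({"area_ratio": 2.0}, priors), reward({"area_ratio": 4.0}, priors))
```

A view whose features sit near the prior means scores close to 1, and the score decays smoothly as the probe drifts away from the standard view.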
VisFly-Lab: Unified Differentiable Framework for First-Order Reinforcement Learning of Quadrotor Control
First-order reinforcement learning with differentiable simulation is promising for quadrotor control, but practical progress remains fragmented across task-specific settings. To support more systematic development and evaluation, we present a unified differentiable framework for multi-task quadrotor control. The framework is wrapped, extensible, and equipped with deployment-oriented dynamics, providing a common interface across four representative tasks: hovering, tracking, landing, and racing. We also present a suite of first-order learning algorithms and identify two practical bottlenecks of standard first-order training: limited state coverage caused by horizon initialization and gradient bias caused by partially non-differentiable rewards. To address these issues, we propose Amended Backpropagation Through Time (ABPT), which combines differentiable rollout optimization, a value-based auxiliary objective, and visited-state initialization to improve training robustness. Experimental results show that ABPT yields the clearest gains in tasks with partially non-differentiable rewards, while remaining competitive in fully differentiable settings. We further provide proof-of-concept real-world deployments showing initial transferability of policies learned in the proposed framework beyond simulation.
DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control
Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. However, their potential has not been fully explored in the literature. To bridge the gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at https://dit4dit.github.io/.
comment: https://dit4dit.github.io/
Unified Generation-Refinement Planning: Bridging Guided Flow Matching and Sampling-Based MPC for Social Navigation
Robust robot planning in dynamic, human-centric environments remains challenging due to multimodal uncertainty, the need for real-time adaptation, and safety requirements. Optimization-based planners enable explicit constraint handling but can be sensitive to initialization and struggle in dynamic settings. Learning-based planners capture multimodal solution spaces more naturally, but often lack reliable constraint satisfaction. In this paper, we introduce a unified generation-refinement framework that combines reward-guided conditional flow matching (CFM) with model predictive path integral (MPPI) control. Our key idea is a bidirectional information exchange between generation and optimization: reward-guided CFM produces diverse, informed trajectory priors for MPPI refinement, while the optimized MPPI trajectory warm-starts the next CFM generation step. Using autonomous social navigation as a motivating application, we demonstrate that the proposed approach improves the trade-off between safety, task performance, and computation time, while adapting to dynamic environments in real-time. The source code is publicly available at https://cfm-mppi.github.io.
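The MPPI refinement step named in the abstract above reweights candidate control sequences by their cost. The sketch below shows the standard softmin weighting and weighted-average update, assuming the candidates are provided by the caller. In the described framework they would come from the CFM prior, which is not modeled here, and the temperature and sample values are hypothetical.

```python
import math


def mppi_weights(costs, temperature=1.0):
    """Softmin weights exp(-cost / temperature), normalised to sum to one."""
    c_min = min(costs)                      # shift for numerical stability
    unnorm = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def mppi_update(candidates, costs, temperature=1.0):
    """Cost-weighted average of candidate control sequences of equal length."""
    w = mppi_weights(costs, temperature)
    horizon = len(candidates[0])
    return [sum(w[i] * candidates[i][t] for i in range(len(candidates)))
            for t in range(horizon)]


candidates = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # hypothetical sampled controls
costs = [5.0, 1.0, 5.0]                              # middle candidate is cheapest
print(mppi_update(candidates, costs, temperature=0.5))
```

A lower temperature concentrates weight on the cheapest candidates, and a higher one averages more broadly. The refined sequence is what would warm-start the next generation step in the loop the abstract describes.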
Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models
Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains strongly bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement by adapting foundation VLMs into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post-hoc, our method leverages the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Initializing with a base policy trained via Imitation Learning (IL), we employ these VLM rewards to guide the model to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results demonstrate that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, demonstrating remarkable sample efficiency. This empirical evidence highlights that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.
MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control
For stabilizing control tasks, model-free reinforcement learning (RL) approaches face numerous challenges, particularly regarding the issues of effectiveness and efficiency in complex high-dimensional environments with limited training data. To address these challenges, we propose Multi-Step Actor-Critic Learning with Lyapunov Certificates (MSACL), a novel approach that integrates exponential stability into off-policy maximum entropy reinforcement learning (MERL). In contrast to existing RL-based approaches that depend on elaborate reward engineering and single-step constraints, MSACL adopts intuitive reward design and exploits multi-step samples to enable exploratory actor-critic learning. Specifically, we first introduce Exponential Stability Labels (ESLs) to categorize training samples and propose a $λ$-weighted aggregation mechanism to learn Lyapunov certificates. Based on these certificates, we further design a stability-aware advantage function to guide policy optimization, thereby promoting rapid Lyapunov descent and robust state convergence. We evaluate MSACL across six benchmarks, comprising four stabilizing and two high-dimensional tracking tasks. Experimental results demonstrate its consistent performance improvements over both standard RL baselines and state-of-the-art Lyapunov-based RL algorithms. Beyond rapid convergence, MSACL exhibits robustness against environmental uncertainties and generalization to unseen reference signals. The source code and benchmarking environments are available at https://github.com/YuanZhe-Xing/MSACL.
comment: This work has been submitted to the IEEE for possible publication
Implicit Maximum Likelihood Estimation for Real-time Generative Model Predictive Control ICRA
Diffusion-based models have recently shown strong performance in trajectory planning, as they are capable of capturing diverse, multimodal distributions of complex behaviors. A key limitation of these models is their slow inference speed, which results from the iterative denoising process. This makes them less suitable for real-time applications such as closed-loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well suited for real-time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion-based planner, while substantially improving planning speed in both open-loop and closed-loop settings. We further validate IMLE in a closed-loop human navigation scenario, operating in real-time, demonstrating how it enables rapid and adaptive plan generation in dynamic environments. Real-world videos and code are available at https://gmpc-imle.github.io/.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
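As a rough illustration of the IMLE objective referred to above: each data point is matched to its nearest generated sample, and the generator is updated to shrink that distance. The sketch below uses a toy scalar generator g(z) = a·z + b with hand-written gradients; the generator form and the names `imle_step` and `imle_loss` are assumptions for illustration and do not reflect the paper's planner architecture or conditioning.

```python
import random

def imle_loss(data, params, latents):
    """Mean squared distance from each data point to its nearest generated sample."""
    a, b = params
    return sum(min((a * z + b - x) ** 2 for z in latents) for x in data) / len(data)

def imle_step(data, params, num_latents=16, lr=0.05, rng=None):
    """One IMLE gradient step for the toy generator g(z) = a * z + b."""
    rng = rng or random.Random(0)
    a, b = params
    latents = [rng.gauss(0.0, 1.0) for _ in range(num_latents)]
    grad_a = grad_b = 0.0
    for x in data:
        # Match the data point to the latent whose sample lies closest to it.
        z = min(latents, key=lambda z_j: (a * z_j + b - x) ** 2)
        residual = a * z + b - x
        # Gradient of the squared distance with respect to a and b.
        grad_a += 2.0 * residual * z / len(data)
        grad_b += 2.0 * residual / len(data)
    return a - lr * grad_a, b - lr * grad_b
```

Because every data point pulls some sample toward itself, no mode of the data is left uncovered, and generating a plan at inference time is a single forward pass rather than an iterative denoising loop.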
DYMO-Hair: Generalizable Volumetric Dynamics Modeling for Robot Hair Manipulation ICRA 2026
Hair care is an essential daily activity, yet it remains inaccessible to individuals with limited mobility and challenging for autonomous robot systems due to the fine-grained physical structure and complex dynamics of hair. In this work, we present DYMO-Hair, a model-based robot hair care system. We introduce a novel dynamics learning paradigm that is suited for volumetric quantities such as hair, relying on an action-conditioned latent state editing mechanism, coupled with a compact 3D latent space of diverse hairstyles to improve generalizability. This latent space is pre-trained at scale using a novel hair physics simulator, enabling generalization across previously unseen hairstyles. Using the dynamics model with a Model Predictive Path Integral (MPPI) planner, DYMO-Hair is able to perform visual goal-conditioned hair styling. Experiments in simulation demonstrate that DYMO-Hair's dynamics model outperforms baselines on capturing local deformation for diverse, unseen hairstyles. DYMO-Hair further outperforms baselines in closed-loop hair styling tasks on unseen hairstyles, with an average of 22% lower final geometric error and 42% higher success rate than the state-of-the-art system. Real-world experiments exhibit zero-shot transferability of our system to wigs, achieving consistent success on challenging unseen hairstyles where the state-of-the-art system fails. Together, these results introduce a foundation for model-based robot hair care, advancing toward more generalizable, flexible, and accessible robot hair styling in unconstrained physical environments. More details are available on our project page: https://dymohair.github.io/.
comment: To appear in ICRA 2026. Project page: https://dymohair.github.io/
Towards Unified World Models for Visual Navigation via Memory-Augmented Planning and Foresight
Enabling embodied agents to imagine future states is essential for robust and generalizable visual navigation. Yet, state-of-the-art systems typically rely on modular designs that decouple navigation planning from visual world modeling, which often induces state-action misalignment and weak adaptability in novel or dynamic scenarios. We propose UniWM, a unified, memory-augmented world model that integrates egocentric visual foresight and planning within a single multimodal autoregressive backbone. UniWM explicitly grounds action selection in visually imagined outcomes, tightly aligning prediction with control. Meanwhile, a hierarchical memory mechanism fuses short-term perceptual cues with longer-term trajectory context, supporting stable and coherent reasoning over extended horizons. Extensive experiments on four challenging benchmarks (Go Stanford, ReCon, SCAND, HuRoN) and the 1X Humanoid Dataset show that UniWM improves navigation success rates by up to 30%, substantially reduces trajectory errors against strong baselines, generalizes zero-shot to the unseen TartanDrive dataset, and scales naturally to high-dimensional humanoid control. These results position UniWM as a principled step toward unified, imagination-driven embodied navigation. The code and models are available at https://github.com/F1y1113/UniWM.
comment: 21 pages, 12 figures, code: https://github.com/F1y1113/UniWM
Graph-of-Constraints Model Predictive Control for Reactive Multi-agent Task and Motion Planning ICRA 2026
Sequences of interdependent geometric constraints are central to many multi-agent Task and Motion Planning (TAMP) problems. However, existing methods for handling such constraint sequences struggle with partially ordered tasks and dynamic agent assignments. They typically assume static assignments and cannot adapt when disturbances alter task allocations. To overcome these limitations, we introduce Graph-of-Constraints Model Predictive Control (GoC-MPC), a generalized sequence-of-constraints framework integrated with MPC. GoC-MPC naturally supports partially ordered tasks, dynamic agent coordination, and disturbance recovery. By defining constraints over tracked 3D keypoints, our method robustly solves diverse multi-agent manipulation tasks, coordinating agents and adapting online from visual observations alone, without relying on training data or environment models. Experiments demonstrate that GoC-MPC achieves higher success rates, significantly faster TAMP computation, and shorter overall paths compared to recent baselines, establishing it as an efficient and robust solution for multi-agent manipulation under real-world disturbances. Our supplementary video and code can be found at https://sites.google.com/view/goc-mpc/home.
comment: 8 main content pages, 4 main content figures, camera ready version submitted to IEEE International Conference on Robotics and Automation (ICRA 2026)
Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
The performance of learned robot visuomotor policies is heavily dependent on the size and quality of the training dataset. Although large-scale robot and human datasets are increasingly available, embodiment gaps and mismatched action spaces make them difficult to leverage. Our main insight is that skills performed across different embodiments produce visual similarities in motions that can be captured using off-the-shelf action representations such as optical flow. Moreover, World Models (WMs) can leverage sub-optimal data since they focus on modeling dynamics. In this work, we aim to improve visuomotor policies in low-data regimes by first pretraining a WM using optical flow as an embodiment-agnostic action representation to leverage accessible or easily collected data from multiple embodiments (robots, humans). Given a small set of demonstrations on a target embodiment, we finetune the WM on this data to better align the WM predictions, train a base policy, and learn a robust value function. Using our finetuned WM and value function, our approach evaluates action candidates from the base policy and selects the best one to improve performance. Our approach, which we term Latent Policy Steering (LPS), improves behavior-cloned policies by 10.6% on average across four Robomimic tasks, even though most of the pretraining data comes from the real world. In the real-world experiments, LPS achieves larger gains: 70% relative improvement with 30-50 target-embodiment demonstrations, and 44% relative improvement with 60-100 demonstrations, compared to a behavior-cloned baseline. Qualitative results can be found on the website: https://yiqiwang8177.github.io/LatentPolicySteering/.
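The steering step described in this abstract (score action candidates from the base policy with the world model and value function, then pick the best) can be sketched as follows. This is a minimal illustration under assumed interfaces: `world_model(latent, action)` returns the next latent and `value_fn(latent)` returns a scalar; both are hypothetical callables, not the paper's API.

```python
from typing import Callable, Sequence, TypeVar

Latent = TypeVar("Latent")
Action = TypeVar("Action")

def select_action_plan(
    latent: Latent,
    candidates: Sequence[Sequence[Action]],
    world_model: Callable[[Latent, Action], Latent],
    value_fn: Callable[[Latent], float],
) -> Sequence[Action]:
    """Return the candidate plan whose imagined final latent has the highest value."""
    def score(plan: Sequence[Action]) -> float:
        z = latent
        for action in plan:
            # Roll the plan forward in latent space without touching the robot.
            z = world_model(z, action)
        return value_fn(z)
    return max(candidates, key=score)

# Toy usage: scalar latent, additive dynamics, value peaks at latent == 1.0.
best = select_action_plan(
    0.0,
    [[1.0, 1.0], [0.5, 0.5], [-1.0, 0.0]],
    world_model=lambda z, a: z + a,
    value_fn=lambda z: -(z - 1.0) ** 2,
)
```

In the toy usage the second candidate is chosen because its imagined end state lands exactly on the value peak.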
Causal World Modeling for Robot Control
This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.
comment: Project page: https://technology.robbyant.com/lingbot-va Code: https://github.com/robbyant/lingbot-va
Learning collision risk proactively from naturalistic driving data at scale
Accurately and proactively alerting drivers or automated systems to emerging collisions is crucial for road safety, particularly in highly interactive and complex urban environments. Existing methods either require labour-intensive annotation of sparse risk, struggle to consider varying contextual factors, or are tailored to limited scenarios. Here we present the Generalised Surrogate Safety Measure (GSSM), a data-driven approach that learns collision risk from naturalistic driving without the need for crash or risk labels. Trained over multiple datasets and evaluated on 2,591 real-world crashes and near-crashes, a basic GSSM using only instantaneous motion kinematics achieves an area under the precision-recall curve of 0.9, and secures a median time advance of 2.6 seconds to prevent potential collisions. Incorporating additional interaction patterns and contextual factors provides further performance gains. Across interaction scenarios such as rear-end, merging, and turning, GSSM consistently outperforms existing baselines in accuracy and timeliness. These results establish GSSM as a scalable, context-aware, and generalisable foundation to identify risky interactions before they become unavoidable, supporting proactive safety in autonomous driving systems and traffic incident management. Code and experiment data are openly accessible at https://github.com/Yiru-Jiao/GSSM.
comment: Officially published in Nature Machine Intelligence. Equation (15) in the previous versions was wrong, which has been corrected since v4
Fast Path Planning for Autonomous Vehicle Parking with Safety-Guarantee using Hamilton-Jacobi Reachability
We present a fast planning architecture called Hamilton-Jacobi-based bidirectional A* (HJBA*) to solve general tight parking scenarios. The algorithm has two layers: a high-level HJ-based reachability analysis and a lower-level bidirectional A* search. In the high-level reachability analysis, a backward reachable tube (BRT) that accounts for the vehicle dynamics is computed by HJ analysis and intersected with a safe set to obtain a safe reachable set. The safe set is defined by constraints of positive signed distances to obstacles in the environment and is computed offline by solving QP optimization problems. For states inside the intersection set, i.e., the safe reachable set, the computed backward reachable tube ensures they are reachable subject to system dynamics and input bounds, and the safe set guarantees they satisfy parking safety with respect to obstacles of different shapes. For online computation, randomized states are sampled from the safe reachable set and used as heuristic guide points in the bidirectional A* search. The bidirectional A* search is run in parallel for each randomized state from the safe reachable set. We show that the proposed two-level planning algorithm solves different parking scenarios effectively and quickly for typical parking requests. We validate our algorithm through simulations in large-scale randomized parking scenarios and demonstrate that it outperforms other state-of-the-art parking planning algorithms.
comment: accepted by IEEE Transactions on Vehicular Technology
ArtiSG: Functional 3D Scene Graph Construction via Human-demonstrated Articulated Objects Manipulation
3D scene graphs have empowered robots with semantic understanding for navigation and planning. However, current functional scene graphs primarily focus on static element detection, lacking the actionable kinematic information required for physical manipulation, particularly regarding articulated objects. Existing approaches for inferring articulation mechanisms from static observations are prone to visual ambiguity, while methods that estimate parameters from state changes typically rely on constrained settings such as fixed cameras and unobstructed views. Furthermore, inconspicuous functional elements like hidden handles are frequently missed by pure visual perception. To bridge this gap, we present ArtiSG, a framework that constructs functional 3D scene graphs by encoding human demonstrations into structured robotic memory. Our approach leverages a robust data collection pipeline utilizing a portable hardware setup to accurately track 6-DoF manipulation trajectories and estimate articulation axes, even under camera ego-motion. By integrating these kinematic priors into a hierarchical, open-vocabulary graph, our system not only models how articulated objects move but also utilizes physical interaction data to discover implicit elements. Extensive real-world experiments demonstrate that ArtiSG significantly outperforms baselines in functional element recall and articulation estimation precision. Moreover, we show that the constructed graph serves as a reliable robotic memory, effectively guiding robots to perform language-directed manipulation tasks in real-world environments containing diverse articulated objects.
Optimal Solutions for the Moving Target Vehicle Routing Problem via Branch-and-Price with Relaxed Continuity ICAPS 2026
The Moving Target Vehicle Routing Problem (MT-VRP) seeks trajectories for several agents that intercept a set of moving targets, subject to speed, time window, and capacity constraints. We introduce an exact algorithm, Branch-and-Price with Relaxed Continuity (BPRC), for the MT-VRP. The main challenge in a branch-and-price approach for the MT-VRP is the pricing subproblem, which is complicated by moving targets and time-dependent travel costs between targets. Our key contribution is a new labeling algorithm that solves this subproblem by means of a novel dominance criterion tailored for problems with moving targets. Numerical results on instances with up to 25 targets show that our algorithm finds optimal solutions more than an order of magnitude faster than a baseline based on previous work, showing particular strength in scenarios with limited agent capacities.
comment: Accepted to ICAPS 2026
Parallel, Asymptotically Optimal Algorithms for Moving Target Traveling Salesman Problems
The Moving Target Traveling Salesman Problem (MT-TSP) seeks a trajectory that intercepts several moving targets, within a particular time window for each target. When generic nonlinear target trajectories or kinematic constraints on the agent are present, no prior algorithm guarantees convergence to an optimal MT-TSP solution. Therefore, we introduce the Iterated Random Generalized (IRG) TSP framework. The idea behind IRG is to alternate between randomly sampling a set of agent configuration-time points, corresponding to interceptions of targets, and finding a sequence of interception points by solving a generalized TSP (GTSP). This alternation asymptotically converges to the optimum. We introduce two parallel algorithms within the IRG framework. The first algorithm, IRG-PGLNS, solves GTSPs using PGLNS, our parallelized extension of state-of-the-art solver GLNS. The second algorithm, Parallel Communicating GTSPs (PCG), solves GTSPs for several sets of points simultaneously. We present numerical results for three MT-TSP variants: one where intercepting a target only requires coming within a particular distance, another where the agent is a variable-speed Dubins car, and a third where the agent is a robot arm. We show that IRG-PGLNS and PCG converge faster than a baseline based on prior work. We further validate our framework with physical robot experiments.
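The sample-then-solve alternation behind IRG can be sketched on a toy instance. Everything specific here is an assumption made for illustration: targets move in straight lines in the plane, the agent is a point with a speed limit `vmax`, the objective is the time of the last interception, the GTSP is solved by brute force, and the incumbent's points are carried into the next sample set. The paper's solvers (PGLNS, PCG) and its exact objective are not reproduced.

```python
import itertools
import math
import random

def intercept(target, t):
    """Configuration-time point (x, y, t) of a linearly moving target at time t."""
    (x0, y0), (vx, vy), _window = target
    return (x0 + vx * t, y0 + vy * t, t)

def solve_gtsp(clusters, start, vmax):
    """Brute-force GTSP: one point per cluster, any order, minimize the final time."""
    best_cost, best_tour = math.inf, None
    for order in itertools.permutations(range(len(clusters))):
        for tour in itertools.product(*(clusters[i] for i in order)):
            prev, feasible = start, True
            for point in tour:
                dt = point[2] - prev[2]
                # A leg is feasible if the agent can cover the distance in time.
                if dt < 0 or math.dist(prev[:2], point[:2]) > vmax * dt + 1e-9:
                    feasible = False
                    break
                prev = point
            if feasible and prev[2] < best_cost:
                best_cost, best_tour = prev[2], list(zip(order, tour))
    return best_cost, best_tour

def irg(targets, vmax, iterations=20, samples=5, seed=0):
    """Alternate random sampling of interception points with GTSP solves."""
    rng = random.Random(seed)
    start = (0.0, 0.0, 0.0)
    best_cost, best_tour, history = math.inf, None, []
    for _ in range(iterations):
        clusters = [
            [intercept(tg, rng.uniform(*tg[2])) for _ in range(samples)]
            for tg in targets
        ]
        if best_tour is not None:
            # Assumption: keep the incumbent's points so the cost never increases.
            for idx, point in best_tour:
                clusters[idx].append(point)
        cost, tour = solve_gtsp(clusters, start, vmax)
        if cost < best_cost:
            best_cost, best_tour = cost, tour
        history.append(best_cost)
    return best_cost, best_tour, history
```

Each target is given as `((x0, y0), (vx, vy), (t_min, t_max))`. The brute-force solver is only practical for a handful of targets and samples; it stands in for a real GTSP solver.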
RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction
Vision-Language-Action (VLA) models have recently advanced robotic manipulation by translating natural-language instructions and visual observations into control actions. However, existing VLAs are primarily trained on successful expert demonstrations and lack structured supervision for failure diagnosis and recovery, limiting robustness in open-world scenarios. To address this limitation, we propose the Robotic Failure Analysis and Correction (RoboFAC) framework. We construct a large-scale failure-centric dataset comprising 9,440 erroneous manipulation trajectories and 78,623 QA pairs across 53 scenes in both simulation and real-world environments, with systematically categorized failure types. Leveraging this dataset, we develop a lightweight multimodal model specialized for task understanding, failure analysis, and failure correction, enabling efficient local deployment while remaining competitive with large proprietary models. Experimental results demonstrate that RoboFAC achieves a 34.1% higher failure analysis accuracy compared to GPT-4o. Furthermore, we integrated RoboFAC as an external supervisor in a real-world VLA control pipeline, yielding a 29.1% relative improvement across four tasks while significantly reducing latency relative to GPT-4o. These results demonstrate that RoboFAC enables systematic failure diagnosis and recovery, significantly enhancing VLA recovery capabilities. Our model and dataset are publicly available at https://github.com/MINT-SJTU/RoboFAC.
StableTracker: Learning to Stably Track Target via Differentiable Simulation
Existing FPV object tracking methods heavily rely on handcrafted modular pipelines, which incur high onboard computation and cumulative errors. While learning-based approaches have mitigated computational delays, most still generate only high-level trajectories (position and yaw). This loose coupling with a separate controller sacrifices precise attitude control; consequently, even when the target is localized precisely, accurate target estimation does not ensure that the body-fixed camera stays oriented toward the target, so tracking is still likely to degrade and lose a highly maneuvering target. To address these challenges, we present StableTracker, a learning-based control policy that enables quadrotors to robustly follow a moving target from arbitrary viewpoints. The policy is trained using backpropagation-through-time via differentiable simulation, allowing the quadrotor to keep a fixed relative distance while maintaining the target at the center of the visual field in both horizontal and vertical directions, thereby functioning as an autonomous aerial camera. We compare StableTracker against state-of-the-art traditional algorithms and learning baselines. Simulation results demonstrate superior accuracy, stability, and generalization across varying safe distances, trajectories, and target velocities. Furthermore, real-world experiments on a quadrotor with an onboard computer validate the practicality of the proposed approach.
Systems and Control (EESS)
Approximate Dynamic Programming for Degradation-aware Market Participation of Battery Energy Storage Systems: Bridging Market and Degradation Timescales
We present an approximate dynamic programming framework for designing degradation-aware market participation policies for battery energy storage systems. The approach employs a tailored value function approximation that reduces the state space to state of charge and battery health, while performing dynamic programming along a pseudo-time axis encoded by state of health. This formulation enables an offline/online computation split that separates long-term degradation dynamics (months to years) from short-term market dynamics (seconds to minutes) -- a timescale mismatch that renders conventional predictive control and dynamic programming approaches computationally intractable. The main computational effort occurs offline, where the value function is approximated via coarse-grained backward induction along the health dimension. Online decisions then reduce to a real-time tractable one-step predictive control problem guided by the precomputed value function. This decoupling allows the integration of high-fidelity physics-informed degradation models without sacrificing real-time feasibility. Backtests on historical market data show that the resulting policy outperforms several benchmark strategies with optimized hyperparameters.
comment: 11 pages, 4 figures
Multidimensional Opinion Dynamics with Confirmation Bias: A Multi-Layer Framework
We study multidimensional opinion dynamics under confirmation bias in social networks. Each agent holds a vector of correlated opinions across multiple topic layers. Peer interaction is modeled through a static, informationally symmetric social channel, while external information enters through a dynamic, informationally asymmetric source channel. Source influence is described by nonnegative state-dependent functions of agent--source opinion mismatch, which captures confirmation bias without hard thresholds. For general Lipschitz source-influence functions, we give sufficient conditions under which the dynamics are contractive and converge to a unique steady state independent of the initial condition. For affine confirmation-bias functions, we show that the steady state can be computed through a finite sign-consistency search and identify a regime in which it admits a closed form. For broader classes of bounded nonlinear source-influence functions, we derive explicit lower and upper bounds on the fixed point. Numerical examples and a study on a real-world adolescent lifestyle network illustrate the role of multidimensional coupling and show that source-design conclusions can change qualitatively when confirmation bias is ignored.
comment: 12 pages, 9 figures. Submitted to IEEE Transactions on Control of Network Systems (TCNS)
Koopman Meets Discrete-Time Control Barrier Functions: A Linear Model Predictive Control Framework
This paper proposes a Koopman-based linear model predictive control (LMPC) framework for safety-critical control of nonlinear discrete-time systems. Existing MPC formulations based on discrete-time control barrier functions (DCBFs) enforce safety through barrier constraints but typically result in computationally demanding nonlinear programming. To address this challenge, we construct a DCBF-augmented dynamical system and employ Koopman operator theory to lift the nonlinear dynamics into a higher-dimensional space where both the system dynamics and the barrier function admit a linear predictor representation. This enables the transformation of the nonlinear safety-constrained MPC problem into a quadratic program (QP). To improve feasibility while preserving safety, a relaxation mechanism with slack variables is introduced for the barrier constraints. The resulting approach combines the modeling capability of Koopman operators with the computational efficiency of QP. Numerical simulations on a navigation task for a robot with nonlinear dynamics demonstrate that the proposed framework achieves safe trajectory generation and efficient real-time control.
comment: 8 pages, 4 figures
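To make the reduction to a QP in the abstract above concrete, the following worked form shows why a lifted linear predictor turns the barrier constraint into a linear inequality. The symbols $\Psi$, $A$, $B$, $C_h$, $\gamma$, and $s_k$ are assumed notation for illustration; the paper's exact construction of the DCBF-augmented system may differ.

```latex
\begin{align}
  % Lifted state and linear (Koopman) predictor
  z_k &= \Psi(x_k), & z_{k+1} &= A z_k + B u_k, & h(x_k) &\approx C_h z_k, \\
  % Standard discrete-time CBF condition with decay rate 0 < \gamma \le 1
  h(x_{k+1}) &\ge (1 - \gamma)\, h(x_k), \\
  % Substituting the predictor and adding a slack variable s_k \ge 0
  C_h \left( A z_k + B u_k \right) &\ge (1 - \gamma)\, C_h z_k - s_k .
\end{align}
```

The last inequality is linear in $(z_k, u_k, s_k)$, so with a quadratic stage cost and a penalty on $s_k$ the horizon problem is a QP. Setting $s_k = 0$ recovers the unrelaxed barrier constraint.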
Unified Sensitivity-Based Heuristic for Optimal Line Switching and Substation Reconfiguration SC
Optimal transmission switching (OTS) determines which transmission lines to remove from service to minimize dispatch costs. Unlike topology design, it alters the operational status of operating lines. Sensitivity-based methods, as advanced optimization techniques, select lines whose outage yields a significant cost reduction. However, these methods overlook bus splitting, an effective congestion management strategy that our work incorporates to achieve improved economic gains. In this work, we formulate an optimal transmission reconfiguration (OTR) problem that incorporates both line switching and bus splitting. We develop a novel approach to quantify the sensitivity of the OTR objective to line switching and bus splitting, establish connections between the proposed sensitivity framework and existing heuristic metrics, prove the equivalence between bus splitting and a generalized line switching to enable unified treatment, and provide a simpler derivation of Bus Split Distribution Factor (BSDF). Simulations on nine IEEE test systems spanning 118 to 13,659 buses demonstrate the high effectiveness of our proposed sensitivity method. They also demonstrate that incorporating bus splitting into transmission reconfiguration achieves greater cost savings than line switching alone. The results confirm the economic advantage of this comprehensive approach to transmission system operation.
comment: Accepted to PSCC 2026; to appear in a special issue of Electric Power Systems Research
Active-power control strategies in grid-forming power converters to improve transient stability in power systems with 100% converter-based generation
Grid-forming voltage source converters (GFM-VSCs) play a crucial role in the stability of power systems with large amounts of converter-based generation. Transient stability (angle stability under large disturbances) is a critical limiting factor in stressed power systems. Previous studies have proposed control strategies in GFM-VSCs to improve transient stability. These approaches typically rely on suitable current-limiting algorithms and on voltage/reactive-power and active-power supplementary control strategies. This paper investigates and compares the effectiveness of three active-power control strategies in GFM-VSCs to enhance transient stability in power systems with 100% converter-based generation: (i) a wide-area control strategy (TSP-WACS) using the centre of inertia (COI) frequency, (ii) a local transient damping method (TSP-TDM), and (iii) a novel local control strategy (TSP-L) proposed in this work. All strategies were implemented and assessed using short-circuit simulations on the Kundur two-area test system with 100% GFM-VSC generators, demonstrating critical clearing time (CCT) improvement. The TSP-WACS strategy achieves the best performance but requires a communication infrastructure, while the TSP-L strategy offers a simple but robust alternative using only local measurements.
comment: 17 pages
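The centre-of-inertia frequency used by the wide-area strategy is, in its textbook form, an inertia-weighted average of the individual unit frequencies. The sketch below computes it together with each unit's deviation from it. How TSP-WACS turns these deviations into an active-power command is not stated in the abstract and is not modeled here; the function names are illustrative, and for converters the inertia constants would be emulated values on a common base.

```python
def coi_frequency(freqs, inertias):
    """Inertia-weighted average frequency: sum(H_i * f_i) / sum(H_i)."""
    total = sum(inertias)
    return sum(h * f for h, f in zip(inertias, freqs)) / total

def coi_deviations(freqs, inertias):
    """Deviation of each unit's frequency from the COI frequency."""
    f_coi = coi_frequency(freqs, inertias)
    return [f - f_coi for f in freqs]

# Toy usage: two units at 50.2 Hz and 49.8 Hz with inertia constants 3 s and 1 s.
f_coi = coi_frequency([50.2, 49.8], [3.0, 1.0])
```

Computing this signal requires every unit's frequency at one place, which is why the wide-area strategy depends on a communication infrastructure while the local strategies do not.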
Adaptive and robust experimental design for linear dynamical models using Kalman filter
Current experimental design techniques for dynamical systems often only incorporate measurement noise, while dynamical systems also involve process noise. To construct experimental designs we need to quantify their information content. The Fisher information matrix is a popular tool to do so. Calculating the Fisher information matrix for linear dynamical systems with both process and measurement noise involves estimating the uncertain dynamical states using a Kalman filter. The Fisher information matrix, however, depends on the true but unknown model parameters. In this paper we combine two methods to solve this issue and develop a robust experimental design methodology. First, Bayesian experimental design averages the Fisher information matrix over a prior distribution of possible model parameter values. Second, adaptive experimental design allows for this information to be updated as measurements are being gathered. This updated information is then used to adapt the remainder of the design.
Design and Development of Low-Cost Datalogger for Indoor and Outdoor Air Quality Monitoring
The rising demand for low-cost air quality monitors stems from increased public awareness and interest within the research community. These monitors play a pivotal role in empowering citizens and scientists to comprehend spatiotemporal variations in air quality parameters, aiding in the formulation of effective mitigation policies. The primary challenge lies in the diverse array of application scenarios these monitors encounter. The developed data logging device is exceptionally well-suited for air quality monitoring applications, offering exceptional versatility by seamlessly operating on a range of power sources, including solar energy, batteries, and direct electrical supply. The integration of a built-in battery charger enhances its applicability for deployment in regions with solar power or intermittent electricity availability. To ensure strong network connectivity, the advanced datalogger seamlessly integrates with WiFi, Bluetooth, and LoRaWAN networks. A notable feature is its adaptable MCU system, enabling users to swap the MCU based on specific connectivity, power, and computational requirements. Importantly, the system carefully identifies key parameters crucial for both indoor and outdoor air quality assessment, customizing sensor selection accordingly. Furthermore, optimization efforts have prioritized energy efficiency, enabling the system to function with minimal power consumption while maintaining data integrity. Additional I2C and UART ports facilitate the monitoring of supplementary parameters.
High-Endurance UCAV Propulsion System: A 1-D CNN-Based Real-Time Fault Classification for Tactical-Grade IPMSM Drive
High-performance propulsion for mission-critical applications demands unprecedented reliability and real-time fault resilience. Conventional diagnostic methods (signal-based analysis and standard ML models) are essential for stator/rotor fault detection but suffer from high latency and poor generalization across variable speeds. This paper proposes a 1-D Convolutional Neural Network (CNN) framework for real-time fault classification in the HPDM-350 interior permanent magnet synchronous motor (IPMSM). The proposed architecture extracts discriminative features directly from high-frequency current and speed signals, enabling sub-millisecond inference on embedded controllers. Compared to state-of-the-art long short-term memory (LSTM) and classical ML approaches, the 1-D CNN achieves a superior weighted F1-score of 0.9834. Validated through high-fidelity magnetic-domain MATLAB/Simscape models, the method demonstrates robust performance across a ±2700 RPM envelope, providing a lightweight solution for mission-critical electric propulsion systems.
comment: 8 pages, 5 figures, 4 tables. Accepted for a lecture presentation at the 2026 IEEE Intelligent Design and Control of Automation and Drive Systems (IDCD)
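The weighted F1-score quoted in the abstract above is a support-weighted average of per-class F1 values. A minimal sketch of that metric follows, assuming integer class labels; the toy labels are made up for illustration and are not from the paper.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of this class
        score += (support / total) * f1
    return score

# Hypothetical labels: 0 = healthy, 1 = stator fault, 2 = rotor fault.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a rare fault class from dominating the score, but it can also hide poor performance on that class, so per-class F1 is usually worth reporting alongside it.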
Physics-Infused Neural MPC of a DC-DC Boost Converter with Adaptive Transient Recovery and Enhanced Dynamic Stability
DC-DC boost converters require advanced control to ensure efficiency and stability under varying loads. Traditional model predictive control (MPC) and data-driven neural network methods face challenges such as high complexity and limited physical constraint enforcement. This paper proposes a hybrid physics-informed neural network (PINN) combined with finite control set MPC (FCS-MPC) for boost converters. The PINN embeds physical laws into neural training, providing accurate state predictions, while FCS-MPC ensures constraint satisfaction and multi-objective optimization. The method features adaptive transient recovery, explicit duty-ratio control, and enhanced dynamic stability. Experimental results on a commercial boost module demonstrate improved transient response, reduced voltage ripple, and robust operation across conduction modes. The proposed framework offers a computationally efficient, physically consistent solution for real-time control in power electronics.
comment: 7 pages, 3 figures, 1 table. Accepted for a lecture presentation at the 2026 IEEE Intelligent Design and Control of Automation and Drive Systems (IDCD)
MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control
For stabilizing control tasks, model-free reinforcement learning (RL) approaches face numerous challenges, particularly regarding the issues of effectiveness and efficiency in complex high-dimensional environments with limited training data. To address these challenges, we propose Multi-Step Actor-Critic Learning with Lyapunov Certificates (MSACL), a novel approach that integrates exponential stability into off-policy maximum entropy reinforcement learning (MERL). In contrast to existing RL-based approaches that depend on elaborate reward engineering and single-step constraints, MSACL adopts intuitive reward design and exploits multi-step samples to enable exploratory actor-critic learning. Specifically, we first introduce Exponential Stability Labels (ESLs) to categorize training samples and propose a $λ$-weighted aggregation mechanism to learn Lyapunov certificates. Based on these certificates, we further design a stability-aware advantage function to guide policy optimization, thereby promoting rapid Lyapunov descent and robust state convergence. We evaluate MSACL across six benchmarks, comprising four stabilizing and two high-dimensional tracking tasks. Experimental results demonstrate its consistent performance improvements over both standard RL baselines and state-of-the-art Lyapunov-based RL algorithms. Beyond rapid convergence, MSACL exhibits robustness against environmental uncertainties and generalization to unseen reference signals. The source code and benchmarking environments are available at https://github.com/YuanZhe-Xing/MSACL.
comment: This work has been submitted to the IEEE for possible publication
The value of storage in electricity distribution: The role of markets
Electricity distribution companies deploy battery storage to defer grid upgrades by reducing peak demand. In deregulated jurisdictions, such storage often sits idle because regulatory constraints bar participation in electricity markets. Here, we develop an optimization framework that, to our knowledge, provides the first formal model of market participation constraints within storage investment and operation planning. Applying the framework to a Massachusetts case study, we find that market participation delivers similar savings as peak demand reduction. Under current conditions, market participation does not increase storage investment, but at very low storage costs, could incentivize deployment beyond local distribution needs. This might run contrary to the separation of distribution from generation in deregulated markets. Our framework can mitigate this concern by identifying investment levels appropriate for local distribution needs.
Data-driven Implementations of Various Generalizations of Balanced Truncation
Quadrature-based approximation of Gramians in standard balanced truncation yields a non-intrusive, data-driven implementation that requires only transfer function samples on the imaginary axis, which can be measured experimentally. This idea has recently been extended to several generalizations of balanced truncation, including positive-real balanced truncation, bounded-real balanced truncation, and balanced stochastic truncation. However, these extensions require samples of some spectral factorizations on the imaginary axis, and no practical method exists to obtain such data experimentally. As a result, these non-intrusive implementations are mainly of theoretical interest at present. This paper shows that if the Gramians in these generalizations are approximated via rational interpolation rather than numerical integration, the resulting non-intrusive implementations do not require spectral factorization samples. Instead, they rely only on transfer function samples. Based on this idea, non-intrusive implementations are first developed for several variants of balanced truncation, wherein the Gramians are approximated implicitly using low-rank Alternating Direction Implicit (ADI) methods for Lyapunov and Riccati equations. These formulations require transfer function samples in the right half of the \(s\)-plane, which cannot be measured experimentally. Next, building on these results, novel data-driven non-intrusive implementations are proposed that require only transfer function samples on the imaginary axis. Hence, unlike the quadrature-based and ADI-based approaches, these non-intrusive formulations can be implemented using practically measurable data. Numerical results are presented for benchmark problems in model order reduction, which show that the proposed non-intrusive implementations achieve accuracy comparable to their intrusive counterparts.
The potential and viability of V2G for California BEV drivers
Vehicle-to-Grid (V2G) adoption is hindered by uncertainties regarding its effects on battery lifetime and vehicle usability. These uncertainties are compounded by limited insight into real-world vehicle usage. Here, we leverage real-world Californian BEV usage data to design and evaluate a user-centric V2G strategy. We identified four clustered driver profiles for V2G assessment, ranging from "Daily Chargers" to "Public Chargers". We show that V2G participation is most feasible for "Daily Chargers," and that the effects on battery lifetime depend on calendar aging sensitivity. For batteries with low sensitivity, V2G participation increases capacity loss for all drivers. However, for batteries with high sensitivity, V2G participation can lead to negligible changes in capacity or even improved capacity retention, particularly for drivers who tend to keep their batteries at high states of charge. Our findings enable stakeholders to better assess the potential and viability of V2G adoption.
comment: Minor revisions
Characterizing State Space Model and Hybrid Language Model Performance with Long Context
Emerging applications such as AR are driving demand for machine intelligence capable of processing continuous and/or long-context inputs on local devices. However, currently dominant models based on the Transformer architecture suffer from quadratic computational and memory overhead, which hinders applications that must process long contexts. This has spurred a paradigm shift towards new architectures like State Space Models (SSMs) and SSM-Transformer hybrid models, which provide near-linear scaling. The near-linear scaling enabled efficient handling of millions of tokens while delivering high performance in recent studies. Although such works present promising results, their workload characteristics in terms of computational performance and hardware resource requirements are not yet thoroughly explored, which limits our understanding of their implications for system-level optimizations. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformers, SSMs, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis shows that SSMs are well-suited for on-device AI on consumer and embedded GPUs for long-context inference. While Transformers are up to 1.9x faster at short sequences (<8K tokens), SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens), thanks to their linear computational complexity and ~64% reduced memory footprint. Our operator-level analysis reveals that custom SSM kernels like selective scan, despite being hardware-aware to minimize memory IO, dominate the inference runtime on edge platforms, accounting for over 55% of latency due to their sequential, element-wise nature. SSM-Scope is open-sourced at https://github.com/sapmitra/ssm-scope
comment: 13 pages, 7 figures
Computational Concept of the Psyche
This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject's being in relation to stimuli from the external world, and intelligence as a decision-making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes the agent's needs, taking into account their biological or existential significance for the intelligent agent, along with the agent's sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.
comment: 19 pages, 5 figures
Multiagent Systems
Personality-Driven Student Agent-Based Modeling in Mathematics Education: How Well Do Student Agents Align with Human Learners?
It is crucial to explore the impact of different teaching methods on student learning in educational research. However, real-person experiments face significant ethical constraints, and we cannot conduct repeated teaching experiments on the same student. LLM-based generative agents offer a promising avenue for simulating student behavior. Before large-scale experiments, a fundamental question must be addressed: are student agents truly credible, and can they faithfully simulate human learning? In this study, we built a Big Five Personality-based student agent model with a full pipeline of student-teacher interaction, self-study, and examination. To evaluate behavioral fidelity, we collected 13 empirical studies on Big Five traits and learning, and distilled them into 14 criteria. We found that 71.4% of the student agents' behavior was aligned with human learners.
comment: Short Paper
Architecture for Multi-Unmanned Aerial Vehicles based Autonomous Precision Agriculture Systems
The use of unmanned aerial vehicles (UAVs) in precision agriculture has seen a huge increase recently. As such, systems that aim to apply various algorithms in the field need a structured framework of abstractions. This paper defines the various tasks of UAVs in precision agriculture and models them in an architectural framework. The presented architecture assumes that multiple coordinated and cooperative UAVs carry out the defined tasks with minimal physical intervention. Various tasks such as image processing, path planning, communication, data acquisition, and field mapping are employed in the architecture to provide an efficient system. In addition, different limitations of applying multi-UAV systems in precision agriculture have been considered in designing the architecture. The architecture provides an autonomous end-to-end solution, spanning mission planning, data acquisition, and image processing, that is highly efficient and can enable farmers to comprehensively deploy UAVs on their land. Simulation and field tests show that the architecture offers a number of advantages, including fault tolerance, robustness, and developer- and user-friendliness.
Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains
An autonomous AI ecosystem (SUBSTRATE S3), generating product specifications without explicit instructions about formal methods, independently proposed the use of the Z3 SMT solver across six distinct domains of AI safety: verification of LLM-generated code, tool API safety for AI agents, post-distillation reasoning correctness, CLI command validation, hardware assembly verification, and smart contract safety. These convergent discoveries, occurring across 8 products over 13 days with Jaccard similarity below 15% between variants, suggest that formal verification is not merely a useful technique for AI safety but an emergent property of any sufficiently complex system reasoning about its own safety. We propose a unified framework (substrate-guard) that applies Z3-based verification across all six output classes through a common API, and evaluate it on 181 test cases across five implemented domains, achieving 100% classification accuracy with zero false positives and zero false negatives. Our framework detected real bugs that empirical testing would miss, including an INT_MIN overflow in branchless RISC-V assembly, and mathematically proved that unconstrained string parameters in tool APIs are formally unverifiable.
comment: 10 pages, 3 figures, 5 tables. Code: https://github.com/octavuntila-prog/substrate-guard. Companion paper: https://doi.org/10.5281/zenodo.19157571
Persona Alchemy: Designing, Evaluating, and Implementing Psychologically-Grounded LLM Agents for Diverse Stakeholder Representation ICLR 2026
Despite advances in designing personas for Large Language Models (LLM), challenges remain in aligning them with human cognitive processes and representing diverse stakeholder perspectives. We introduce a Social Cognitive Theory (SCT) agent design framework for designing, evaluating, and implementing psychologically grounded LLMs with consistent behavior. Our framework operationalizes SCT through four personal factors (cognitive, motivational, biological, and affective) for designing, six quantifiable constructs for evaluating, and a graph database-backed architecture for implementing stakeholder personas. Experiments tested agents' responses to contradicting information of varying reliability. In the highly polarized renewable energy transition discourse, we design five diverse agents with distinct ideologies, roles, and stakes to examine stakeholder representation. The evaluation of these agents in contradictory scenarios occurs through comprehensive processes that implement the SCT. Results show consistent response patterns ($R^2$ range: $0.58-0.61$) and systematic temporal development of SCT construct effects. Principal component analysis identifies two dimensions explaining $73$% of variance, validating the theoretical structure. Our framework offers improved explainability and reproducibility compared to black-box approaches. This work contributes to ongoing efforts to improve diverse stakeholder representation while maintaining psychological consistency in LLM personas.
comment: Accepted at ICLR 2026 Algorithmic Fairness Across Alignment Procedures and Agentic Systems (AFAA) Workshop
Robotics
Implementing Robust M-Estimators with Certifiable Factor Graph Optimization ICRA 2026
Parameter estimation in robotics and computer vision faces formidable challenges from both outlier contamination and nonconvex optimization landscapes. While M-estimation addresses the problem of outliers through robust loss functions, it creates severely nonconvex problems that are difficult to solve globally. Adaptive reweighting schemes provide one particularly appealing strategy for implementing M-estimation in practice: these methods solve a sequence of simpler weighted least squares (WLS) subproblems, enabling both the use of standard least squares solvers and the recovery of higher-quality estimates than simple local search. However, adaptive reweighting still crucially relies upon solving the inner WLS problems effectively, a task that remains challenging in many robotics applications due to the intrinsic nonconvexity of many common parameter spaces (e.g. rotations and poses). In this paper, we show how one can easily implement adaptively reweighted M-estimators with certifiably correct solvers for the inner WLS subproblems using only fast local optimization over smooth manifolds. Our approach exploits recent work on certifiable factor graph optimization to provide global optimality certificates for the inner WLS subproblems while seamlessly integrating into existing factor graph-based software libraries and workflows. Experimental evaluation on pose-graph optimization and landmark SLAM tasks demonstrates that our adaptively reweighted certifiable estimation approach provides higher-quality estimates than alternative local search-based methods, while scaling tractably to realistic problem sizes.
comment: The paper was accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
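The adaptive reweighting idea in the abstract above (a sequence of weighted least squares subproblems in place of one nonconvex robust problem) can be illustrated on a toy scalar problem. This is only a sketch under stated assumptions: it uses the Huber loss and a closed-form weighted mean as the inner WLS solve, whereas the paper targets factor graphs with certifiable solvers. The names `huber_weight` and `irls_location` are hypothetical.

```python
def huber_weight(residual, delta):
    """IRLS weight for the Huber loss: 1 inside the inlier band, delta/|r| outside."""
    magnitude = abs(residual)
    return 1.0 if magnitude <= delta else delta / magnitude

def irls_location(samples, delta=1.0, iters=50, tol=1e-9):
    """Robust scalar location estimate via iteratively reweighted least squares."""
    estimate = sum(samples) / len(samples)  # ordinary least-squares initialization
    for _ in range(iters):
        weights = [huber_weight(x - estimate, delta) for x in samples]
        # Inner WLS subproblem; in this toy setting it is a weighted mean.
        updated = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        converged = abs(updated - estimate) < tol
        estimate = updated
        if converged:
            break
    return estimate

samples = [0.9, 1.0, 1.1, 1.05, 0.95, 25.0]  # five inliers near 1.0, one gross outlier
mean_estimate = sum(samples) / len(samples)
robust_estimate = irls_location(samples)
```

In this example the plain mean is pulled to 5.0 by the outlier, while the reweighted estimate settles near 1.2. The inner solve here is trivially global; for rotations and poses it is not, which is the gap the paper addresses.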
Characterizing the onset and offset of motor imagery during passive arm movements induced by an upper-body exoskeleton IROS 2023
Two distinct technologies have gained attention lately due to their prospects for motor rehabilitation: robotics and brain-machine interfaces (BMIs). Harnessing their combined efforts is a largely uncharted and promising direction that has immense clinical potential. However, a significant challenge is whether motor intentions from the user can be accurately detected using non-invasive BMIs in the presence of instrumental noise and passive movements induced by the rehabilitation exoskeleton. As an alternative to the straightforward continuous control approach, this study instead aims to characterize the onset and offset of motor imagery during passive arm movements induced by an upper-body exoskeleton to allow for the natural control (initiation and termination) of functional movements. Ten participants were recruited to perform kinesthetic motor imagery (MI) of the right arm while attached to the robot, simultaneously cued with LEDs indicating the initiation and termination of a goal-oriented reaching task. Using electroencephalogram signals, we built a decoder to detect the transition between i) rest and beginning MI and ii) maintaining and ending MI. Offline decoder evaluation achieved group average onset accuracy of 60.7% and 66.6% for offset accuracy, revealing that the start and stop of MI could be identified while attached to the robot. Furthermore, pseudo-online evaluation could replicate this performance, forecasting reliable online exoskeleton control in the future. Our approach showed that participants could produce quality and reliable sensorimotor rhythms regardless of noise or passive arm movements induced by wearing the exoskeleton, which opens new possibilities for BMI control of assistive devices.
comment: Accepted to IROS 2023. 6 pages, 6 figures. Project page available at https://mitrakanishka.github.io/projects/passive-arm-mi/
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves CVPR 2026
Understanding hand-object interaction (HOI) is fundamental to computer vision, robotics, and AR/VR. However, conventional hand videos often lack essential physical information such as contact forces and motion signals, and are prone to frequent occlusions. To address the challenges, we present Glove2Hand, a framework that translates multi-modal sensing glove HOI videos into photorealistic bare hands, while faithfully preserving the underlying physical interaction dynamics. We introduce a novel 3D Gaussian hand model that ensures temporal rendering consistency. The rendered hand is seamlessly integrated into the scene using a diffusion-based hand restorer, which effectively handles complex hand-object interactions and non-rigid deformations. Leveraging Glove2Hand, we create HandSense, the first multi-modal HOI dataset featuring glove-to-hand videos with synchronized tactile and IMU signals. We demonstrate that HandSense significantly enhances downstream bare-hand applications, including video-based contact estimation and hand tracking under severe occlusion.
comment: CVPR 2026
Swim2Real: VLM-Guided System Identification for Sim-to-Real Transfer
We present Swim2Real, a pipeline that calibrates a 16-parameter robotic fish simulator from swimming videos using vision-language model (VLM) feedback, requiring no hand-designed search stages. Calibrating soft aquatic robots is particularly challenging because nonlinear fluid-structure coupling makes the parameter landscape chaotic, simplified fluid models introduce a persistent sim-to-real gap, and controlled aquatic experiments are difficult to reproduce. Prior work on this platform required three manually tailored stages to handle this complexity. The VLM compares simulated and real videos and proposes parameter updates. A backtracking line search then validates each step size, tripling the accept rate from 14% to 42% by recovering proposals where the direction is correct but the magnitude is too large. Swim2Real calibrates all 16 parameters simultaneously, most closely matching real fish velocities across all motor frequencies (MAE = 7.4 mm/s, 43% lower than the next-best method), with zero outlier seeds across five runs. Motor commands from the trained policy transfer to the physical fish at 50 Hz, completing the pipeline from swimming video to real-world deployment. Downstream RL policies swim 12% farther than those from BayesOpt-calibrated simulators and 90% farther than CMA-ES. These results demonstrate that VLM-guided calibration can close the sim-to-real gap for aquatic robots directly from video, enabling zero-shot RL transfer to physical swimmers without manual system identification, a step toward automated, general-purpose simulator tuning for underwater robotics.
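The step-size validation described in the abstract above can be sketched as a simple backtracking loop: a proposed parameter update is shrunk until it improves the loss, which rescues proposals with the right direction but too large a magnitude. This is a hedged illustration; `backtracking_accept`, the shrink factor, the halving budget, and the toy quadratic loss are assumptions, not the authors' implementation.

```python
def backtracking_accept(loss_fn, params, proposal, shrink=0.5, max_shrinks=4):
    """Try the proposed update at decreasing scales; accept the first that lowers the loss."""
    base_loss = loss_fn(params)
    scale = 1.0
    for _ in range(max_shrinks + 1):
        candidate = [p + scale * d for p, d in zip(params, proposal)]
        if loss_fn(candidate) < base_loss:
            return candidate, scale
        scale *= shrink
    return list(params), 0.0  # reject: no tested scale improved the loss

def toy_loss(params):
    """Stand-in for a sim-to-real mismatch score (hypothetical target values)."""
    target = [1.0, -2.0]
    return sum((p - t) ** 2 for p, t in zip(params, target))

start = [0.0, 0.0]
overshoot = [4.0, -8.0]  # correct direction, four times too large
new_params, used_scale = backtracking_accept(toy_loss, start, overshoot)
```

Here the full step and the half step are both rejected, and the quarter step is accepted, so a proposal that a plain accept/reject test would discard still contributes.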
Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation
Vision-Language Navigation (VLN) systems are fundamentally constrained by partial observability, as an agent can only accumulate knowledge from locations it has personally visited. As multiple robots increasingly coexist in shared environments, a natural question arises: can agents navigating the same space benefit from each other's observations? In this work, we introduce Co-VLN, a minimalist, model-agnostic framework for systematically investigating whether and how peer observations from concurrently navigating agents can benefit VLN. When independently navigating agents identify common traversed locations, they exchange structured perceptual memory, effectively expanding each agent's receptive field at no additional exploration cost. We validate our framework on the R2R benchmark under two representative paradigms (the learning-based DUET and the zero-shot MapGPT), and conduct extensive analytical experiments to systematically reveal the underlying dynamics of peer observation sharing in VLN. Results demonstrate that the vision-sharing-enabled model yields substantial performance improvements across both paradigms, establishing a strong foundation for future research in collaborative embodied navigation.
RoboECC: Multi-Factor-Aware Edge-Cloud Collaborative Deployment for VLA Models IJCNN 2026
Vision-Language-Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) deployment offers an effective fix by easing edge-device computing pressure to meet real-time needs. However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) Diverse model structures hinder optimal ECC segmentation point identification; (2) Even if the optimal split point is determined, changes in network bandwidth can cause performance drift. To address these issues, we propose a novel ECC deployment framework for various VLA models, termed RoboECC. Specifically, we propose a model-hardware co-aware segmentation strategy to help find the optimal segmentation point for various VLA models. Moreover, we propose a network-aware deployment adjustment approach to adapt to the network fluctuations for maintaining optimal performance. Experiments demonstrate that RoboECC achieves a speedup of up to 3.28x with only 2.55x~2.62x overhead.
comment: This paper has been accepted by IJCNN 2026
Enhancing Vision-Based Policies with Omni-View and Cross-Modality Knowledge Distillation for Mobile Robots
Vision-based policies are widely applied in robotics for tasks such as manipulation and locomotion. On lightweight mobile robots, however, they face a trilemma of limited scene transferability, restricted onboard computation resources, and sensor hardware cost. To address these issues, we propose a knowledge distillation approach that transfers knowledge from an information-rich, appearance-invariant omni-view depth policy to a lightweight monocular policy. The key idea is to train the student not only to mimic the expert actions but also to align with the latent embeddings of the omni-view depth teacher. Experiments demonstrate that omni-view and depth inputs improve scene transfer and navigation performance, and that the proposed distillation method enhances the performance of a single-view monocular policy compared with policies that solely imitate actions. Real-world experiments further validate the effectiveness and practicality of our approach. Code will be released publicly.
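The two-part training signal in the abstract above (imitate the expert's actions and align with the teacher's latent embedding) can be written as a weighted sum of two terms. The sketch below assumes mean squared error for both terms and a hypothetical weight `latent_weight`; the abstract does not specify the actual loss functions or weighting.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_action, expert_action,
                      student_latent, teacher_latent, latent_weight=0.5):
    """Action imitation term plus a latent alignment term (assumed MSE for both)."""
    action_term = mse(student_action, expert_action)
    latent_term = mse(student_latent, teacher_latent)
    return action_term + latent_weight * latent_term

# Toy vectors chosen for easy arithmetic, not taken from the paper.
loss = distillation_loss(
    student_action=[0.5, 0.0], expert_action=[1.0, 0.0],
    student_latent=[1.0, 2.0, 3.0, 4.0], teacher_latent=[1.0, 2.0, 3.0, 2.0],
)
```

Setting `latent_weight` to zero recovers plain action imitation, which is the baseline the abstract compares against.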
ToFormer: Towards Large-scale Scenario Depth Completion for Lightweight ToF Camera
Time-of-Flight (ToF) cameras combine a compact design with high measurement precision, making them suitable for various robot tasks. However, their limited sensing range restricts deployment in large-scale scenarios. Depth completion has emerged as a potential solution to expand the sensing range of ToF cameras, but existing research lacks dedicated datasets and struggles to generalize to ToF measurements. In this paper, we propose a full-stack framework that enables depth completion in large-scale scenarios for short-range ToF cameras. First, we construct a multi-sensor platform with a reconstruction-based pipeline to collect real-world ToF samples with dense large-scale ground truth, yielding the first LArge-ScalE scenaRio ToF depth completion dataset (LASER-ToF). Second, we propose a sensor-aware depth completion network that incorporates a novel 3D branch with a 3D-2D Joint Propagation Pooling (JPP) module and Multimodal Cross-Covariance Attention (MXCA), enabling effective modeling of long-range relationships and efficient 3D-2D fusion under non-uniform ToF depth sparsity. Moreover, our network can utilize the sparse point cloud from visual SLAM as a supplement to ToF depth to further improve prediction accuracy. Experiments show that our method achieves an 8.6% lower mean absolute error than the second-best method, while maintaining a lightweight design to support onboard deployment. Finally, to verify the system's applicability on real robots, we deploy the proposed method on a quadrotor at a 10 Hz runtime, enabling reliable large-scale mapping and long-range planning in challenging environments for short-range ToF cameras.
comment: 17 pages, 15 figures
ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems
The development of embodied AI systems is increasingly constrained by the availability and structure of physical interaction data. Despite recent advances in vision-language-action (VLA) models, current pipelines suffer from high data collection cost, limited cross-embodiment alignment, and poor transfer from internet-scale visual data to robot control. We propose a region-of-interest (ROI) driven engineering workflow that introduces an egocentric, geometry-grounded data representation. By projecting end-effector poses via forward kinematics (FK) into a single external camera, we derive movement-aligned hand-centric ROIs without requiring wrist-mounted cameras or multi-view systems. Unlike directly downsampling the full frame, ROI is cropped from the original image before resizing, preserving high local information density for contact-critical regions while retaining global context. We present a reproducible pipeline covering calibration, synchronization, ROI generation, deterministic boundary handling, and metadata governance. The resulting representation is embodiment-aligned and viewpoint-normalized, enabling data reuse across heterogeneous robots. We argue that egocentric ROI serves as a practical data abstraction for scalable collection and cross-embodiment learning, bridging internet-scale perception and robot-specific control.
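The ROI generation step in the abstract above can be sketched in two parts: project the end-effector position into the external camera with a pinhole model, then cut a fixed-size crop around it with deterministic boundary handling. This sketch assumes the forward-kinematics result has already been transformed into the camera frame, and the intrinsics, ROI size, and shift-to-fit boundary rule are all illustrative assumptions rather than the paper's settings.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame, metres) to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy

def roi_box(center_uv, roi_size, image_w, image_h):
    """Fixed-size square ROI, shifted (not shrunk) so it stays inside the image."""
    half = roi_size // 2
    left = int(round(center_uv[0])) - half
    top = int(round(center_uv[1])) - half
    left = max(0, min(left, image_w - roi_size))
    top = max(0, min(top, image_h - roi_size))
    return left, top, left + roi_size, top + roi_size

# Hypothetical end-effector position and camera intrinsics for a 640x480 image.
u, v = project_point((0.1, -0.05, 0.5), fx=600.0, fy=600.0, cx=320.0, cy=240.0)
box = roi_box((u, v), roi_size=128, image_w=640, image_h=480)
edge_box = roi_box((630.0, 10.0), roi_size=128, image_w=640, image_h=480)
```

Shifting the box at the border keeps every crop the same size, so the later resize step applies a constant scale. Padding is another reasonable rule; whichever is chosen should be recorded in the metadata so crops stay comparable across robots.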
E-SocialNav: Efficient Socially Compliant Navigation with Language Models
Language models (LMs) are increasingly applied to robotic navigation; however, existing benchmarks primarily emphasize navigation success rates while paying limited attention to social compliance. Moreover, relying on large-scale LMs can raise efficiency concerns, as their heavy computational overhead leads to slower response times and higher energy consumption, making them impractical for real-time deployment on resource-constrained robotic platforms. In this work, we evaluate the social compliance of GPT-4o and Claude in robotic navigation and propose E-SocialNav, an efficient LM designed for socially compliant navigation. Despite being trained on a relatively small dataset, E-SocialNav consistently outperforms zero-shot baselines in generating socially compliant behaviors. By employing a two-stage training pipeline consisting of supervised fine-tuning followed by direct preference optimization, E-SocialNav achieves strong performance in both text-level semantic similarity to human annotations and action accuracy. The source code is available at https://github.com/Dr-LingXiao/ESocialNav.
comment: Accepted by 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing, to appear. Preprint version
StageCraft: Execution Aware Mitigation of Distractor and Obstruction Failures in VLA Models
Large scale pre-training on text and image data along with diverse robot demonstrations has helped Vision Language Action models (VLAs) to generalize to novel tasks, objects and scenes. However, these models are still susceptible to failure in the presence of execution-time impediments such as distractors and physical obstructions in the robot's workspace. Existing policy improvement methods finetune base VLAs to improve generalization, yet they still struggle in unseen distractor settings. To address this problem, we investigate whether internet-scale pretraining of large vision-language models (VLMs) can be leveraged to reason about these impediments and mitigate policy failures. To this end, we propose StageCraft, a training-free approach to improve pretrained VLA policy performance by manipulating the environment's initial state using VLM-based in-context reasoning. StageCraft takes policy rollout videos and success labels as input and leverages VLM's reasoning ability to infer which objects in the initial state need to be manipulated to avoid anticipated execution failures. StageCraft is an extensible plug-and-play module that does not introduce additional constraints on the underlying policy, and only requires a few policy rollouts to work. We evaluate performance of state-of-the-art VLA models with StageCraft and show an absolute 40% performance improvement across three real world task domains involving diverse distractors and obstructions. Our simulation experiments in RLBench empirically show that StageCraft tailors its extent of intervention based on the strength of the underlying policy and improves its performance with more in-context samples. Videos of StageCraft in effect can be found at https://stagecraft-decorator.github.io/stagecraft/ .
Speedup Patch: Learning a Plug-and-Play Policy to Accelerate Embodied Manipulation
While current embodied policies exhibit remarkable manipulation skills, their execution remains unsatisfactorily slow as they inherit the tardy pacing of human demonstrations. Existing acceleration methods typically require policy retraining or costly online interactions, limiting their scalability for large-scale foundation models. In this paper, we propose Speedup Patch (SuP), a lightweight, policy-agnostic framework that enables plug-and-play acceleration using solely offline data. SuP introduces an external scheduler that adaptively downsamples action chunks provided by embodied policies to eliminate redundancies. Specifically, we formalize the optimization of our scheduler as a Constrained Markov Decision Process (CMDP) aimed at maximizing efficiency without compromising task performance. Since direct success evaluation is infeasible in offline settings, SuP introduces World Model based state deviation as a surrogate metric to enforce safety constraints. By leveraging a learned world model as a virtual evaluator to predict counterfactual trajectories, the scheduler can be optimized via offline reinforcement learning. Empirical results on simulation benchmarks (Libero, Bigym) and real-world tasks validate that SuP achieves an overall 1.8x execution speedup for diverse policies while maintaining their original success rates.
Towards Practical World Model-based Reinforcement Learning for Vision-Language-Action Models
Vision-Language-Action (VLA) models show strong generalization for robotic control, but finetuning them with reinforcement learning (RL) is constrained by the high cost and safety risks of real-world interaction. Training VLA models in interactive world models avoids these issues but introduces several challenges, including pixel-level world modeling, multi-view consistency, and compounding errors under sparse rewards. Building on recent advances across large multimodal models and model-based RL, we propose VLA-MBPO, a practical framework to tackle these problems in VLA finetuning. Our approach has three key design choices: (i) adapting unified multimodal models (UMMs) for data-efficient world modeling; (ii) an interleaved view decoding mechanism to enforce multi-view consistency; and (iii) chunk-level branched rollout to mitigate error compounding. Theoretical analysis and experiments across simulation and real-world tasks demonstrate that VLA-MBPO significantly improves policy performance and sample efficiency, underscoring its robustness and scalability for real-world robotic deployment.
GHOST: Ground-projected Hypotheses from Observed Structure-from-Motion Trajectories
We present a scalable self-supervised approach for segmenting feasible vehicle trajectories from monocular images for autonomous driving in complex urban environments. Leveraging large-scale dashcam videos, we treat recorded ego-vehicle motion as implicit supervision and recover camera trajectories via monocular structure-from-motion, projecting them onto the ground plane to generate spatial masks of traversed regions without manual annotation. These automatically generated labels are used to train a deep segmentation network that predicts motion-conditioned path proposals from a single RGB image at run time, without explicit modeling of road or lane markings. Trained on diverse, unconstrained internet data, the model implicitly captures scene layout, lane topology, and intersection structure, and generalizes across varying camera configurations. We evaluate our approach on NuScenes, demonstrating reliable trajectory prediction, and further show transfer to an electric scooter platform through light fine-tuning. Our results indicate that large-scale ego-motion distillation yields structured and generalizable path proposals beyond the demonstrated trajectory, enabling trajectory hypothesis estimation via image segmentation.
comment: 8 pages, 27 figures, 1 table
Unified Orbit-Attitude Estimation and Sensor Tasking Framework for Autonomous Cislunar Space Domain Awareness Using Multiplicative Unscented Kalman Filter
The cislunar regime departs from near-Earth orbital behavior through strongly non-linear, non-Keplerian dynamics, which adversely affect the accuracy of uncertainty propagation and state estimation. Additional challenges arise from long-range observation requirements, restrictive sensor-target geometry and illumination conditions, the need to monitor an expansive cislunar volume, and the large design space associated with space/ground-based sensor placement. In response to these challenges, this work introduces an advanced framework for cislunar space domain awareness (SDA) encompassing two key tasks: (1) observer architecture optimization based on a realistic cost formulation that captures key performance trade-offs, solved using the Tree of Parzen Estimators algorithm, and (2) leveraging the resulting observer architecture, a mutual information-driven sensor tasking optimization is performed at discrete tasking intervals, while orbital and attitude state estimation is carried out at a finer temporal resolution between successive tasking updates using an error-state multiplicative unscented Kalman filter. Numerical simulations demonstrate that our approach in Task 1 yields observer architectures that achieve significantly lower values of the proposed cost function than baseline random-search solutions, while using fewer sensors. Task 2 results show that translational state estimation remains satisfactory over a wide range of target-to-observer count ratios, whereas attitude estimation is significantly more sensitive to target-to-observer ratios and tasking intervals, with increased rotational-state divergence observed for high target counts and infrequent tasking updates. These results highlight important trade-offs between sensing resources, tasking cadence, and achievable state estimation performance that influence the scalability of autonomous cislunar SDA.
LASER: Level-Based Asynchronous Scheduling and Execution Regime for Spatiotemporally Constrained Multi-Robot Timber Manufacturing ICRA 2026
Automating large-scale manufacturing in domains like timber construction requires multi-robot systems to manage tightly coupled spatiotemporal constraints, such as collision avoidance and process-driven deadlines. This paper introduces LASER (Level-based Asynchronous Scheduling and Execution Regime), a complete framework for scheduling and executing complex assembly tasks, demonstrated on a screw-press gluing application for timber slab manufacturing. Our central contribution is to integrate a barrier-based mechanism into a constraint programming (CP) scheduling formulation that partitions tasks into spatiotemporally disjoint sets, which we define as levels. This structure enables robots to execute tasks in parallel and asynchronously within a level, synchronizing only at level barriers, which guarantees collision-free operation by construction and provides robustness to timing uncertainties. To solve this formulation for large problems, we propose two specialized algorithms: an iterative temporal-relaxation approach for heterogeneous task sequences and a bi-level decomposition for homogeneous tasks that balances workload. We validate the LASER framework by fabricating a full-scale 2.4m x 6m timber slab with a two-robot system mounted on parallel linear tracks, successfully coordinating 108 subroutines and 352 screws under tight adhesive time windows. Computational studies show our method scales steadily with size compared to a monolithic approach.
comment: to be published in ICRA 2026. Supplementary video: https://youtu.be/EG1GCOX3zT4?si=4mNuQS0QWAo6RDZp
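The level-and-barrier execution idea described in the LASER abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `run_levels`, the use of a thread pool to stand in for robots, and the representation of tasks as plain callables are all assumptions made for illustration. Tasks inside one level run in parallel and asynchronously, and execution only synchronizes at the end of each level.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_levels(levels, num_robots=2):
    """Run each level's tasks in parallel, synchronizing only between levels.

    `levels` is a list of lists of zero-argument callables (hypothetical
    stand-ins for robot subroutines). Returns (level_index, results) pairs.
    """
    log = []
    with ThreadPoolExecutor(max_workers=num_robots) as pool:
        for index, level in enumerate(levels):
            # Tasks within a level are assumed spatiotemporally disjoint,
            # so they may start and finish in any order.
            futures = [pool.submit(task) for task in level]
            # Level barrier: no task of the next level starts until
            # every task of this level has finished.
            wait(futures)
            log.append((index, [f.result() for f in futures]))
    return log


if __name__ == "__main__":
    plan = [[lambda: "press A", lambda: "press B"], [lambda: "screw C"]]
    print(run_levels(plan))
```

In this sketch the barrier is what makes timing uncertainty harmless: a slow task delays only the start of the next level, never the ordering between levels.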
Current state of the multi-agent multi-view experimental and digital twin rendezvous (MMEDR-Autonomous) framework
As near-Earth resident space objects proliferate, there is an increasing demand for reliable technologies in applications of on-orbit servicing, debris removal, and orbit modification. Rendezvous and docking are critical mission phases for such applications and can benefit from greater autonomy to reduce operational complexity and human workload. Machine learning-based methods can be integrated within the guidance, navigation, and control (GNC) architecture to design a robust rendezvous and docking framework. In this work, the Multi-Agent Multi-View Experimental and Digital Twin Rendezvous (MMEDR-Autonomous) is introduced as a unified framework comprising a learning-based optical navigation network, a reinforcement learning-based guidance approach under ongoing development, and a hardware-in-the-loop testbed. Navigation employs a lightweight monocular pose estimation network with multi-scale feature fusion, trained on realistic image augmentations to mitigate domain shift. The guidance component is examined with emphasis on learning stability, reward design, and systematic hyperparameter tuning under mission-relevant constraints. Prior Control Barrier Function results for Clohessy-Wiltshire dynamics are reviewed as a basis for enforcing safety and operational constraints and for guiding future nonlinear controller design within the MMEDR-Autonomous framework. The MMEDR-Autonomous framework is currently progressing toward integrated experimental validation in multi-agent rendezvous scenarios.
Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance
With the growth of intelligent civil infrastructure and smart cities, operation and maintenance (O&M) increasingly requires safe, efficient, and energy-conscious robotic manipulation of articulated components, including access doors, service drawers, and pipeline valves. However, existing robotic approaches either focus primarily on grasping or target object-specific articulated manipulation, and they rarely incorporate explicit actuation energy into multi-objective optimisation, which limits their scalability and suitability for long-term deployment in real O&M settings. Therefore, this paper proposes an articulation-agnostic and energy-aware reinforcement learning framework for robotic manipulation in intelligent infrastructure O&M. The method combines part-guided 3D perception, weighted point sampling, and PointNet-based encoding to obtain a compact geometric representation that generalises across heterogeneous articulated objects. Manipulation is formulated as a Constrained Markov Decision Process (CMDP), in which actuation energy is explicitly modelled and regulated via a Lagrangian-based constrained Soft Actor-Critic scheme. The policy is trained end-to-end under this CMDP formulation, enabling effective articulated-object operation while satisfying a long-horizon energy budget. Experiments on representative O&M tasks demonstrate 16%-30% reductions in energy consumption, 16%-32% fewer steps to success, and consistently high success rates, indicating a scalable and sustainable solution for infrastructure O&M manipulation.
comment: 18 pages, 5 figures, 7 tables. This version supersedes all previous preprint versions
Cutting the Cord: System Architecture for Low-Cost, GPU-Accelerated Bimanual Mobile Manipulation
We present a bimanual mobile manipulator built on the open-source XLeRobot with integrated onboard compute for less than $1300. Key contributions include: (1) optimized mechanical design maximizing stiffness-to-weight ratio, (2) a Tri-Bus power topology isolating compute from motor-induced voltage transients, and (3) embedded autonomy using NVIDIA Jetson Orin Nano for untethered operation. The platform enables teleoperation, autonomous SLAM navigation, and vision-based manipulation without external dependencies, providing a low-cost alternative for research and education in robotics and robot learning.
PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompt-aware encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-28.7% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments.
comment: 17 pages, 7 figures, 5 tables
RoboMorph: Evolving Robot Morphology using Large Language Models
We introduce RoboMorph, an automated approach for generating and optimizing modular robot designs using large language models (LLMs) and evolutionary algorithms. Each robot design is represented by a structured grammar, and we use LLMs to efficiently explore this design space. Traditionally, such exploration is time-consuming and computationally intensive. Using a best-shot prompting strategy combined with reinforcement learning (RL)-based control evaluation, RoboMorph iteratively refines robot designs within an evolutionary feedback loop. Across four terrain types, RoboMorph discovers diverse, terrain-specialized morphologies, including wheeled quadrupeds and hexapods, that match or outperform designs produced by Robogrammar's graph-search method. These results demonstrate that LLMs, when coupled with evolutionary selection, can serve as effective generative operators for automated robot design. Our project page and code are available at https://robomorph.github.io.
Stratified Topological Autonomy for Long-Range Coordination (STALC)
In this paper, we present Stratified Topological Autonomy for Long-Range Coordination (STALC), a hierarchical planning approach for multi-robot coordination in real-world environments with significant inter-robot spatial and temporal dependencies. At its core, STALC consists of a multi-robot graph-based planner which combines a topological graph with a novel, computationally efficient mixed-integer programming formulation to generate highly-coupled multi-robot plans in seconds. To enable autonomous planning across different spatial and temporal scales, we construct our graphs so that they capture connectivity between free-space regions and other problem-specific features, such as traversability or risk. We then use receding-horizon planners to achieve local collision avoidance and formation control. To evaluate our approach, we consider a multi-robot reconnaissance scenario where robots must autonomously coordinate to navigate through an environment while minimizing the risk of detection by observers. Through simulation-based experiments, we show that our approach is able to scale to address complex multi-robot planning scenarios. Through hardware experiments, we demonstrate our ability to generate graphs from real-world data and successfully plan across the entire hierarchy to achieve shared objectives.
comment: ©2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Multi-Step First: A Lightweight Deep Reinforcement Learning Strategy for Robust Continuous Control with Partial Observability
Deep Reinforcement Learning (DRL) has made considerable advances in simulated and physical robot control tasks, especially when problems admit a fully observed Markov Decision Process (MDP) formulation. When observations only partially capture the underlying state, the problem becomes a Partially Observable MDP (POMDP), and performance rankings between algorithms can change. We empirically compare Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Soft Actor-Critic (SAC) on representative POMDP variants of continuous-control benchmarks. Contrary to widely reported MDP results where TD3 and SAC typically outperform PPO, we observe an inversion: PPO attains higher robustness under partial observability. We attribute this to the stabilizing effect of multi-step bootstrapping. Furthermore, incorporating multi-step targets into TD3 (MTD3) and SAC (MSAC) improves their robustness. These findings provide practical guidance for selecting and adapting DRL algorithms in partially observable settings without requiring new theoretical machinery.
comment: 21 pages, 12 figures. Published in Neural Networks, Vol. 199, 2026
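For readers unfamiliar with multi-step targets, the following is a minimal sketch of a generic n-step bootstrapped return. The abstract does not specify how MTD3 and MSAC build their targets, so this is the textbook form rather than the paper's method; the function name and arguments are hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99, dones=None):
    """Generic n-step target: r_0 + gamma*r_1 + ... + gamma^n * V(s_n).

    `rewards` holds the n observed rewards, `bootstrap_value` is the critic's
    estimate at the state reached after n steps, and `dones[i]` marks whether
    the episode terminated at step i (bootstrapping stops there).
    """
    if dones is None:
        dones = [False] * len(rewards)
    target = bootstrap_value
    # Fold the return backwards so each step discounts everything after it.
    for reward, done in zip(reversed(rewards), reversed(dones)):
        target = reward + gamma * (0.0 if done else target)
    return target


if __name__ == "__main__":
    print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

With a single reward this reduces to the usual one-step TD target; longer reward sequences rely more on observed rewards and less on the critic's estimate.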
Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow
In emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. Particularly, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs and HDVs, significantly enhancing the experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.
HERE: Hierarchical Active Exploration of Radiance Field with Epistemic Uncertainty Minimization
We present HERE, an active 3D scene reconstruction framework based on neural radiance fields, enabling high-fidelity implicit mapping. Our approach centers around an active learning strategy for camera trajectory generation, driven by accurate identification of unseen regions, which supports efficient data acquisition and precise scene reconstruction. The key to our approach is epistemic uncertainty quantification based on evidential deep learning, which directly captures data insufficiency and exhibits a strong correlation with reconstruction errors. This allows our framework to more reliably identify unexplored or poorly reconstructed regions compared to existing methods, leading to more informed and targeted exploration. Additionally, we design a hierarchical exploration strategy that leverages learned epistemic uncertainty, where local planning extracts target viewpoints from high-uncertainty voxels based on visibility for trajectory generation, and global planning uses uncertainty to guide large-scale coverage for efficient and comprehensive reconstruction. The effectiveness of the proposed method in active 3D reconstruction is demonstrated by achieving higher reconstruction completeness compared to previous approaches on photorealistic simulated scenes across varying scales, while a hardware demonstration further validates its real-world applicability. Project page: https://taekbum.github.io/here/
comment: Accepted to IEEE RA-L. The first two authors contributed equally
sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only
Understanding articulated objects from monocular video is a crucial yet challenging task in robotics and digital twin creation. Existing methods often rely on complex multi-view setups, high-fidelity object scans, or fragile long-term point tracks that frequently fail in casual real-world captures. In this paper, we present sim2art, a data-driven framework that recovers the 3D part segmentation and joint parameters of articulated objects from a single monocular video captured by a freely moving camera. Our core insight is a robust representation based on per-frame surface point sampling, which we augment with short-term scene flow and DINOv3 semantic features. Unlike previous works that depend on error-prone long-term correspondences, our representation is easy to obtain and exhibits a negligible difference between simulation and reality without requiring domain adaptation. Also, by construction, our method relies on single-viewpoint visibility, ensuring that the geometric representation remains consistent across synthetic and real data despite noise and occlusions. Leveraging a suitable Transformer-based architecture, sim2art is trained exclusively on synthetic data yet generalizes strongly to real-world sequences. To address the lack of standardized benchmarks in the field, we introduce two datasets featuring a significantly higher diversity of object categories and instances than prior work. Our evaluations show that sim2art effectively handles large camera motions and complex articulations, outperforming state-of-the-art optimization-based and tracking-dependent methods. sim2art offers a scalable solution that can be easily extended to new object categories without the need for cumbersome real-world annotations. Project webpage: https://aartykov.github.io/sim2art/
Expand Your SCOPE: Semantic Cognition over Potential-Based Exploration for Embodied Visual Navigation AAAI 2026
Embodied visual navigation remains a challenging task, as agents must explore unknown environments with limited knowledge. Existing zero-shot studies have shown that incorporating memory mechanisms to support goal-directed behavior can improve long-horizon planning performance. However, they overlook visual frontier boundaries, which fundamentally dictate future trajectories and observations, and fall short of inferring the relationship between partial visual observations and navigation goals. In this paper, we propose Semantic Cognition Over Potential-based Exploration (SCOPE), a zero-shot framework that explicitly leverages frontier information to drive potential-based exploration, enabling more informed and goal-relevant decisions. SCOPE estimates exploration potential with a Vision-Language Model and organizes it into a spatio-temporal potential graph, capturing boundary dynamics to support long-horizon planning. In addition, SCOPE incorporates a self-reconsideration mechanism that revisits and refines prior decisions, enhancing reliability and reducing overconfident errors. Experimental results on two diverse embodied navigation tasks show that SCOPE outperforms state-of-the-art baselines by 4.6% in accuracy. Further analysis demonstrates that its core components lead to improved calibration, stronger generalization, and higher decision quality.
comment: Accepted to AAAI 2026
Risk-Aware Obstacle Avoidance Algorithm for Real-Time Applications
Robust navigation in changing marine environments requires autonomous systems capable of perceiving, reasoning, and acting under uncertainty. This study introduces a hybrid risk-aware navigation architecture that integrates probabilistic modeling of obstacles along the vehicle path with smooth trajectory optimization for autonomous surface vessels. The system constructs probabilistic risk maps that capture both obstacle proximity and the behavior of dynamic objects. A risk-biased Rapidly Exploring Random Tree (RRT) planner leverages these maps to generate collision-free paths, which are subsequently refined using B-spline algorithms to ensure trajectory continuity. Three distinct RRT* rewiring modes are implemented based on the cost function: minimizing the path length, minimizing risk, and optimizing a combination of the path length and total risk. The framework is evaluated in experimental scenarios containing both static and dynamic obstacles. The results demonstrate the system's ability to navigate safely, maintain smooth trajectories, and dynamically adapt to changing environmental risks. Compared with conventional LIDAR or vision-only navigation approaches, the proposed method shows improvements in operational safety and autonomy, establishing it as a promising solution for risk-aware autonomous vehicle missions in uncertain and dynamic environments.
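The three rewiring cost modes named in the abstract above (path length, risk, and a combination) might be expressed as a single edge-cost function, as in the sketch below. The abstract gives no formulas, so everything here is an assumption: the name `edge_cost`, the `risk_at` callable standing in for a probabilistic risk map, the midpoint-rule integration, and the weighting parameter `alpha`.

```python
import math


def edge_cost(p, q, risk_at, mode="combined", alpha=0.5, samples=10):
    """Cost of the straight edge p -> q under one of three hypothetical modes.

    `risk_at(point)` returns the risk-map value at a point. Risk along the
    edge is approximated with the midpoint rule over `samples` sub-segments.
    """
    length = math.dist(p, q)
    total = 0.0
    for i in range(samples):
        t = (i + 0.5) / samples
        point = tuple(a + t * (b - a) for a, b in zip(p, q))
        total += risk_at(point)
    risk = total / samples * length  # approximate line integral of risk

    if mode == "length":
        return length
    if mode == "risk":
        return risk
    if mode == "combined":
        # Assumed convex combination; the paper may weight terms differently.
        return alpha * length + (1.0 - alpha) * risk
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    uniform_risk = lambda point: 2.0
    for mode in ("length", "risk", "combined"):
        print(mode, edge_cost((0.0, 0.0), (3.0, 4.0), uniform_risk, mode=mode))
```

An RRT*-style planner would call such a function when choosing a parent and when rewiring, so switching `mode` changes which paths count as cheaper without touching the rest of the planner.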
Barrier-Riccati Synthesis for Nonlinear Safe Control with Expanded Region of Attraction
We present a Riccati-based framework for safety-critical nonlinear control that integrates the barrier states (BaS) methodology with the State-Dependent Riccati Equation (SDRE) approach. The BaS formulation embeds safety constraints into the system dynamics via auxiliary states, enabling safety to be treated as a control objective. To overcome the limited region of attraction in linear BaS controllers, we extend the framework to nonlinear systems using SDRE synthesis applied to the barrier-augmented dynamics and derive a matrix inequality condition that certifies forward invariance of a large region of attraction and guarantees asymptotic safe stabilization. The resulting controller is computed online via pointwise Riccati solutions. We validate the method on an unstable constrained system and cluttered quadrotor navigation tasks, demonstrating improved constraint handling, scalability, and robustness near safety boundaries. This framework offers a principled and computationally tractable solution for synthesizing nonlinear safe feedback in safety-critical environments.
comment: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
AERO-MPPI: Anchor-Guided Ensemble Trajectory Optimization for Agile Mapless Drone Navigation ICRA 2026
Agile mapless navigation in cluttered 3D environments poses significant challenges for autonomous drones. Conventional mapping-planning-control pipelines incur high computational cost and propagate estimation errors. We present AERO-MPPI, a fully GPU-accelerated framework that unifies perception and planning through an anchor-guided ensemble of Model Predictive Path Integral (MPPI) optimizers. Specifically, we design a multi-resolution LiDAR point-cloud representation that rapidly extracts spatially distributed "anchors" as look-ahead intermediate endpoints, from which we construct polynomial trajectory guides to explore distinct homotopy path classes. At each planning step, we run multiple MPPI instances in parallel and evaluate them with a two-stage multi-objective cost that balances collision avoidance and goal reaching. Implemented entirely with NVIDIA Warp GPU kernels, AERO-MPPI achieves real-time onboard operation and mitigates the local-minima failures of single-MPPI approaches. Extensive simulations in forests, verticals, and inclines demonstrate sustained reliable flight above 7 m/s, with success rates above 80% and smoother trajectories compared to state-of-the-art baselines. Real-world experiments on a LiDAR-equipped quadrotor with NVIDIA Jetson Orin NX 16G confirm that AERO-MPPI runs in real time onboard and consistently achieves safe, agile, and robust flight in complex cluttered environments. Code is available at https://github.com/XinChen-stars/AERO_MPPI.
comment: Accepted by ICRA 2026
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface CVPR 2026
Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework. Project website is https://real2edit2real.github.io/.
comment: Accepted to CVPR 2026
Reactive Slip Control in Multifingered Grasping: Hybrid Tactile Sensing and Internal-Force Optimization ICRA
We build a low-level reflex control layer driven by fast tactile feedback for multifinger grasp stabilization. Our hybrid approach combines learned tactile slip detection with model-based internal-force control to halt in-hand slip while preserving the object-level wrench. The multimodal tactile stack integrates piezoelectric sensing (PzE) for fast slip cues and piezoresistive arrays (PzR) for contact localization, enabling online construction of a contact-centric grasp representation without prior object knowledge. Experiments demonstrate reactive stabilization of multifingered grasps under external perturbations, without explicit friction models or direct force sensing. In controlled trials, slip onset is detected after 20.4 +/- 6 ms. The framework yields a theoretical grasp response latency on the order of 30 ms, with grasp-model updates in less than 5 ms and internal-force selection in about 4 ms. The analysis supports the feasibility of sub-50 ms tactile-driven grasp responses, aligned with human reflex baselines.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), 2026
Physics-Informed Policy Optimization via Analytic Dynamics Regularization
Reinforcement learning (RL) has achieved strong performance in robotic control; however, state-of-the-art policy learning methods, such as actor-critic methods, still suffer from high sample complexity and often produce physically inconsistent actions. This limitation stems from neural policies implicitly rediscovering complex physics from data alone, despite accurate dynamics models being readily available in simulators. In this paper, we introduce a novel physics-informed RL framework, called PIPER, that seamlessly integrates physical constraints directly into neural policy optimization with analytical soft physics constraints. At the core of our method is the integration of a differentiable Lagrangian residual as a regularization term within the actor's objective. This residual, extracted from a robot's simulator description, subtly biases policy updates towards dynamically consistent solutions. Crucially, this physics integration is realized through an additional loss term during policy optimization, requiring no alterations to existing simulators or core RL algorithms. Extensive experiments demonstrate that our method significantly improves learning efficiency, stability, and control accuracy, establishing a new paradigm for efficient and physically consistent robotic control.
comment: 11 pages, 8 figures
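As a rough illustration of the regularization idea in the abstract above, the sketch below adds a squared Lagrangian residual to an actor loss for a single pendulum. The pendulum model, the function names, and the weight `lam` are assumptions made for this example; the abstract does not give the exact form of the residual or its weighting, and a real implementation would use an autodiff framework rather than NumPy.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2


def lagrangian_residual(q, qdd, tau, mass=1.0, length=1.0):
    """Residual of the pendulum equation m*l^2*qdd + m*g*l*sin(q) = tau.

    It is zero when the torque tau is dynamically consistent with (q, qdd).
    """
    return mass * length**2 * qdd + mass * G * length * np.sin(q) - tau


def regularized_actor_loss(actor_loss, q, qdd, tau, lam=0.1):
    """Actor loss plus a soft physics penalty (hypothetical weighting lam)."""
    residual = lagrangian_residual(q, qdd, tau)
    return actor_loss + lam * np.mean(residual**2)


q = np.array([0.0, np.pi / 2])
qdd = np.array([1.0, 0.0])
consistent_tau = 1.0 * qdd + 1.0 * G * np.sin(q)  # satisfies the dynamics
loss_consistent = regularized_actor_loss(2.0, q, qdd, consistent_tau)
loss_inconsistent = regularized_actor_loss(2.0, q, qdd, consistent_tau + 1.0)
```

In this toy setting the penalty vanishes for the consistent torque and grows with the squared torque error, which is the sense in which such a term would bias updates toward dynamically consistent actions.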
Contractive Diffusion Policies: Robust Action Diffusion via Contractive Score-Based Sampling with Differential Equations ICLR 2026
Diffusion policies have emerged as powerful generative models for offline policy learning, whose sampling process can be rigorously characterized by a score function guiding a stochastic differential equation (SDE). However, the same score-based SDE modeling that grants diffusion policies the flexibility to learn diverse behavior also incurs solver and score-matching errors, large data requirements, and inconsistencies in action generation. While less critical in image generation, these inaccuracies compound and lead to failure in continuous control settings. We introduce contractive diffusion policies (CDPs) to induce contractive behavior in the diffusion sampling dynamics. Contraction pulls nearby flows closer to enhance robustness against solver and score-matching errors while reducing unwanted action variance. We develop an in-depth theoretical analysis along with a practical implementation recipe to incorporate CDPs into existing diffusion policy architectures with minimal modification and computational cost. We evaluate CDPs for offline learning by conducting extensive experiments in simulation and real-world settings. Across benchmarks, CDPs often outperform baseline policies, with pronounced benefits under data scarcity.
comment: Published as a conference paper at ICLR 2026
RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach
Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6), compared with -0.55e6 for vanilla SAC. Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
LAOF: Robust Latent Action Learning with Optical Flow Constraints CVPR 2026
Learning latent actions from large-scale videos is crucial for the pre-training of scalable embodied foundation models, yet existing methods often struggle with action-irrelevant distractors. Although incorporating action supervision can alleviate these distractions, its effectiveness is restricted by the scarcity of available action labels. Optical flow represents pixel-level motion between consecutive frames, naturally suppressing background elements and emphasizing moving objects. Motivated by this, we propose robust Latent Action learning with Optical Flow constraints, called LAOF, a pseudo-supervised framework that leverages the agent's optical flow as an action-driven signal to learn latent action representations robust to distractors. Experimental results show that the latent representations learned by LAOF outperform existing methods on downstream imitation learning and reinforcement learning tasks. This superior performance arises from optical flow constraints, which substantially stabilize training and improve the quality of latent representations under extremely label-scarce conditions, while remaining effective as the proportion of action labels increases to 10 percent. Importantly, even without action supervision, LAOF matches or surpasses action-supervised methods trained with 1 percent of action labels.
comment: CVPR 2026; Project page: https://github.com/XizoB/LAOF
DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving
Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
comment: The project page is available at https://shiftwilliam.github.io/DriveCode
PACE: Physics Augmentation for Coordinated End-to-end Reinforcement Learning toward Versatile Humanoid Table Tennis
Humanoid table tennis (TT) demands rapid perception, proactive whole-body motion, and agile footwork under strict timing--capabilities that remain difficult for end-to-end control policies. We propose a reinforcement learning (RL) framework that maps ball-position observations directly to whole-body joint commands for both arm striking and leg locomotion, strengthened by predictive signals and dense, physics-guided rewards. A lightweight learned predictor, fed with recent ball positions, estimates future ball states and augments the policy's observations for proactive decision-making. During training, a physics-based predictor supplies precise future states to construct dense, informative rewards that lead to effective exploration. The resulting policy attains strong performance across varied serve ranges (hit rate $\geq$ 96% and success rate $\geq$ 92%) in simulations. Ablation studies confirm that both the learned predictor and the predictive reward design are critical for end-to-end learning. Deployed zero-shot on a physical Booster T1 humanoid with 23 revolute joints, the policy produces coordinated lateral and forward-backward footwork with accurate, fast returns, suggesting a practical path toward versatile, competitive humanoid TT. We have open-sourced our RL training code at: https://github.com/purdue-tracelab/TTRL-ICRA2026
VertiAdaptor: Online Kinodynamics Adaptation for Vertically Challenging Terrain
Autonomous driving in off-road environments presents significant challenges due to the dynamic and unpredictable nature of unstructured terrain. Traditional kinodynamic models often struggle to generalize across diverse geometric and semantic terrain types, underscoring the need for real-time adaptation to ensure safe and reliable navigation. We propose VertiAdaptor (VA), a novel online adaptation framework that efficiently integrates elevation with semantic embeddings to enable terrain-aware kinodynamic modeling and planning via function encoders. VA learns a kinodynamic space spanned by a set of neural ordinary differential equation basis functions, capturing complex vehicle-terrain interactions across varied environments. After offline training, the proposed approach can rapidly adapt to new, unseen environments by identifying kinodynamics in the learned space through a computationally efficient least-squares calculation. We evaluate VA within the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance both in simulation and on a physical Verti-4-Wheeler platform. Our results demonstrate that VA improves prediction accuracy by up to 23.9% and achieves a 5X faster adaptation time, advancing the robustness and reliability of autonomous robots in complex and evolving off-road environments.
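The least-squares adaptation step mentioned in the abstract above can be sketched as fitting coefficients over a fixed set of basis functions. In this minimal example the basis functions are hand-picked closed-form functions standing in for the learned neural ODE bases, and all names and data are hypothetical.

```python
import numpy as np


def basis_features(x):
    """Stack hypothetical fixed basis functions g_1..g_3 evaluated at x."""
    return np.stack([np.ones_like(x), x, np.sin(x)], axis=1)


def adapt_coefficients(x, y):
    """Least-squares fit of coefficients c so that basis_features(x) @ c ~ y."""
    coeffs, *_ = np.linalg.lstsq(basis_features(x), y, rcond=None)
    return coeffs


# Synthetic "new terrain" data generated from known coefficients.
true_coeffs = np.array([0.5, -1.2, 2.0])
x_new = np.linspace(0.0, 3.0, 20)
y_new = basis_features(x_new) @ true_coeffs

estimated = adapt_coefficients(x_new, y_new)
```

Because the bases stay fixed after offline training, adapting to a new environment reduces to one small linear solve, which is one plausible reason such an adaptation step can be fast.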
CAR: Cross-Vehicle Kinodynamics Adaptation via Mobility Representation
Developing autonomous off-road mobility typically requires either extensive, platform-specific data collection or relies on simplified abstractions, such as unicycle or bicycle models, that fail to capture the complex kinodynamics of diverse platforms, ranging from wheeled to tracked vehicles. This limitation hinders scalability across evolving heterogeneous autonomous robot fleets. To address this challenge, we propose Cross-vehicle kinodynamics Adaptation via mobility Representation (CAR), a novel framework that enables rapid mobility transfer to new vehicles. CAR employs a Transformer encoder with Adaptive Layer Normalization to embed vehicle trajectory transitions and physical configurations into a shared mobility latent space. By identifying and extracting commonality from nearest neighbors within this latent space, our approach enables rapid kinodynamics adaptation to novel platforms with minimal data collection and computational overhead. We evaluate CAR using the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance on four distinct physical configurations of the Verti-4-Wheeler platform. With only one minute of new trajectory data, CAR achieves up to 67.2% reduction in prediction error compared to direct neighbor transfer across diverse unseen vehicle configurations, demonstrating the effectiveness of cross-vehicle mobility knowledge transfer in both simulated and real-world environments.
CoViLLM: An Adaptive Human-Robot Collaborative Assembly Framework Using Large Language Models
With increasing demand for mass customization, traditional manufacturing robots that rely on rule-based operations lack the flexibility to accommodate customized or new product variants. Human-Robot Collaboration has demonstrated potential to improve system adaptability by leveraging human versatility and decision-making capabilities. However, existing Human-Robot Collaborative frameworks typically depend on predefined perception-manipulation pipelines, limiting their ability to autonomously generate task plans for new product assembly. In this work, we propose CoViLLM, an adaptive human-robot collaborative assembly framework that supports the assembly of customized and previously unseen products. CoViLLM combines depth-camera-based localization for object position estimation, human operator classification for identifying new components, and a Large Language Model for assembly task planning based on natural language instructions. The framework is validated on the NIST Assembly Task Board for known, customized, and new product cases. Experimental results show that the proposed framework enables flexible collaborative assembly by extending Human-Robot Collaboration beyond predefined product and task settings.
comment: 6 pages, 7 figures. Accepted to ASME MSEC 2026
Multiagent Systems
Cyber Deception for Mission Surveillance via Hypergame-Theoretic Deep Reinforcement Learning
Unmanned Aerial Vehicles (UAVs) are valuable for mission-critical systems like surveillance, rescue, or delivery. Not surprisingly, such systems attract cyberattacks, including Denial-of-Service (DoS) attacks to overwhelm the resources of mission drones (MDs). How can we defend UAV mission systems against DoS attacks? We adopt cyber deception as a defense strategy, in which honey drones (HDs) are proposed to bait and divert attacks. The attack and deceptive defense hinge upon radio signal strength: The attacker selects victim MDs based on their signals, and HDs attract the attacker from afar by emitting stronger signals, despite this reducing battery life. We formulate an optimization problem for the attacker and defender to identify their respective strategies for maximizing mission performance while minimizing energy consumption. To address this problem, we propose a novel approach, called HT-DRL. HT-DRL identifies optimal solutions without a long learning convergence time by incorporating the solutions of hypergame theory into the neural network of deep reinforcement learning. This yields a systematic way to intelligently deceive attackers. We analyze the performance of diverse defense mechanisms under different attack strategies. Further, the HT-DRL-based HD approach improves mission performance by up to a factor of two over existing non-HD counterparts while incurring low energy consumption.
comment: 23 pages, 21 figures
Learning to Aggregate Zero-Shot LLM Agents for Corporate Disclosure Classification
This paper studies whether a lightweight trained aggregator can combine diverse zero-shot large language model judgments into a stronger downstream signal for corporate disclosure classification. Zero-shot LLMs can read disclosures without task-specific fine-tuning, but their predictions often vary across prompts, reasoning styles, and model families. I address this problem with a multi-agent framework in which three zero-shot agents independently read each disclosure and output a sentiment label, a confidence score, and a short rationale. A logistic meta-classifier then aggregates these signals to predict next-day stock return direction. I use a sample of 18,420 U.S. corporate disclosures issued by Nasdaq and S&P 500 firms between 2018 and 2024, matched to next-day stock returns. Results show that the trained aggregator outperforms all single agents, majority vote, confidence-weighted voting, and a FinBERT baseline. Balanced accuracy rises from 0.561 for the best single agent to 0.612 for the trained aggregator, with the largest gains in disclosures combining strong current performance with weak guidance or elevated risk. The results suggest that zero-shot LLM agents capture complementary financial signals and that supervised aggregation can turn cross-agent disagreement into a more useful classification target.
Agentic Physical-AI for Self-Aware RF Systems
Intelligent control of RF transceivers adapting to dynamic operational conditions is essential in modern and future communication systems. We propose a multi-agent neurosymbolic AI system in which AI agents are assigned to circuit components. Each agent comprises an internal model and a corresponding control algorithm. Modeling of the IF amplifier shows promising results, and the same approach can be extended to all components, thus creating a fully intelligent RF system.
comment: 2 pages, 3 figures, Accepted for 2026 International Applied Computational Electromagnetics Society (ACES) Symposium
Towards Intelligent Geospatial Data Discovery: a knowledge graph-driven multi-agent framework powered by large language models
The rapid growth in the volume, variety, and velocity of geospatial data has created data ecosystems that are highly distributed, heterogeneous, and semantically inconsistent. Existing data catalogs, portals, and infrastructures still rely largely on keyword-based search with limited semantic support, which often fails to capture user intent and leads to weak retrieval performance. To address these challenges, this study proposes a knowledge graph-driven multi-agent framework for intelligent geospatial data discovery, powered by large language models. The framework introduces a unified geospatial metadata ontology as a semantic mediation layer to align heterogeneous metadata standards across platforms and constructs a geospatial metadata knowledge graph to explicitly model datasets and their multidimensional relationships. Building on the structured representation, the framework adopts a multi-agent collaborative architecture to perform intent parsing, knowledge graph retrieval, and answer synthesis, forming an interpretable and closed-loop discovery process from user queries to results. Results from representative use cases and performance evaluation show that the framework substantially improves intent matching accuracy, ranking quality, recall, and discovery transparency compared with traditional systems. This study advances geospatial data discovery toward a more semantic, intent-aware, and intelligent paradigm, providing a practical foundation for next-generation intelligent and autonomous spatial data infrastructures and contributing to the broader vision of Autonomous GIS.
Position: Multi-Agent Algorithmic Care Systems Demand Contestability for Trustworthy AI
Multi-agent systems (MAS) are increasingly used in healthcare to support complex decision-making through collaboration among specialized agents. Because these systems act as collective decision-makers, they raise challenges for trust, accountability, and human oversight. Existing approaches to trustworthy AI largely rely on explainability, but explainability alone is insufficient in multi-agent settings, as it does not enable care partners to challenge or correct system outputs. To address this limitation, Contestable AI (CAI) characterizes systems that support effective human challenge throughout the decision-making lifecycle by providing transparency, structured opportunities for intervention, and mechanisms for review, correction, or override. This position paper argues that contestability is a necessary design requirement for trustworthy multi-agent algorithmic care systems. We identify key limitations in current MAS and Explainable AI (XAI) research and present a human-in-the-loop framework that integrates structured argumentation and role-based contestation to preserve human agency, clinical responsibility, and trust in high-stakes care contexts.
LASER: Level-Based Asynchronous Scheduling and Execution Regime for Spatiotemporally Constrained Multi-Robot Timber Manufacturing ICRA 2026
Automating large-scale manufacturing in domains like timber construction requires multi-robot systems to manage tightly coupled spatiotemporal constraints, such as collision avoidance and process-driven deadlines. This paper introduces LASER (Level-based Asynchronous Scheduling and Execution Regime), a complete framework for scheduling and executing complex assembly tasks, demonstrated on a screw-press gluing application for timber slab manufacturing. Our central contribution is to integrate a barrier-based mechanism into a constraint programming (CP) scheduling formulation that partitions tasks into spatiotemporally disjoint sets, which we define as levels. This structure enables robots to execute tasks in parallel and asynchronously within a level, synchronizing only at level barriers, which guarantees collision-free operation by construction and provides robustness to timing uncertainties. To solve this formulation for large problems, we propose two specialized algorithms: an iterative temporal-relaxation approach for heterogeneous task sequences and a bi-level decomposition for homogeneous tasks that balances workload. We validate the LASER framework by fabricating a full-scale 2.4m x 6m timber slab with a two-robot system mounted on parallel linear tracks, successfully coordinating 108 subroutines and 352 screws under tight adhesive time windows. Computational studies show our method scales steadily with size compared to a monolithic approach.
comment: to be published in ICRA 2026. Supplementary video: https://youtu.be/EG1GCOX3zT4?si=4mNuQS0QWAo6RDZp
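The execution side of the level idea described above (asynchronous work inside a level, synchronization only at level boundaries) can be sketched with a reusable thread barrier. This is only a toy illustration under assumed names and a made-up schedule; it does not reproduce the constraint programming formulation that produces the levels.

```python
import threading

# Hypothetical schedule: levels[k][r] lists the tasks robot r runs in level k.
levels = [
    [["a1", "a2"], ["b1"]],
    [["a3"], ["b2", "b3"]],
]
NUM_ROBOTS = 2

barrier = threading.Barrier(NUM_ROBOTS)
log = []  # (level index, task name) in completion order
log_lock = threading.Lock()


def run_robot(robot_id):
    for level_index, level in enumerate(levels):
        for task in level[robot_id]:
            with log_lock:
                log.append((level_index, task))
        barrier.wait()  # synchronize only at the level boundary


threads = [threading.Thread(target=run_robot, args=(r,)) for r in range(NUM_ROBOTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

level_order = [level_index for level_index, _ in log]
```

Within a level the interleaving of the two robots is unconstrained, but no task from a later level can be logged before every task of the earlier level has finished.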
The Coordination Gap: Multi-Agent Alternation Metrics for Temporal Fairness in Repeated Games
Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind (they cannot distinguish structured alternation from monopolistic or random access patterns) and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation, a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
comment: 42 pages, 5 figures, 4 tables, 1 supplementary pdf. Submitted to Social Choice & Welfare
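The claim that outcome-based metrics are temporally blind can be illustrated with two access sequences that have identical win counts but very different temporal structure. The `switch_rate` score below is a hypothetical stand-in chosen for this sketch; it is not one of the six ALT metrics, which the abstract does not define.

```python
def min_max_fairness(winners, n_agents):
    """Ratio of the smallest to the largest per-agent win count."""
    counts = [winners.count(a) for a in range(n_agents)]
    return min(counts) / max(counts)


def switch_rate(winners):
    """Fraction of consecutive rounds in which the winner changes (hypothetical score)."""
    changes = sum(1 for prev, cur in zip(winners, winners[1:]) if prev != cur)
    return changes / (len(winners) - 1)


alternating = [0, 1] * 4      # 0, 1, 0, 1, ...
blocked = [0] * 4 + [1] * 4   # 0, 0, 0, 0, 1, 1, 1, 1

fair_alt = min_max_fairness(alternating, 2)
fair_blk = min_max_fairness(blocked, 2)
rate_alt = switch_rate(alternating)
rate_blk = switch_rate(blocked)
```

Both sequences receive the same min/max fairness, while the temporally sensitive score separates strict turn-taking from a blocked access pattern.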
A Unified Cloud-Edge-Terminal Framework for Multimodal Integrated Sensing and Communication
The transition to 6G calls for tightly integrated sensing and communication to support mission-critical services such as autonomous driving, embodied AI, and high-precision telemedicine. However, most existing ISAC designs rely on a single sensing modality (often RF), which limits environmental understanding and becomes a bottleneck in complex and dynamic scenes. This motivates a shift from single-modal to multimodal ISAC, where heterogeneous sensors (e.g., radar, LiDAR, and cameras) complement each other to improve robustness and semantic awareness. In this article, we first summarize key challenges for multimodal ISAC, including heterogeneous fusion, communication overhead, and scalable system design. We then highlight three enabling technologies: large AI models, semantic communications, and multi-agent systems, and discuss how their combination can enable task-oriented multimodal perception. Building on these insights, we propose a unified cloud-edge-terminal (CET) framework that hierarchically distributes intelligence and supports three adaptive operation modes: global fusion mode (GFM), cooperative relay mode (CRM), and peer interaction mode (PIM). A case study evaluates the framework across three modes, demonstrating that GFM achieves the highest accuracy, PIM minimizes latency, and CRM strikes an optimal balance between performance and efficiency. Finally, we conclude with open research issues and future directions.
Systems and Control (EESS)
Physics-Informed Graph Neural Jump ODEs for Cascading Failure Prediction in Power Grids
Cascading failures in power grids pose severe risks to infrastructure reliability, yet real-time prediction of their progression remains an open challenge. Physics-based simulators require minutes to hours per scenario, while existing graph neural network approaches treat cascading failures as static classification tasks, ignoring temporal evolution and physical laws. This paper proposes Physics-Informed Graph Neural Jump ODEs (PI-GN-JODE), combining an edge-conditioned graph neural network encoder, a Neural ODE for continuous power redistribution, a jump process handler for discrete relay trips, and Kirchhoff-based physics regularization. The model simultaneously predicts edge and node failure probabilities, severity classification, and demand not served, while an autoregressive extension enables round-by-round temporal cascade prediction. Evaluated on the IEEE 24-bus and 118-bus systems with 20,000 scenarios each, PI-GN-JODE achieves a Precision--Recall Area Under the Curve of 0.991 for edge failure detection, 0.973 for node failure detection, and a coefficient of determination of 0.951 for demand-not-served regression on the 118-bus system, outperforming a standard graph convolutional network baseline (0.948, 0.925, and 0.912, respectively). Ablation studies reveal that the four components function synergistically, with the physics-informed loss alone contributing +9.2 points to demand-not-served regression. Performance improves when scaling to larger grids, and the architecture achieves the highest balanced accuracy (0.996) on the PowerGraph benchmark using data from a different simulation framework.
comment: 10 pages, 6 figures
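One common way to express a Kirchhoff-based regularizer is as a penalty on nodal power imbalance. The sketch below assumes a lossless three-bus network, a signed incidence matrix, and a mean-squared penalty; these choices and all names are illustrative assumptions, since the abstract does not state the exact loss.

```python
import numpy as np

# Hypothetical 3-bus network with lines (0->1), (1->2), (0->2).
# incidence[i, e] = +1 if line e leaves bus i, -1 if it enters bus i.
incidence = np.array([
    [1.0, 0.0, 1.0],
    [-1.0, 1.0, 0.0],
    [0.0, -1.0, -1.0],
])


def kirchhoff_penalty(line_flows, injections):
    """Mean squared nodal balance violation: incidence @ flows - injections."""
    imbalance = incidence @ line_flows - injections
    return float(np.mean(imbalance**2))


injections = np.array([1.0, 0.0, -1.0])       # bus 0 generates, bus 2 consumes
balanced_flows = np.array([0.4, 0.4, 0.6])    # balance holds at every bus
unbalanced_flows = np.array([0.4, 0.1, 0.6])  # balance violated at buses 1 and 2

penalty_ok = kirchhoff_penalty(balanced_flows, injections)
penalty_bad = kirchhoff_penalty(unbalanced_flows, injections)
```

Added to a training loss, a term of this kind would penalize predicted flows that do not conserve power at each bus.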
Achieving $\widetilde{O}(1/\epsilon)$ Sample Complexity for Bilinear Systems Identification under Bounded Noises
This paper studies finite-sample set-membership identification for discrete-time bilinear systems under bounded symmetric log-concave disturbances. Compared with existing finite-sample results for linear systems and related analyses under stronger noise assumptions, we consider the more challenging bilinear setting with trajectory-dependent regressors and allow marginally stable dynamics with polynomial mean-square state growth. Under these conditions, we prove that the diameter of the feasible parameter set shrinks with sample complexity $\widetilde{O}(1/\epsilon)$. Simulation supports the theory and illustrates the advantage of the proposed estimator for uncertainty quantification.
Towards Certified Sim-to-Real Transfer via Stochastic Simulation-Gap Functions
This paper introduces the notion of stochastic simulation-gap function, which formally quantifies the gap between an approximate mathematical model and a high-fidelity stochastic simulator. Since controllers designed for the mathematical model may fail in practice due to unmodeled gaps, the stochastic simulation-gap function enables the simulator to be interpreted as the nominal model with bounded state- and input-dependent disturbances. We propose a data-driven approach and establish a formal guarantee on the quantification of this gap. Leveraging the stochastic simulation-gap function, we design a controller for the mathematical model that ensures the desired specification is satisfied in the high-fidelity simulator with high confidence, taking a step toward bridging the sim-to-real gap. We demonstrate the effectiveness of the proposed method using a TurtleBot model and a pendulum system in stochastic simulators.
EQISA: Energy-efficient Quantum Instruction Set Architecture using Sparse Dictionary Learning
The scalability of quantum computing in supporting sophisticated algorithms critically depends not only on qubit quality and error handling, but also on the efficiency of classical control, constrained by the cryogenic control bandwidth and energy budget. In this work, we address this challenge by investigating the algorithmic complexity of quantum circuits at the instruction set architecture (ISA) level. We introduce an energy-efficient quantum instruction set architecture (EQISA) that synthesizes quantum circuits in a discrete Solovay-Kitaev basis of fixed depth and encodes instruction streams using a sparse dictionary learned from decomposing a set of Haar-random unitaries, followed by entropy-optimal Huffman coding and an additional lossless bzip2 compression stage. This approach is evaluated on benchmark quantum circuits demonstrating over 60% compression of quantum instruction streams across system sizes, enabling proportional reductions in classical control energy and communication overhead without loss of computational fidelity. Beyond compression, EQISA facilitates the discovery of higher-level composable abstractions in quantum circuits and provides estimates of quantum algorithmic complexity. These findings position EQISA as an impactful direction for improving the energy efficiency and scalability of quantum control architectures.
comment: associated repository: https://github.com/Advanced-Research-Centre/EQISA/
Energy-Aware Reinforcement Learning for Robotic Manipulation of Articulated Components in Infrastructure Operation and Maintenance
With the growth of intelligent civil infrastructure and smart cities, operation and maintenance (O&M) increasingly requires safe, efficient, and energy-conscious robotic manipulation of articulated components, including access doors, service drawers, and pipeline valves. However, existing robotic approaches either focus primarily on grasping or target object-specific articulated manipulation, and they rarely incorporate explicit actuation energy into multi-objective optimisation, which limits their scalability and suitability for long-term deployment in real O&M settings. Therefore, this paper proposes an articulation-agnostic and energy-aware reinforcement learning framework for robotic manipulation in intelligent infrastructure O&M. The method combines part-guided 3D perception, weighted point sampling, and PointNet-based encoding to obtain a compact geometric representation that generalises across heterogeneous articulated objects. Manipulation is formulated as a Constrained Markov Decision Process (CMDP), in which actuation energy is explicitly modelled and regulated via a Lagrangian-based constrained Soft Actor-Critic scheme. The policy is trained end-to-end under this CMDP formulation, enabling effective articulated-object operation while satisfying a long-horizon energy budget. Experiments on representative O&M tasks demonstrate 16%-30% reductions in energy consumption, 16%-32% fewer steps to success, and consistently high success rates, indicating a scalable and sustainable solution for infrastructure O&M manipulation.
comment: 18 pages, 5 figures, 7 tables. This version supersedes all previous preprint versions
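A Lagrangian treatment of an energy constraint is often implemented as projected dual ascent on a multiplier that prices energy in the reward. The sketch below shows that generic pattern with made-up numbers; the step size, budget, and function names are assumptions, and the abstract does not specify the update rules actually used.

```python
def update_multiplier(lam, avg_energy, energy_budget, step_size=0.05):
    """Projected dual ascent: raise lam when the budget is exceeded, never below 0."""
    return max(0.0, lam + step_size * (avg_energy - energy_budget))


def penalized_reward(task_reward, energy, lam):
    """Reward seen by the policy once energy use is priced by lam."""
    return task_reward - lam * energy


lam = 0.0
for avg_energy in [12.0, 11.0, 9.0]:  # made-up per-episode energy averages
    lam = update_multiplier(lam, avg_energy, energy_budget=10.0)
```

The multiplier rises while average energy exceeds the budget and relaxes once the policy drops below it, so the energy penalty adapts during training instead of being hand-tuned.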
Antifragile perimeter control: Anticipating and gaining from disruptions with reinforcement learning
The optimal operation of transportation systems is often susceptible to unexpected disruptions. Many established control strategies reliant on mathematical models can struggle with real-world disruptions, leading to significant divergence from their anticipated efficiency. This study integrates the cutting-edge concept of antifragility with learning-based traffic control strategies to optimize urban road network operations under disruptions. Antifragile systems not only withstand and recover from stressors but also thrive and enhance performance in the presence of such adversarial events. Incorporating antifragile modules composed of traffic state derivatives and redundancy, a deep reinforcement learning algorithm is developed. Subsequently, it is evaluated in a cordon-shaped transportation network and a case study with real-world data. Promising results highlight that the proposed algorithm provides: (i) superior performance achieving up to 27.6% and 41.9% performance gain over baselines under increasing demand and supply disruptions, (ii) lower distribution skewness under disruptions, demonstrating its relative antifragility against baselines, (iii) effectiveness under limited observability due to real-world data availability constraints, and (iv) the robustness and transferability to be combined with various state-of-the-art RL frameworks. The proposed antifragile methodology is generalizable and holds potential for applications beyond traffic engineering, offering integration into control systems exposed to disruptions across various disciplines.
comment: 38 pages, 21 figures
Stratified Topological Autonomy for Long-Range Coordination (STALC)
In this paper, we present Stratified Topological Autonomy for Long-Range Coordination (STALC), a hierarchical planning approach for multi-robot coordination in real-world environments with significant inter-robot spatial and temporal dependencies. At its core, STALC consists of a multi-robot graph-based planner which combines a topological graph with a novel, computationally efficient mixed-integer programming formulation to generate highly-coupled multi-robot plans in seconds. To enable autonomous planning across different spatial and temporal scales, we construct our graphs so that they capture connectivity between free-space regions and other problem-specific features, such as traversability or risk. We then use receding-horizon planners to achieve local collision avoidance and formation control. To evaluate our approach, we consider a multi-robot reconnaissance scenario where robots must autonomously coordinate to navigate through an environment while minimizing the risk of detection by observers. Through simulation-based experiments, we show that our approach is able to scale to address complex multi-robot planning scenarios. Through hardware experiments, we demonstrate our ability to generate graphs from real-world data and successfully plan across the entire hierarchy to achieve shared objectives.
comment: ©2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Prescribed-Time Distributed Generalized Nash Equilibrium Seeking
This paper proposes the first fully distributed algorithm for finding the Generalized Nash Equilibrium (GNE) of a non-cooperative game with shared coupling constraints and general cost coupling at a user-prescribed finite time T. As a foundation, a centralized gradient-based prescribed-time convergence result is established for the GNE problem, extending the optimization Lyapunov function framework to gradient dynamics, the only known realization among existing alternatives that naturally decomposes into per-agent computations. Building on this, a fully distributed architecture is designed in which each agent concurrently runs three coupled dynamics: a prescribed-time distributed state observer, a gradient-based optimization law, and a dual consensus mechanism that enforces the shared-multiplier requirement of the variational GNE, thus guaranteeing convergence to the same solution as the centralized case. The simultaneous operation of these layers creates bidirectional perturbations between consensus and optimization, which are resolved through gain synchronization that matches the temporal singularities of the optimization and consensus layers, ensuring all error components vanish exactly at T. The Fischer-Burmeister reformulation renders the algorithm projection-free and guarantees constraint satisfaction at the deadline. Numerical simulations on a Nash-Cournot game and a time-critical sensor coverage problem validate the approach.
comment: 12 pages, 5 figures
Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow
In the emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. Particularly, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs and HDVs, significantly enhancing the experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.
Barrier-Riccati Synthesis for Nonlinear Safe Control with Expanded Region of Attraction
We present a Riccati-based framework for safety-critical nonlinear control that integrates the barrier states (BaS) methodology with the State-Dependent Riccati Equation (SDRE) approach. The BaS formulation embeds safety constraints into the system dynamics via auxiliary states, enabling safety to be treated as a control objective. To overcome the limited region of attraction in linear BaS controllers, we extend the framework to nonlinear systems using SDRE synthesis applied to the barrier-augmented dynamics and derive a matrix inequality condition that certifies forward invariance of a large region of attraction and guarantees asymptotic safe stabilization. The resulting controller is computed online via pointwise Riccati solutions. We validate the method on an unstable constrained system and cluttered quadrotor navigation tasks, demonstrating improved constraint handling, scalability, and robustness near safety boundaries. This framework offers a principled and computationally tractable solution for synthesizing nonlinear safe feedback in safety-critical environments.
comment: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
AERO-MPPI: Anchor-Guided Ensemble Trajectory Optimization for Agile Mapless Drone Navigation ICRA 2026
Agile mapless navigation in cluttered 3D environments poses significant challenges for autonomous drones. Conventional mapping-planning-control pipelines incur high computational cost and propagate estimation errors. We present AERO-MPPI, a fully GPU-accelerated framework that unifies perception and planning through an anchor-guided ensemble of Model Predictive Path Integral (MPPI) optimizers. Specifically, we design a multi-resolution LiDAR point-cloud representation that rapidly extracts spatially distributed "anchors" as look-ahead intermediate endpoints, from which we construct polynomial trajectory guides to explore distinct homotopy path classes. At each planning step, we run multiple MPPI instances in parallel and evaluate them with a two-stage multi-objective cost that balances collision avoidance and goal reaching. Implemented entirely with NVIDIA Warp GPU kernels, AERO-MPPI achieves real-time onboard operation and mitigates the local-minima failures of single-MPPI approaches. Extensive simulations in forests, verticals, and inclines demonstrate sustained reliable flight above 7 m/s, with success rates above 80% and smoother trajectories compared to state-of-the-art baselines. Real-world experiments on a LiDAR-equipped quadrotor with NVIDIA Jetson Orin NX 16G confirm that AERO-MPPI runs in real time onboard and consistently achieves safe, agile, and robust flight in complex cluttered environments. Code is available at https://github.com/XinChen-stars/AERO_MPPI.
comment: Accepted by ICRA 2026
Reactive Slip Control in Multifingered Grasping: Hybrid Tactile Sensing and Internal-Force Optimization ICRA
We build a low-level reflex control layer driven by fast tactile feedback for multifinger grasp stabilization. Our hybrid approach combines learned tactile slip detection with model-based internal-force control to halt in-hand slip while preserving the object-level wrench. The multimodal tactile stack integrates piezoelectric sensing (PzE) for fast slip cues and piezoresistive arrays (PzR) for contact localization, enabling online construction of a contact-centric grasp representation without prior object knowledge. Experiments demonstrate reactive stabilization of multifingered grasps under external perturbations, without explicit friction models or direct force sensing. In controlled trials, slip onset is detected after 20.4 +/- 6 ms. The framework yields a theoretical grasp response latency on the order of 30 ms, with grasp-model updates in less than 5 ms and internal-force selection in about 4 ms. The analysis supports the feasibility of sub-50 ms tactile-driven grasp responses, aligned with human reflex baselines.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), 2026
Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator
Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workloads due to scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem further compounds in the context of shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of a hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules demonstrate efficient machine utilization and low average job latency in stochastic contexts.
comment: 30 pages, 18 figures, Conference version published in Int'l Conference on Computer Aided Design (ICCAD) 2025. Journal version (current version) is under revision with ACM TRETS
Robotics
GustPilot: A Hierarchical DRL-INDI Framework for Wind-Resilient Quadrotor Navigation
Wind disturbances remain a key barrier to reliable autonomous navigation for lightweight quadrotors, where the rapidly varying airflow can destabilize both planning and tracking. This paper introduces GustPilot, a hierarchical wind-resilient navigation stack in which a deep reinforcement learning (DRL) policy generates inertial-frame velocity references for gate traversal, while a geometric Incremental Nonlinear Dynamic Inversion (INDI) controller provides low-level tracking with fast residual disturbance rejection. The INDI layer achieves this by providing incremental feedback on both specific linear acceleration and angular acceleration rate, using onboard sensor measurements to reject wind disturbances rapidly. Robustness is obtained through a two-level strategy: wind-aware planning learned via fan-jet domain randomization during training, and rapid execution-time disturbance rejection by the INDI tracking controller. We evaluate GustPilot in real flights on a 50 g quadcopter platform against a DRL-PID baseline across four scenarios ranging from no-wind to fully dynamic conditions with a moving gate and a moving disturbance source. Despite being trained only in a minimal single-gate and single-fan setup, the policy generalizes to significantly more complex environments (up to six gates and four fans) without retraining. Across 80 experiments, DRL-INDI achieves an average Overall Success Rate (OSR) of 94.7% versus 55.0% for DRL-PID, reduces tracking RMSE by up to 50%, and sustains speeds up to 1.34 m/s under wind disturbances up to 3.5 m/s. These results demonstrate that combining DRL-based velocity planning with structured INDI disturbance rejection provides a practical and generalizable approach to wind-resilient autonomous flight navigation.
comment: 8 pages, 5 figures
Radar-Inertial Odometry with Online Spatio-Temporal Calibration via Continuous-Time IMU Modeling
Radar-Inertial Odometry (RIO) has emerged as a robust alternative to vision- and LiDAR-based odometry in challenging conditions such as low light, fog, featureless environments, or adverse weather. However, many existing RIO approaches assume known radar-IMU extrinsic calibration or rely on sufficient motion excitation for online extrinsic estimation, while temporal misalignment between sensors is often neglected or treated independently. In this work, we present a RIO framework that performs joint online spatial and temporal calibration within a factor-graph optimization formulation, based on continuous-time modeling of inertial measurements using uniform cubic B-splines. The proposed continuous-time representation of acceleration and angular velocity accurately captures the asynchronous nature of radar-IMU measurements, enabling reliable convergence of both the temporal offset and extrinsic calibration parameters, without relying on scan matching, target tracking, or environment-specific assumptions.
LIORNet: Self-Supervised LiDAR Snow Removal Framework for Autonomous Driving under Adverse Weather Conditions
LiDAR sensors provide high-resolution 3D perception and long-range detection, making them indispensable for autonomous driving and robotics. However, their performance significantly degrades under adverse weather conditions such as snow, rain, and fog, where spurious noise points dominate the point cloud and lead to false perception. To address this problem, various approaches have been proposed: distance-based filters exploiting spatial sparsity, intensity-based filters leveraging reflectance distributions, and learning-based methods that adapt to complex environments. Nevertheless, distance-based methods struggle to distinguish valid object points from noise, intensity-based methods often rely on fixed thresholds that lack adaptability to changing conditions, and learning-based methods suffer from the high cost of annotation, limited generalization, and computational overhead. In this study, we propose LIORNet, which eliminates these drawbacks and integrates the strengths of all three paradigms. LIORNet is built upon a U-Net++ backbone and employs a self-supervised learning strategy guided by pseudo-labels generated from multiple physical and statistical cues, including range-dependent intensity thresholds, snow reflectivity, point sparsity, and sensing range constraints. This design enables LIORNet to distinguish noise points from environmental structures without requiring manual annotations, thereby overcoming the difficulty of snow labeling and the limitations of single-principle approaches. Extensive experiments on the WADS and CADC datasets demonstrate that LIORNet outperforms state-of-the-art filtering algorithms in both accuracy and runtime while preserving critical environmental features. These results highlight LIORNet as a practical and robust solution for LiDAR perception in extreme weather, with strong potential for real-time deployment in autonomous driving systems.
comment: 14 pages, 6 figures, 2 tables
Sense4HRI: A ROS 2 HRI Framework for Physiological Sensor Integration and Synchronized Logging
Physiological signals are increasingly relevant to estimate the mental states of users in human-robot interaction (HRI), yet ROS 2-based HRI frameworks still lack reusable support to integrate such data streams in a standardized way. Therefore, we propose Sense4HRI, an adapted framework for human-robot interaction in ROS 2 that integrates physiological measurements and derived user-state indicators. The framework is designed to be extensible, allowing the integration of additional physiological sensors, their interpretation, and multimodal fusion to provide a robust assessment of the mental states of users. In addition, it introduces reusable interfaces for timestamped physiological time-series data and supports synchronized logging of physiological signals together with experiment context, enabling interoperable and traceable multimodal analysis within ROS 2-based HRI systems.
comment: 6 pages, 3 figures, submitted at IEEE RO-MAN 2026
Beyond detection: cooperative multi-agent reasoning for rapid onboard EO crisis response
Rapid identification of hazardous events is essential for next-generation Earth Observation (EO) missions supporting disaster response. However, current monitoring pipelines remain largely ground-centric, introducing latency due to downlink limitations, multi-source data fusion constraints, and the computational cost of exhaustive scene analysis. This work proposes a hierarchical multi-agent architecture for onboard EO processing under strict resource and bandwidth constraints. The system enables the exploitation of complementary multimodal observations by coordinating specialized AI agents within an event-driven decision pipeline. AI agents can be deployed across multiple nodes in a distributed setting, such as satellite platforms. An Early Warning agent generates fast hypotheses from onboard observations and selectively activates domain-specific analysis agents, while a Decision agent consolidates the evidence to issue a final alert. The architecture combines vision-language models, traditional remote sensing analysis tools, and role-specialized agents to enable structured reasoning over multimodal observations while minimizing unnecessary computation. A proof-of-concept implementation was executed on the engineering model of an edge-computing platform currently deployed in orbit, using representative satellite data. Experiments on wildfire and flood monitoring scenarios show that the proposed routing-based pipeline significantly reduces computational overhead while maintaining coherent decision outputs, demonstrating the feasibility of distributed agent-based reasoning for future autonomous EO constellations.
comment: Accepted for presentation at the ESA's 4S Symposium 2026 Conference (see https://atpi.eventsair.com/4s-symposium-2026/)
Multi-Agent Motion Planning on Industrial Magnetic Levitation Platforms: A Hybrid ADMM-HOCBF approach
This paper presents a novel hybrid motion planning method for holonomic multi-agent systems. The proposed decentralised model predictive control (MPC) framework tackles the intractability of classical centralised MPC for a growing number of agents while providing safety guarantees. This is achieved by combining a decentralised version of the alternating direction method of multipliers (ADMM) with a centralised high-order control barrier function (HOCBF) architecture. Simulation results show significant improvement in scalability over classical centralised MPC. We validate the efficacy and real-time capability of the proposed method by developing a highly efficient C++ implementation and deploying the resulting trajectories on a real industrial magnetic levitation platform.
comment: 8 pages, 4 figures, accepted to the European Control Conference 2026
Real-Time Structural Detection for Indoor Navigation from 3D LiDAR Using Bird's-Eye-View Images
Efficient structural perception is essential for mapping and autonomous navigation on resource-constrained robots. Existing 3D methods are computationally prohibitive, while traditional 2D geometric approaches lack robustness. This paper presents a lightweight, real-time framework that projects 3D LiDAR data into 2D Bird's-Eye-View (BEV) images to enable efficient detection of structural elements relevant to mapping and navigation. Within this representation, we systematically evaluate several feature extraction strategies, including classical geometric techniques (Hough Transform, RANSAC, and LSD) and a deep learning detector based on YOLO-OBB. The resulting detections are integrated through a spatiotemporal fusion module that improves stability and robustness across consecutive frames. Experiments conducted on a standard mobile robotic platform highlight clear performance trade-offs. Classical methods such as Hough and LSD provide fast responses but exhibit strong sensitivity to noise, with LSD producing excessive segment fragmentation that leads to system congestion. RANSAC offers improved robustness but fails to meet real-time constraints. In contrast, the YOLO-OBB-based approach achieves the best balance between robustness and computational efficiency, maintaining an end-to-end latency that satisfies 10 Hz operation while effectively filtering cluttered observations on a low-power single-board computer (SBC) without GPU acceleration. The main contribution of this work is a computationally efficient BEV-based perception pipeline enabling reliable real-time structural detection from 3D LiDAR on resource-constrained robotic platforms that cannot rely on GPU-intensive processing.
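The projection step described in this abstract, from 3D LiDAR points to a 2D BEV image, can be sketched as a simple occupancy-count grid. This is a minimal illustration under assumed conventions, not the paper's pipeline: the function name, the axis ranges, the cell resolution, and the choice to count points per cell are all hypothetical.

```python
def points_to_bev(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), resolution=1.0):
    """Project (x, y, z) points onto a top-down grid of per-cell point counts.

    Rows index y, columns index x; z is dropped by the projection.
    Ranges are half-open [min, max) and given in the sensor frame (assumed).
    """
    cols = int((x_range[1] - x_range[0]) / resolution)
    rows = int((y_range[1] - y_range[0]) / resolution)
    grid = [[0] * cols for _ in range(rows)]
    for x, y, _z in points:
        # Skip points that fall outside the region covered by the image.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        col = int((x - x_range[0]) / resolution)
        row = int((y - y_range[0]) / resolution)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    cloud = [(0.5, -1.5, 0.2), (0.6, -1.4, 1.0), (3.9, 1.9, 0.0), (5.0, 0.0, 0.0)]
    for row in points_to_bev(cloud):
        print(row)
```

Line or box detectors such as those the abstract evaluates would then operate on this 2D grid, typically after scaling the counts to pixel intensities.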
Mixed Integer vs. Continuous Model Predictive Controllers for Binary Thruster Control: A Comparative Study
Binary on/off thrusters are commonly used for spacecraft attitude and position control during proximity operations. However, their discrete nature poses challenges for conventional continuous control methods. The control of these discrete actuators is either explicitly formulated as a mixed-integer optimization problem or handled in a two-layer approach, where a continuous controller's output is converted to binary commands using analog-to-digital modulation techniques such as Delta-Sigma modulation. This paper provides the first systematic comparison between these two paradigms for binary thruster control, contrasting continuous Model Predictive Control (MPC) with Delta-Sigma modulation against direct Mixed-Integer MPC (MIMPC) approaches. Furthermore, we propose a new variant of MPC for binary actuated systems, which is informed using the state of the Delta-Sigma Modulator. The two variations for the continuous MPC along with the MIMPC are evaluated through extensive simulations using ESA's REACSA platform. Results demonstrate that while all approaches perform similarly in high-thrust regimes, MIMPC achieves superior fuel efficiency in low-thrust conditions. Continuous MPC with modulation shows instabilities at higher thrust levels, while binary informed MPC, which incorporates modulator dynamics, improves robustness and reduces the efficiency gap to the MIMPC. It can be seen from the simulated and real-system experiments that MIMPC offers complete stability and fuel efficiency benefits, particularly for resource-constrained missions, while continuous control methods remain attractive for computationally limited applications.
comment: Accepted to CEAS EuroGNC 2026
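To make the two-layer approach in this abstract concrete, the following is a minimal sketch of a first-order Delta-Sigma modulator that turns a continuous thrust command in [0, 1] into on/off pulses. It is a textbook form of the technique and only an assumption about how such a modulator might look; the function name, the 0.5 threshold, and the normalised command range are not taken from the paper.

```python
def delta_sigma_modulate(commands, threshold=0.5):
    """First-order Delta-Sigma modulation of commands normalised to [0, 1].

    The integrator accumulates the difference between requested and
    delivered thrust, so the running average of the binary output tracks
    the running average of the continuous command.
    """
    integrator = 0.0
    bits = []
    for u in commands:
        integrator += u                      # add the requested thrust
        bit = 1 if integrator >= threshold else 0
        integrator -= bit                    # feed back the delivered thrust
        bits.append(bit)
    return bits


if __name__ == "__main__":
    # A constant 25% command yields one firing in every four steps.
    print(delta_sigma_modulate([0.25] * 8))
```

The integrator value is the modulator state; a controller that is informed by the modulator, as the abstract describes, would presumably take a quantity like this into account when planning.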
Generalized Task-Driven Design of Soft Robots via Reduced-Order FEM-based Surrogate Modeling
Task-driven design of soft robots requires models that are physically accurate and computationally efficient, while remaining transferable across actuator designs and task scenarios. However, existing modeling approaches typically face a fundamental trade-off between physical fidelity and computational efficiency, which limits model reuse across design and task variations and constrains scalable task-driven optimization. This paper presents a unified reduced-order finite element method (FEM)-based surrogate modeling pipeline for generalized task-driven soft robot design. High-fidelity FEM simulations characterize actuator behavior at the modular level, from which compact surrogate joint models are constructed for evaluation within a pseudo-rigid body model (PRBM). A meta-model maps actuator design parameters to surrogate representations, enabling rapid instantiation across a parameterized actuator family. The resulting models are embedded into a PRBM-based simulation environment, supporting task-level simulation and optimization under realistic physical constraints. The proposed pipeline is validated through sim-to-real transfer across multiple actuator types, including bellow-type pneumatic actuators and a tendon-driven soft finger, as well as two task-driven design studies: soft gripper co-design via Reinforcement Learning (RL) and 3D actuator shape matching via evolutionary optimization. The results demonstrate high accuracy, efficiency, and reliable reuse, providing a scalable foundation for autonomous task-driven soft robot design.
Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis
Equipping humanoid robots with versatile interaction skills typically requires either extensive policy training or explicit human-to-robot motion retargeting. However, learning-based policies face prohibitive data collection costs. Meanwhile, retargeting relies on human-centric pose estimation (e.g., SMPL), introducing a morphology gap. Skeletal scale mismatches result in severe spatial misalignments when mapped to robots, compromising interaction success. In this work, we propose Dream2Act, a robot-centric framework enabling zero-shot interaction through generative video synthesis. Given a third-person image of the robot and target object, our framework leverages video generation models to envision the robot completing the task with morphology-consistent motion. We employ a high-fidelity pose extraction system to recover physically feasible, robot-native joint trajectories from these synthesized dreams, subsequently executed via a general-purpose whole-body controller. Operating strictly within the robot-native coordinate space, Dream2Act avoids retargeting errors and eliminates task-specific policy training. We evaluate Dream2Act on the Unitree G1 across four whole-body mobile interaction tasks: ball kicking, sofa sitting, bag punching, and box hugging. Dream2Act achieves a 37.5% overall success rate, compared to 0% for conventional retargeting. While retargeting fails to establish correct physical contacts due to the morphology gap (with errors compounded during locomotion), Dream2Act maintains robot-consistent spatial alignment, enabling reliable contact formation and substantially higher task completion.
DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving
Recently, world models have been incorporated into autonomous driving systems to improve planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory-conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions. By adopting the rectified flow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be available at https://github.com/xiaolul2/DynFlowDrive.
comment: 18 pages, 6 figs
Legged Autonomous Surface Science In Analogue Environments (LASSIE): Making Every Robotic Step Count in Planetary Exploration
The ability to efficiently and effectively explore planetary surfaces is currently limited by the capability of wheeled rovers to traverse challenging terrains, and by pre-programmed data acquisition plans with limited in-situ flexibility. In this paper, we present two novel approaches to address these limitations: (i) high-mobility legged robots that use direct surface interactions to collect rich information about the terrain's mechanics to guide exploration; (ii) human-inspired data acquisition algorithms that enable robots to reason about scientific hypotheses and adapt exploration priorities based on incoming ground-sensing measurements. We successfully verify our approach through lab work and field deployments in two planetary analog environments. The new capability for legged robots to measure soil mechanical properties is shown to enable effective traversal of challenging terrains. When coupled with other geologic properties (e.g., composition, thermal properties, and grain size data), soil mechanical measurements reveal key factors governing the formation and development of geologic environments. We then demonstrate how human-inspired algorithms turn terrain-sensing robots into teammates, by supporting more flexible and adaptive data collection decisions with human scientists. Our approach therefore enables exploration of a wider range of planetary environments and new substrate investigation opportunities through integrated human-robot systems that support maximum scientific return.
Accurate Open-Loop Control of a Soft Continuum Robot Through Visually Learned Latent Representations
This work addresses open-loop control of a soft continuum robot (SCR) from video-learned latent dynamics. Visual Oscillator Networks (VONs) from previous work are used, which provide mechanistically interpretable 2D oscillator latents through an attention broadcast decoder (ABCD). Open-loop, single-shooting optimal control is performed in latent space to track image-specified waypoints without camera feedback. An interactive SCR live simulator enables design of static, dynamic, and extrapolated targets and maps them to model-specific latent waypoints. On a two-segment pneumatic SCR, Koopman, MLP, and oscillator dynamics, each with and without ABCD, are evaluated on setpoint and dynamic trajectories. ABCD-based models consistently reduce image-space tracking error. The VON and ABCD-based Koopman models attain the lowest MSEs. Using an ablation study, we demonstrate that several architecture choices and training settings contribute to the open-loop control performance. Simulation stress tests further confirm static holding, stable extrapolated equilibria, and plausible relaxation to the rest state. To the best of our knowledge, this is the first demonstration that interpretable, video-learned latent dynamics enable reliable long-horizon open-loop control of an SCR.
ContractionPPO: Certified Reinforcement Learning via Differentiable Contraction Layers
Legged locomotion in unstructured environments demands not only high-performance control policies but also formal guarantees to ensure robustness under perturbations. Control methods often require carefully designed reference trajectories, which are challenging to construct in high-dimensional, contact-rich systems such as quadruped robots. In contrast, Reinforcement Learning (RL) directly learns policies that implicitly generate motion, and uniquely benefits from access to privileged information, such as full state and dynamics during training, that is not available at deployment. We present ContractionPPO, a framework for certified robust planning and control of legged robots by augmenting Proximal Policy Optimization (PPO) RL with a state-dependent contraction metric layer. This approach enables the policy to maximize performance while simultaneously producing a contraction metric that certifies incremental exponential stability of the simulated closed-loop system. The metric is parameterized as a Lipschitz neural network and trained jointly with the policy, either in parallel or as an auxiliary head of the PPO backbone. While the contraction metric is not deployed during real-world execution, we derive upper bounds on the worst-case contraction rate and show that these bounds ensure the learned contraction metric generalizes from simulation to real-world deployment. Our hardware experiments on quadruped locomotion demonstrate that ContractionPPO enables robust, certifiably stable control even under strong external perturbations.
comment: Accepted to RA-L journal
LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment
We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines by a large margin in both cross-scene and dense urban scenarios. The project is available at https://nudt-sawlab.github.io/LoD-Locv3/.
CeRLP: A Cross-embodiment Robot Local Planning Framework for Visual Navigation
Visual navigation for cross-embodiment robots is challenging due to variations in robot and camera configurations, which can lead to the failure of navigation tasks. Previous approaches typically rely on collecting massive datasets across different robots, which is highly data-intensive, or fine-tuning models, which is time-consuming. Furthermore, both methods often lack explicit consideration of robot geometry. In this paper, we propose a Cross-embodiment Robot Local Planning (CeRLP) framework for general visual navigation, which abstracts visual information into a unified geometric formulation and applies to heterogeneous robots with varying physical dimensions, camera parameters, and camera types. CeRLP introduces a depth estimation scale correction method that utilizes offline pre-calibration to resolve the scale ambiguity of monocular depth estimation, thereby recovering precise metric depth images. Furthermore, CeRLP designs a visual-to-scan abstraction module that projects varying visual inputs into height-adaptive laser scans, making the policy robust to heterogeneous robots. Experiments in simulation environments demonstrate that CeRLP outperforms comparative methods, validating its robust obstacle avoidance capabilities as a local planner. Additionally, extensive real-world experiments verify the effectiveness of CeRLP in tasks such as point-to-point navigation and vision-language navigation, demonstrating its generalization across varying robot and camera configurations.
Evolving Embodied Intelligence: Graph Neural Network--Driven Co-Design of Morphology and Control in Soft Robotics
The intelligent behavior of robots does not emerge solely from control systems, but from the tight coupling between body and brain, a principle known as embodied intelligence. Designing soft robots that leverage this interaction remains a significant challenge, particularly when morphology and control require simultaneous optimization. A significant obstacle in this co-design process is that morphological evolution can disrupt learned control strategies, making it difficult to reuse or adapt existing knowledge. We address this by developing a Graph Neural Network-based approach for the co-design of morphology and controller. Each robot is represented as a graph, with a graph attention network (GAT) encoding node features and a pooled representation passed through a multilayer perceptron (MLP) head to produce actuator commands or value estimates. During evolution, inheritance follows a topology-consistent mapping: shared GAT layers are reused, MLP hidden layers are transferred intact, matched actuator outputs are copied, and unmatched ones are randomly initialized and fine-tuned. This morphology-aware policy class lets the controller adapt when the body mutates. On the benchmark, our GAT-based approach achieves higher final fitness and stronger adaptability to morphological variations compared to traditional MLP-only co-design methods. These results indicate that graph-structured policies provide a more effective interface between evolving morphologies and control for embodied intelligence.
Zero Shot Deformation Reconstruction for Soft Robots Using a Flexible Sensor Array and Cage Based 3D Gaussian Modeling
We present a zero-shot deformation reconstruction framework for soft robots that operates without any visual supervision at inference time. In this work, zero-shot deformation reconstruction is defined as the ability to infer object-wide deformations on previously unseen soft robots without collecting object-specific deformation data or performing any retraining during deployment. Our method assumes access to a static geometric proxy of the undeformed object, which can be obtained from an STL model. During operation, the system relies exclusively on tactile sensing, enabling camera-free deformation inference. The proposed framework integrates a flexible piezoresistive sensor array with a geometry-aware, cage-based 3D Gaussian deformation model. Local tactile measurements are mapped to low-dimensional cage control signals and propagated to dense Gaussian primitives to generate globally consistent shape deformations. A graph attention network regresses cage displacements from tactile input, enforcing spatial smoothness and structural continuity via boundary-aware propagation. Given only a nominal geometric proxy and real-time tactile signals, the system performs zero-shot deformation reconstruction of unseen soft robots in bending and twisting motions, while rendering photorealistic RGB in real time. It achieves 0.67 IoU, 0.65 SSIM, and 3.48 mm Chamfer distance, demonstrating strong zero-shot generalization through explicit coupling of tactile sensing and structured geometric deformation.
Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion
Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models, achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC while using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.
comment: Accepted to IEEE Intelligent Vehicles Symposium (IV) 2026. 8 pages, 3 figures
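The Mahalanobis-based selective prediction mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a single Gaussian fitted to training features and a quantile threshold that sets the coverage level; the function names, the 4-dimensional features, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_gaussian(features):
    """Estimate the mean and inverse covariance of training features (rows = samples)."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    # A small ridge term keeps the covariance invertible (an assumption of this sketch).
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis(features, mean, cov_inv):
    """Mahalanobis distance of each row of `features` from the fitted Gaussian."""
    diff = features - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

def select_by_coverage(scores, coverage):
    """Boolean mask keeping the `coverage` fraction of samples with the lowest distance."""
    threshold = np.quantile(scores, coverage)
    return scores <= threshold

# Synthetic data: 80 in-distribution test samples followed by 20 shifted ones.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))
test = np.vstack([rng.normal(size=(80, 4)), rng.normal(loc=6.0, size=(20, 4))])

mean, cov_inv = fit_gaussian(train)
scores = mahalanobis(test, mean, cov_inv)
keep = select_by_coverage(scores, coverage=0.8)  # predict only on the kept samples
```

In this toy setting the shifted samples receive the largest distances, so abstaining on the top 20% of scores removes them from the evaluated set; how much accuracy this recovers on real data depends on the model and dataset.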
MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms
Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.
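As a concrete instance of finite-horizon minimum-energy control for a linear time-invariant system, the sketch below uses a double integrator, for which the state-transition matrix and controllability Gramian have closed forms. It shows the classical result that the control over an interval is parameterized by one constant coefficient vector; the system choice, function names, and Euler check are assumptions of this sketch, and nothing here reproduces the paper's learned model or training objective.

```python
import numpy as np

def min_energy_coefficient(x0, xT, T):
    """Coefficient c with u(t) = [T - t, 1] @ c for the double integrator
    (position, velocity), steering x0 to xT over [0, T] with minimum energy."""
    Phi = np.array([[1.0, T], [0.0, 1.0]])            # state transition e^{A T}
    W = np.array([[T**3 / 3.0, T**2 / 2.0],           # controllability Gramian
                  [T**2 / 2.0, T]])
    return np.linalg.solve(W, xT - Phi @ x0)

def control(t, T, c):
    """Minimum-energy input u(t) = B^T e^{A^T (T - t)} c."""
    return float(np.array([T - t, 1.0]) @ c)

def simulate(x0, T, c, steps=10000):
    """Forward-Euler integration of the double integrator under the control above."""
    dt = T / steps
    pos, vel = float(x0[0]), float(x0[1])
    for k in range(steps):
        u = control(k * dt, T, c)
        pos, vel = pos + dt * vel, vel + dt * u
    return np.array([pos, vel])

x0 = np.array([0.0, 0.0])
xT = np.array([1.0, 0.0])
c = min_energy_coefficient(x0, xT, T=1.0)
x_end = simulate(x0, 1.0, c)  # should land close to xT, up to Euler error
```

For this rest-to-rest move the coefficient works out to c = (12, -6), i.e. u(t) = 6 - 12t, which can be checked by integrating twice by hand.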
IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning
Although robot-to-robot (R2R) communication improves indoor scene understanding beyond what a single robot can achieve, R2R alone cannot overcome partial observability without substantial exploration overhead or scaling team size. In contrast, many indoor environments already include low-cost Internet of Things (IoT) sensors (e.g., cameras) that provide persistent, building-wide context beyond onboard perception. We therefore introduce IndoorR2X, the first benchmark and simulation framework for Large Language Model (LLM)-driven multi-robot task planning with Robot-to-Everything (R2X) perception and communication in indoor environments. IndoorR2X integrates observations from mobile robots and static IoT devices to construct a global semantic state that supports scalable scene understanding, reduces redundant exploration, and enables high-level coordination through LLM-based planning. IndoorR2X provides configurable simulation environments, sensor layouts, robot teams, and task suites to systematically evaluate high-level semantic coordination strategies. Extensive experiments across diverse settings demonstrate that IoT-augmented world modeling improves multi-robot efficiency and reliability, and we highlight key insights and failure modes for advancing LLM-based collaboration between robot teams and indoor IoT sensors.
The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning ICRA 2026
Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a `human-like social critic.' CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot's description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot's structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot's autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner-critic/
comment: Accepted to ICRA 2026. 8 pages, 9 figures, Project page: https://limjiyu99.github.io/inner-critic/
HortiMulti: A Multi-Sensor Dataset for Localisation and Mapping in Horticultural Polytunnels
Agricultural robotics is gaining increasing relevance in both research and real-world deployment. As these systems are expected to operate autonomously in more complex tasks, the availability of representative real-world datasets becomes essential. While domains such as urban and forestry robotics benefit from large and established benchmarks, horticultural environments remain comparatively under-explored despite the economic significance of this sector. To address this gap, we present HortiMulti, a multimodal, cross-season dataset collected in commercial strawberry and raspberry polytunnels across an entire growing season, capturing substantial appearance variation, dynamic foliage, specular reflections from plastic covers, severe perceptual aliasing, and GNSS-unreliable conditions, all of which directly degrade existing localisation and perception algorithms. The sensor suite includes two 3D LiDARs, four RGB cameras, an IMU, GNSS, and wheel odometry. Ground truth trajectories are derived from a combination of Total Station surveying, AprilTag fiducial markers, and LiDAR-inertial odometry, spanning dense, sparse, and marker-free coverage to support evaluation under both controlled and realistic conditions. We release time-synchronised raw measurements, calibration files, reference trajectories, and baseline benchmarks for visual, LiDAR, and multi-sensor SLAM, with results confirming that current state-of-the-art methods remain inadequate for reliable polytunnel deployment, establishing HortiMulti as a one-stop resource for developing and testing robotic perception systems in horticulture environments.
AGILE: A Comprehensive Workflow for Humanoid Loco-Manipulation Learning
Recent advances in reinforcement learning (RL) have enabled impressive humanoid behaviors in simulation, yet transferring these results to new robots remains challenging. In many real deployments, the primary bottleneck is no longer simulation throughput or algorithm design, but the absence of systematic infrastructure that links environment verification, training, evaluation, and deployment in a coherent loop. To address this gap, we present AGILE, an end-to-end workflow for humanoid RL that standardizes the policy-development lifecycle to mitigate common sim-to-real failure modes. AGILE comprises four stages: (1) interactive environment verification, (2) reproducible training, (3) unified evaluation, and (4) descriptor-driven deployment via robot/task configuration descriptors. In the evaluation stage, AGILE supports both scenario-based tests and randomized rollouts under a shared suite of motion-quality diagnostics, enabling automated regression testing and principled robustness assessment. In the training stage, AGILE also incorporates a set of training stabilizations and algorithmic enhancements to improve optimization stability and sim-to-real transfer. With this pipeline in place, we validate AGILE across five representative humanoid skills spanning locomotion, recovery, motion imitation, and loco-manipulation on two hardware platforms (Unitree G1 and Booster T1), achieving consistent sim-to-real transfer. Overall, AGILE shows that a standardized, end-to-end workflow can substantially improve the reliability and reproducibility of humanoid RL development.
KUKAloha: A General, Low-Cost, and Shared-Control based Teleoperation Framework for Construction Robot Arm
This paper presents KUKAloha, a general, low-cost, and shared-control teleoperation framework designed for construction robot arms. The proposed system employs a leader-follower paradigm in which a lightweight leading arm enables intuitive human guidance for coarse robot motion, while an autonomous perception module based on AprilTag detection performs precise alignment and grasp execution. By explicitly decoupling human control from fine manipulation, KUKAloha improves safety and repeatability when operating large-scale manipulators. We implement the framework on a KUKA robot arm and conduct a usability study with representative construction manipulation tasks. Experimental results demonstrate that KUKAloha reduces operator workload, improves task completion efficiency, and provides a practical solution for scalable demonstration collection and shared human-robot control in construction environments.
comment: 9 pages, 4 figures, 1 table
Not an Obstacle for Dog, but a Hazard for Human: A Co-Ego Navigation System for Guide Dog Robots
Guide dogs offer independence to Blind and Low-Vision (BLV) individuals, yet their limited availability leaves the vast majority of BLV users without access. Quadruped robotic guide dogs present a promising alternative, but existing systems rely solely on the robot's ground-level sensors for navigation, overlooking a critical class of hazards: obstacles that are transparent to the robot yet dangerous at human body height, such as bent branches. We term this the viewpoint asymmetry problem and present the first system to explicitly address it. Our Co-Ego system adopts a dual-branch obstacle avoidance framework that integrates the robot-centric ground sensing with the user's elevated egocentric perspective to ensure comprehensive navigation safety. Deployed on a quadruped robot, the system is evaluated in a controlled user study with sighted participants under blindfold across three conditions: unassisted, single-view, and cross-view fusion. Results demonstrate that cross-view fusion significantly reduces collision times and cognitive load, verifying the necessity of viewpoint complementarity for safe robotic guide dog navigation.
Spectral Alignment in Forward-Backward Representations via Temporal Abstraction
Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts as a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.
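The low-pass effect described above rests on a basic fact: if a transition matrix P has eigenvalues λ, its k-step version P^k has eigenvalues λ^k, so every mode with |λ| < 1 shrinks as k grows. The sketch below illustrates this on a lazy random walk on a ring, chosen here only because it is symmetric with real eigenvalues. The environment, the 0.1 threshold, and the function names are assumptions of this sketch; it covers only the spectrum of the k-step operator, not the paper's successor-representation analysis or its error bound.

```python
import numpy as np

def lazy_ring_walk(n):
    """Transition matrix of a lazy random walk on a ring with n states:
    stay with probability 0.5, step left or right with probability 0.25 each."""
    P = 0.5 * np.eye(n)
    for i in range(n):
        P[i, (i - 1) % n] += 0.25
        P[i, (i + 1) % n] += 0.25
    return P

def count_significant_modes(P, k, threshold=0.1):
    """Number of eigenvalues of P^k whose magnitude is at least `threshold`.
    Uses the fact that the eigenvalues of P^k are the k-th powers of those of P."""
    eigvals = np.linalg.eigvalsh(P)  # valid because this P is symmetric
    return int(np.sum(np.abs(eigvals) ** k >= threshold))

P = lazy_ring_walk(64)
# Count the surviving modes for 1-step, 4-step, and 16-step transitions.
modes = {k: count_significant_modes(P, k) for k in (1, 4, 16)}
```

The count of significant modes falls as k increases, which is the sense in which a longer temporal step leaves a lower effective rank for a low-rank factorization to capture.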
A Unified Platform and Quality Assurance Framework for 3D Ultrasound Reconstruction with Robotic, Optical, and Electromagnetic Tracking
Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.
comment: This work has been submitted to the IEEE for possible publication
Uncertainty Matters: Structured Probabilistic Online Mapping for Motion Prediction in Autonomous Driving
Online map generation and trajectory prediction are critical components of the autonomous driving perception-prediction-planning pipeline. While modern vectorized mapping models achieve high geometric accuracy, they typically treat map estimation as a deterministic task, discarding structural uncertainty. Existing probabilistic approaches often rely on diagonal covariance matrices, which assume independence between points and fail to capture the strong spatial correlations inherent in road geometry. To address this, we propose a structured probabilistic formulation for online map generation. Our method explicitly models intra-element dependencies by predicting a dense covariance matrix, parameterized via a Low-Rank plus Diagonal (LRPD) covariance decomposition. This formulation represents uncertainty as a combination of a low-rank component, which captures global spatial structure, and a diagonal component representing independent local noise, thereby capturing geometric correlations without the prohibitive computational cost of full covariance matrices. Evaluations on the nuScenes dataset demonstrate that our uncertainty-aware framework yields consistent improvements in online map generation quality compared to deterministic baselines. Furthermore, our approach establishes new state-of-the-art performance for map-based motion prediction, highlighting the critical role of uncertainty in planning tasks. Code is published under link-available-soon.
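A low-rank plus diagonal covariance has the form Σ = U Uᵀ + diag(d), with U of shape n × r and d > 0. The sketch below shows why this is cheap: the Gaussian negative log-likelihood can be evaluated with the Woodbury identity and the matrix determinant lemma, using only an r × r solve instead of an n × n one. The symbol names, shapes, and the choice of a Gaussian likelihood are assumptions of this sketch; the abstract does not specify the paper's exact parameterization or loss.

```python
import numpy as np

def lrpd_covariance(U, d):
    """Dense covariance Sigma = U U^T + diag(d), built only for reference."""
    return U @ U.T + np.diag(d)

def lrpd_nll(x, mu, U, d):
    """Gaussian negative log-likelihood under Sigma = U U^T + diag(d),
    computed without forming or inverting the n x n matrix Sigma."""
    n, r = U.shape
    diff = x - mu
    dinv_diff = diff / d                      # D^{-1} (x - mu)
    dinv_U = U / d[:, None]                   # D^{-1} U
    cap = np.eye(r) + U.T @ dinv_U            # r x r capacitance matrix
    proj = U.T @ dinv_diff                    # U^T D^{-1} (x - mu)
    # Woodbury identity for the quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    maha = diff @ dinv_diff - proj @ np.linalg.solve(cap, proj)
    # Matrix determinant lemma: log|Sigma| = log|cap| + sum(log d).
    logdet = np.linalg.slogdet(cap)[1] + np.sum(np.log(d))
    return 0.5 * (maha + logdet + n * np.log(2.0 * np.pi))

# Hypothetical example: 10 coordinates of one map element, rank-2 correlation.
rng = np.random.default_rng(1)
U = rng.normal(size=(10, 2))
d = rng.uniform(0.1, 1.0, size=10)
x = rng.normal(size=10)
mu = np.zeros(10)
nll = lrpd_nll(x, mu, U, d)
```

Setting U to zero recovers the diagonal-covariance case, so the low-rank factor is what carries the correlations between points of the same element.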
Multi-Robot Learning-Informed Task Planning Under Uncertainty ICRA 2026
We want a multi-robot team to complete complex tasks in minimum time where the locations of task-relevant objects are not known. Effective task completion requires reasoning over long horizons about the likely locations of task-relevant objects, how individual actions contribute to overall progress, and how to coordinate team efforts. Planning in this setting is extremely challenging: even when task-relevant information is partially known, coordinating which robot performs which action and when is difficult, and uncertainty introduces a multiplicity of possible outcomes for each action, which further complicates long-horizon decision-making and coordination. To address this, we propose a multi-robot planning abstraction that integrates learning to estimate uncertain aspects of the environment with model-based planning for long-horizon coordination. We demonstrate the efficient multi-stage task planning of our approach for 1, 2, and 3 robot teams over competitive baselines in large ProcTHOR household environments. Additionally, we demonstrate the effectiveness of our approach with a team of two LoCoBot mobile robots in real household settings.
comment: 8 pages, 8 figures. Accepted at ICRA 2026
Memory Over Maps: 3D Object Localization Without Reconstruction
Target localization is a prerequisite for embodied tasks such as navigation and manipulation. Conventional approaches rely on constructing explicit 3D scene representations to enable target localization, such as point clouds, voxel grids, or scene graphs. While effective, these pipelines incur substantial mapping time, storage overhead, and scalability limitations. Recent advances in vision-language models suggest that rich semantic reasoning can be performed directly on 2D observations, raising a fundamental question: is a complete 3D scene reconstruction necessary for object localization? In this work, we revisit object localization and propose a map-free pipeline that stores only posed RGB-D keyframes as a lightweight visual memory--without constructing any global 3D representation of the scene. At query time, our method retrieves candidate views, re-ranks them with a vision-language model, and constructs a sparse, on-demand 3D estimate of the queried target through depth backprojection and multi-view fusion. Compared to reconstruction-based pipelines, this design drastically reduces preprocessing cost, enabling scene indexing that is over two orders of magnitude faster to build while using substantially less storage. We further validate the localized targets on downstream object-goal navigation tasks. Despite requiring no task-specific training, our approach achieves strong performance across multiple benchmarks, demonstrating that direct reasoning over image-based scene memory can effectively replace dense 3D reconstruction for object-centric robot navigation. Project page: https://ruizhou-cn.github.io/memory-over-maps/
comment: 8 pages, 6 figures
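The depth backprojection and multi-view fusion step mentioned in the abstract above can be sketched with a standard pinhole camera model. The sketch assumes known intrinsics K, a camera-to-world pose per keyframe, and a per-coordinate median as the fusion rule; the function names, the fusion rule, and the numbers are hypothetical, since the abstract does not describe the paper's exact procedure.

```python
import numpy as np

def backproject(u, v, depth, K, T_world_cam):
    """Lift pixel (u, v) with metric depth to a 3D point in the world frame.
    K is the 3x3 intrinsic matrix; T_world_cam is the 4x4 camera-to-world pose."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    return (T_world_cam @ p_cam)[:3]

def fuse(points):
    """Combine per-view 3D estimates with a per-coordinate median,
    which tolerates an occasional outlier view."""
    return np.median(np.asarray(points), axis=0)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# View 1: camera at the world origin, target at the image centre, 2 m away.
pose_1 = np.eye(4)
# View 2: camera shifted 1 m along world x, so the same target appears at u = 70.
pose_2 = np.eye(4)
pose_2[0, 3] = 1.0

estimates = [
    backproject(320.0, 240.0, 2.0, K, pose_1),
    backproject(70.0, 240.0, 2.0, K, pose_2),
]
target = fuse(estimates)
```

Both views place the target at (0, 0, 2) in the world frame, so the fused estimate is the same point; with noisy depth, the per-view estimates would differ and the median would sit between them.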
High-Speed, All-Terrain Autonomy: Ensuring Safety at the Limits of Mobility
A novel local trajectory planner, capable of controlling an autonomous off-road vehicle on rugged terrain at high speed, is presented. Autonomous vehicles are currently unable to safely operate off-road at high speed, as current approaches either fail to predict and mitigate rollovers induced by rough terrain or are not real-time feasible. To address this challenge, a novel model predictive control (MPC) formulation is developed for local trajectory planning. A new dynamics model for off-road vehicles on rough, non-planar terrain is derived and used for prediction. Extreme mobility, including tire liftoff without rollover, is safely enabled through a new energy-based constraint. The formulation is analytically shown to mitigate rollover types ignored by many state-of-the-art methods, and real-time feasibility is achieved through parallelized GPGPU computation. The planner's ability to provide safe, extreme trajectories is studied through both simulated trials and full-scale physical experiments. The results demonstrate fewer rollovers and more successes compared to a state-of-the-art baseline across several challenging scenarios that push the vehicle to its mobility limits.
comment: 19 pages, 16 figures, submitted to IEEE Transactions on Robotics
An Open Source Computer Vision and Machine Learning Framework for Affordable Life Science Robotic Automation
We present an open-source robotic framework that integrates computer vision and machine learning based inverse kinematics to enable low-cost laboratory automation tasks such as colony picking and liquid handling. The system uses a custom-trained U-Net model for semantic segmentation of microbial cultures, combined with a Mixture Density Network for predicting joint angles of a simple 5-DOF robot arm. We evaluated the framework using a modified robot arm, upgraded with a custom liquid handling end-effector. Experimental results demonstrate the framework's feasibility for precise, repeatable operations, with mean positional error below 1 mm, joint angle prediction errors below 4 degrees, and colony detection with an IoU score of 0.537 and a Dice coefficient of 0.596.
TRGS-SLAM: IMU-Aided Gaussian Splatting SLAM for Blurry, Rolling Shutter, and Noisy Thermal Images
Thermal cameras offer several advantages for simultaneous localization and mapping (SLAM) with mobile robots: they provide a passive, low-power solution to operating in darkness, are invariant to rapidly changing or high dynamic range illumination, and can see through fog, dust, and smoke. However, uncooled microbolometer thermal cameras, the only practical option in most robotics applications, suffer from significant motion blur, rolling shutter distortions, and fixed pattern noise. In this paper, we present TRGS-SLAM, a 3D Gaussian Splatting (3DGS) based thermal inertial SLAM system uniquely capable of handling these degradations. To overcome the challenges of thermal data, we introduce a model-aware 3DGS rendering method and several general innovations to 3DGS SLAM, including B-spline trajectory optimization with a two-stage IMU loss, view-diversity-based opacity resetting, and pose drift correction schemes. Our system demonstrates accurate tracking on real-world, fast motion, and high-noise thermal data that causes all other tested SLAM methods to fail. Moreover, through offline refinement of our SLAM results, we demonstrate thermal image restoration competitive with prior work that required ground truth poses.
comment: Project page: https://umautobots.github.io/trgs_slam
Scene Representation using 360° Saliency Graph and its Application in Vision-based Indoor Navigation
A scene, represented visually in formats such as RGB-D, LiDAR scans, keypoints, or rectangular, spherical, and multi-view images, only implicitly embeds the information relevant to applications such as scene indexing and vision-based navigation. Thus, these representations may not be efficient for such applications. This paper proposes a novel 360° saliency graph representation of scenes. This rich representation explicitly encodes the relevant visual, contextual, semantic, and geometric information of the scene as nodes, edges, edge weights, and angular position in the 360° graph. The representation is also robust to scene view changes and addresses challenges of indoor environments, such as varied illumination, occlusions, and shadows, that existing traditional methods face. We use this rich and efficient representation for vision-based navigation and compare it with existing navigation methods that use 360° scenes; these existing methods are limited by poor scene representations that lack scene-specific information. This work uses the proposed representation first to localize the query scene in the given topological map, and then to facilitate 2D navigation by estimating the next required movement directions towards the target destination from the geometric information embedded in the 360° saliency graph. Experimental results demonstrate the efficacy of the proposed 360° saliency graph representation in enhancing both scene localization and vision-based indoor navigation.
Data Analogies Enable Efficient Cross-Embodiment Transfer
Generalist robot policies are trained on demonstrations collected across a wide variety of robots, scenes, and viewpoints. Yet it remains unclear how to best organize and scale such heterogeneous data so that it genuinely improves performance in a given target setting. In this work, we ask: what form of demonstration data is most useful for enabling transfer across robot set-ups? We conduct controlled experiments that vary end-effector morphology, robot platform appearance, and camera perspective, and compare the effects of simply scaling the number of demonstrations against systematically broadening the diversity in different ways. Our simulated experiments show that while perceptual shifts such as viewpoint benefit most from broad diversity, morphology shifts benefit far less from unstructured diversity and instead see the largest gains from data analogies, i.e. paired demonstrations that align scenes, tasks, and/or trajectories across different embodiments. Informed by the simulation results, we improve real-world cross-embodiment transfer success by an average of $22.5\%$ over large-scale, unpaired datasets by changing only the composition of the data.
comment: 14 pages, 11 Figures, 6 Tables
CoInfra: A Large-Scale Cooperative Infrastructure Perception System and Dataset for Vehicle-Infrastructure Cooperation in Adverse Weather
Vehicle-infrastructure (V2I) cooperative perception can substantially extend the range, coverage, and robustness of autonomous driving systems beyond the limits of onboard-only sensing, particularly in occluded and adverse-weather environments. However, its practical value is still difficult to quantify because existing benchmarks do not adequately capture large-scale multi-node deployments, realistic communication conditions, and adverse-weather operation. This paper presents CoInfra, a deployable cooperative infrastructure perception platform comprising 14 roadside sensor nodes connected through a commercial 5G network, together with a large-scale dataset and an open-source system stack for V2I cooperation research. The system supports synchronized multi-node sensing and delay-aware fusion under real 5G communication constraints. The released dataset covers an eight-node urban roundabout under four weather conditions (sunny, rainy, heavy snow, and freezing rain) and contains 294k LiDAR frames, 589k camera images, and 332k globally consistent 3D bounding boxes. It also includes a synchronized V2I subset collected with an autonomous vehicle. Beyond standard perception benchmarks, we further evaluate whether infrastructure sensing improves awareness of safety-critical traffic participants during roundabout interactions. In structured conflict scenarios, V2I cooperation increases critical-frame completeness from 33%-46% with vehicle-only sensing to 86%-100%. These results show that multi-node infrastructure perception can significantly improve situational awareness in conflict-rich traffic scenarios where vehicle-only sensing is most limited.
comment: This paper has been submitted to the Transportation Research Part C: Emerging Technologies for review
FORWARD: Dataset of a forwarder operating in rough terrain
We present FORWARD, a high-resolution multimodal dataset of a cut-to-length forwarder operating in rough terrain on two harvest sites in the middle part of Sweden. The forwarder is a large Komatsu model equipped with vehicle telematics sensors, including global positioning via satellite navigation, movement sensors, accelerometers, and engine sensors. The forwarder was additionally equipped with cameras, operator vibration sensors, and multiple IMUs. The data includes event time logs recorded at 5 Hz of driving speed, fuel consumption, machine position with centimeter accuracy, and crane use while the forwarder operates in forest areas, aerially laser-scanned with a resolution of around 1500 points per square meter. Production log files (Stanford standard) with time-stamped machine events, extensive video material, and terrain data in various formats are included as well. About 18 hours of regular wood extraction work during three days is annotated from 360-video material into individual work elements and included in the dataset. We also include scenario specifications of conducted experiments on forest roads and in terrain. Scenarios include repeatedly driving the same routes with and without steel tracks, different load weights, and different target driving speeds. The dataset is intended for developing models and algorithms for trafficability, perception, and autonomous control of forest machines using artificial intelligence, simulation, and experiments on physical testbeds. In part, we focus on forwarders traversing terrain, avoiding or handling obstacles, and loading or unloading logs, with consideration for efficiency, fuel consumption, safety, and environmental impact. Other benefits of the open dataset include the ability to explore auto-generation and calibration of forestry machine simulators and automation scenario descriptions using the data recorded in the field.
comment: 33 pages, 24 figures
TeleDex: Accessible Dexterous Teleoperation
Despite increasing dataset scale and model capacity, robot manipulation policies still struggle to generalize beyond their training distributions. As a result, deploying state-of-the-art policies in new environments, tasks, or robot embodiments often requires collecting additional demonstrations. Enabling this in real-world deployment settings requires tools that allow users to collect demonstrations quickly, affordably, and with minimal setup. We present TeleDex, an open-source system for intuitive teleoperation of dexterous hands and robotic manipulators using any readily available phone. The system streams low-latency 6-DoF wrist poses and articulated 21-DoF hand state estimates from the phone, which are retargeted to robot arms and multi-fingered hands without requiring external tracking infrastructure. TeleDex supports both a handheld phone-only mode and an optional 3D-printable hand-mounted interface for finger-level teleoperation. By lowering the hardware and setup barriers to dexterous teleoperation, TeleDex enables users to quickly collect demonstrations during deployment to support policy fine-tuning. We evaluate the system across simulation and real-world manipulation tasks, demonstrating its effectiveness as a unified scalable interface for robot teleoperation. All software and hardware designs, along with demonstration videos, are open-source and available at orayyan.com/teledex.
comment: For project website and videos, see https://www.orayyan.com/teledex
DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation CVPR2026
Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding errors problem. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level corrective finetuning strategy. By leveraging geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs in the trusted region while filtering out the polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
comment: 16 pages, 8 figures, CVPR2026
From Vocal Instructions to Household Tasks: The Inria TIAGo++ in the euROBIN Service Robots Coopetition
This paper describes the Inria team's integrated robotics system used in the 1st euROBIN coopetition, during which service robots performed voice-activated household tasks in a kitchen setting. The team developed a modified TIAGo++ platform that leverages a whole-body control stack for autonomous and teleoperated modes, and an LLM-based pipeline for instruction understanding and task planning. The key contributions (open-sourced) are the integration of these components and the design of custom teleoperation devices, addressing practical challenges in the deployment of service robots.
ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion Policy
Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework synergizing standardized perspective reprojection with a structure-aware dual-stream diffusion policy. By coupling the re-projected views with pixel-aligned PointMaps, our dual-stream architecture leverages learnable modality embeddings to fuse frozen semantic features and explicit geometric descriptors, ensuring precise implicit patch-level alignment. Extensive experiments across simulation and real-world environments demonstrate ReMAP-DP's superior performance in diverse manipulation tasks. On RoboTwin 2.0, it attains a 59.3% average success rate, outperforming the DP3 baseline by +6.6%. On ManiSkill 3, our method yields a 28% improvement over DP3 on the geometrically challenging Stack Cube task. Furthermore, ReMAP-DP exhibits remarkable real-world robustness, executing high-precision and dynamic manipulations with superior data efficiency from only a handful of demonstrations. Project page is available at: https://icr-lab.github.io/ReMAP-DP/
comment: fix some typos
Learning Discrete Abstractions for Visual Rearrangement Tasks Using Vision-Guided Graph Coloring
Learning abstractions directly from data is a core challenge in robotics. Humans naturally operate at an abstract level, reasoning over high-level subgoals while delegating execution to low-level motor skills -- an ability that enables efficient problem solving in complex environments. In robotics, abstractions and hierarchical reasoning have long been central to planning, yet they are typically hand-engineered, demanding significant human effort and limiting scalability. Automating the discovery of useful abstractions directly from visual data would make planning frameworks more scalable and more applicable to real-world robotic domains. In this work, we focus on rearrangement tasks where the state is represented with raw images, and propose a method to induce discrete, graph-structured abstractions by combining structural constraints with an attention-guided visual distance. Our approach leverages the inherent bipartite structure of rearrangement problems, integrating structural constraints and visual embeddings into a unified framework. This enables the autonomous discovery of abstractions from vision alone, which can subsequently support high-level planning. We evaluate our method on two rearrangement tasks in simulation and show that it consistently identifies meaningful abstractions that facilitate effective planning and outperform existing approaches.
FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation ICRA 2026
Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.
comment: ICRA 2026 Accepted
Task-Specified Compliance Bounds for Humanoids via Lipschitz-Constrained Policies
Reinforcement learning (RL) has demonstrated substantial potential for humanoid bipedal locomotion and the control of complex motions. To cope with oscillations and impacts induced by environmental interactions, compliant control is widely regarded as an effective remedy. However, the model-free nature of RL makes it difficult to impose task-specified and quantitatively verifiable compliance objectives, and classical model-based stiffness designs are not directly applicable. Lipschitz-Constrained Policies (LCP), which regularize the local sensitivity of a policy via gradient penalties, have recently been used to smooth humanoid motions. Nevertheless, existing LCP-based methods typically employ a single scalar Lipschitz budget and lack an explicit connection to physically meaningful compliance specifications in real-world systems. In this study, we propose an anisotropic Lipschitz-constrained policy (ALCP) that maps a task-space stiffness upper bound to a state-dependent Lipschitz-style constraint on the policy Jacobian. The resulting constraint is enforced during RL training via a hinge-squared spectral-norm penalty, preserving physical interpretability while enabling direction-dependent compliance. Experiments on humanoid robots show that ALCP improves locomotion stability and impact robustness, while reducing oscillations and energy usage.
comment: Submitted to IEEE for possible publication, under review
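The hinge-squared spectral-norm penalty named in the abstract above can be illustrated with a minimal sketch. It assumes the penalty has the form max(0, ||J||_2 - L)^2, where J is the policy Jacobian and L is a Lipschitz-style bound. The function name `hinge_squared_spectral_penalty` and the scalar `bound` argument are hypothetical; the paper's mapping from a task-space stiffness upper bound to a state-dependent, anisotropic bound is not reproduced here.

```python
import numpy as np

def hinge_squared_spectral_penalty(jacobian, bound):
    """Return max(0, ||J||_2 - bound)^2 for a policy Jacobian J.

    `bound` is a hypothetical scalar stand-in for the state-dependent
    constraint described in the abstract.
    """
    # The spectral norm of a matrix is its largest singular value.
    spectral_norm = np.linalg.norm(np.asarray(jacobian, dtype=float), ord=2)
    # Hinge: only the amount by which the norm exceeds the bound is penalized.
    excess = max(0.0, spectral_norm - bound)
    return excess ** 2

# Example Jacobian with singular values 3 and 1.
J = np.diag([3.0, 1.0])
penalty_violated = hinge_squared_spectral_penalty(J, bound=2.0)
penalty_satisfied = hinge_squared_spectral_penalty(J, bound=5.0)
```

Under this form the penalty is zero whenever the Jacobian already satisfies the bound, so it only shapes the policy where the compliance specification is violated.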
SpikeGrasp: A Benchmark for 6-DoF Grasp Pose Detection from Stereo Spike Streams
Most robotic grasping systems rely on converting sensor data into explicit 3D point clouds, which is a computational step not found in biological intelligence. This paper explores a fundamentally different, neuro-inspired paradigm for 6-DoF grasp detection. We introduce SpikeGrasp, a framework that mimics the biological visuomotor pathway, processing raw, asynchronous events from stereo spike cameras, similarly to retinas, to directly infer grasp poses. Our model fuses these stereo spike streams and uses a recurrent spiking neural network, analogous to high-level visual processing, to iteratively refine grasp hypotheses without ever reconstructing a point cloud. To validate this approach, we built a large-scale synthetic benchmark dataset. Experiments show that SpikeGrasp surpasses traditional point-cloud-based baselines, especially in cluttered and textureless scenes, and demonstrates remarkable data efficiency. By establishing the viability of this end-to-end, neuro-inspired approach, SpikeGrasp paves the way for future systems capable of the fluid and efficient manipulation seen in nature, particularly for dynamic objects.
comment: Some real machine experiments need to be supplemented, and the entire paper is incomplete
Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
comment: 14 pages, 6 figures, Proceedings of the Conference on Robot Learning (CoRL 2025)
R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation
A central challenge in image-based Model-Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction-based methods often waste capacity on large task-irrelevant regions. Decoder-free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2-Dreamer, a decoder-free MBRL framework with a self-supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy-reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta-World, R2-Dreamer is competitive with strong baselines such as DreamerV3 and TD-MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC-Subtle with tiny task-relevant objects. These results suggest that an effective internal regularizer can enable versatile, high-performance decoder-free MBRL. Code is available at https://github.com/NM512/r2dreamer.
comment: 20 pages, 12 figures, 2 tables
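For readers unfamiliar with the Barlow Twins objective that the abstract above cites as inspiration, the following is a minimal NumPy sketch of a redundancy-reduction loss in that style. It is not R2-Dreamer's implementation: how the two embeddings are paired inside the world model is not stated in the abstract, and the function name, `off_diag_weight`, and `eps` are illustrative assumptions.

```python
import numpy as np

def redundancy_reduction_loss(z_a, z_b, off_diag_weight=5e-3, eps=1e-8):
    """Barlow Twins-style loss on two batches of embeddings (batch, dim).

    The weight and eps defaults are placeholder values for illustration.
    """
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    # Standardize each feature dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two embeddings, shape (dim, dim).
    c = z_a.T @ z_b / z_a.shape[0]
    diag = np.diag(c)
    # Invariance term: pull diagonal correlations towards 1.
    on_diag = np.sum((diag - 1.0) ** 2)
    # Redundancy term: push off-diagonal correlations towards 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + off_diag_weight * off_diag

# Two decorrelated, unit-variance features over a batch of four samples.
z = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
loss_aligned = redundancy_reduction_loss(z, z)
loss_flipped = redundancy_reduction_loss(z, -z)
```

The loss is near zero when matching features are perfectly correlated and distinct features are uncorrelated, which is how such an objective can discourage representation collapse without relying on a decoder.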
SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations
Ambiguity poses a major challenge to large language models (LLMs) used as robotic planners. In this letter, we present Scene Graph-Chain-of-Thought (SG-CoT), a two-stage framework where LLMs iteratively query a scene graph representation of the environment to detect and clarify ambiguities. First, a structured scene graph representation of the environment is constructed from input observations, capturing objects, their attributes, and relationships with other objects. Second, the LLM is equipped with retrieval functions to query portions of the scene graph that are relevant to the provided instruction. This grounds the reasoning process of the LLM in the observation, increasing the reliability of robotic planners under ambiguous situations. SG-CoT also allows the LLM to identify the source of ambiguity and pose a relevant disambiguation question to the user or another robot. Extensive experimentation demonstrates that SG-CoT consistently outperforms prior methods, with a minimum of 10% improvement in question accuracy and a minimum success rate increase of 4% in single-agent and 15% in multi-agent environments, validating its effectiveness for more generalizable robot planning.
comment: This work has been submitted to the IEEE Robotics and Automation Letters for possible publication
Path Integral Particle Filtering for Hybrid Systems via Saltation Matrices
State estimation for hybrid systems that undergo intermittent contact with their environments, such as extraplanetary robots and satellites undergoing docking operations, is difficult due to the discrete uncertainty propagation during contact. To handle this propagation, this paper presents an optimal-control-based particle filtering method that leverages saltation matrices to map out uncertainty propagation during contact events. By using a path integral filtering framework that exploits the duality between smoothing and optimal control, the resulting state estimation algorithm is robust to outlier effects, flexible to non-Gaussian noise distributions, and handles challenging contact dynamics in hybrid systems. To evaluate the validity and consistency of the proposed approach, this paper tests it against strong baselines on the stochastic dynamics generated by a bouncing ball and a spring-loaded inverted pendulum.
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated vision-language-model-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, including textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
comment: Website: https://robotarenainf.github.io
Latent Action Diffusion for Cross-Embodiment Manipulation ICRA
End-to-end learning is emerging as a powerful paradigm for robotic manipulation, but its effectiveness is limited by data scarcity and the heterogeneity of action spaces across robot embodiments. In particular, diverse action spaces across different end-effectors create barriers for cross-embodiment learning and skill transfer. We address this challenge through diffusion policies learned in a latent action space that unifies diverse end-effector actions. We first show that we can learn a semantically aligned latent action space for anthropomorphic robotic hands, a human hand, and a parallel jaw gripper using encoders trained with a contrastive loss. Second, we show that by using our proposed latent action space for co-training on manipulation data from different end-effectors, we can utilize a single policy for multi-robot control and obtain up to 25.3% improved manipulation success rates, indicating successful skill transfer despite a significant embodiment gap. Our approach using latent cross-embodiment policies presents a new method to unify different action spaces across embodiments, enabling efficient multi-robot control and data sharing across robot setups. This unified representation significantly reduces the need for extensive data collection for each new robot morphology, accelerates generalization across embodiments, and ultimately facilitates more scalable and efficient robotic learning.
comment: 8 pages, 5 figures. Accepted to the 2026 IEEE International Conference on Robotics & Automation (ICRA). Website: https://mimicrobotics.github.io/lad/
Pseudo-Simulation for Autonomous Driving
Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations ($R^2=0.8$) than the best existing open-loop approach ($R^2=0.7$). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.
comment: CoRL 2025, updated with leaderboard snapshot from March 2026
Risk-Bounded Multi-Agent Visual Navigation via Iterative Risk Allocation ICAPS '26
Safe navigation is essential for autonomous systems operating in hazardous environments, especially when multiple agents must coordinate using only high-dimensional visual observations. While recent approaches successfully combine Goal-Conditioned RL (GCRL) for graph construction with Conflict-Based Search (CBS) for planning, they typically rely on deleting edges with high risk before running CBS to enforce safety. This binary strategy is overly conservative, precluding feasible missions that require traversing high-risk regions, even when the aggregate risk is acceptable. To address this, we introduce a framework for Risk-Bounded Multi-Agent Path Finding ($Δ$-MAPF), where agents share a user-specified global risk budget ($Δ$). Rather than permanently discarding edges, our framework dynamically distributes per-agent risk budgets ($δ_i$) during search via an Iterative Risk Allocation (IRA) layer that integrates with a standard CBS planner. We investigate two distribution strategies: a greedy surplus-deficit scheme for rapid feasibility repair, and a market-inspired mechanism that treats risk as a priced resource to guide improved allocation. The market-based mechanism yields a tunable trade-off wherein agents exploit available risk to secure shorter, more efficient paths, but revert to longer, safer detours under tighter budgets. Experiments in complex visual environments show that our dynamic allocation framework achieves higher success rates than baselines and effectively leverages the available safety budget to reduce travel time. Project website can be found at https://rb-visual-mapf-mers.csail.mit.edu
comment: Published at ICAPS '26
Efficient and Reliable Teleoperation through Real-to-Sim-to-Real Shared Autonomy
Fine-grained, contact-rich teleoperation remains slow, error-prone, and unreliable in real-world manipulation tasks, even for experienced operators. Shared autonomy offers a promising way to improve performance by combining human intent with automated assistance, but learning effective assistance in simulation requires a faithful model of human behavior, which is difficult to obtain in practice. We propose a real-to-sim-to-real shared autonomy framework that augments human teleoperation with learned corrective behaviors, using a simple yet effective k-nearest-neighbor (kNN) human surrogate to model operator actions in simulation. The surrogate is fit from less than five minutes of real-world teleoperation data and enables stable training of a residual copilot policy with model-free reinforcement learning. The resulting copilot is deployed to assist human operators in real-world fine-grained manipulation tasks. Through simulation experiments and a user study with sixteen participants on industry-relevant tasks, including nut threading, gear meshing, and peg insertion, we show that our system improves task success for novice operators and execution efficiency for experienced operators compared to direct teleoperation and shared-autonomy baselines that rely on expert priors or behavioral-cloning pilots. In addition, copilot-assisted teleoperation produces higher-quality demonstrations for downstream imitation learning.
comment: Project Page: https://residual-copilot.github.io/
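The k-nearest-neighbor human surrogate mentioned in the abstract above can be sketched as follows. This is an illustrative assumption about its structure, not the authors' code: the class name `KNNHumanSurrogate`, the Euclidean distance metric, and the choice to average the neighbours' actions are all hypothetical.

```python
import numpy as np

class KNNHumanSurrogate:
    """Hypothetical kNN model of operator actions fit from logged teleoperation."""

    def __init__(self, states, actions, k=3):
        # Logged (state, action) pairs from real teleoperation data.
        self.states = np.asarray(states, dtype=float)
        self.actions = np.asarray(actions, dtype=float)
        # Never ask for more neighbours than there are samples.
        self.k = min(k, len(self.states))

    def act(self, state):
        # Euclidean distance from the query state to every logged state.
        dists = np.linalg.norm(self.states - np.asarray(state, dtype=float), axis=1)
        # Indices of the k closest logged states.
        nearest = np.argsort(dists)[: self.k]
        # Assumption: the surrogate action is the mean of the neighbours' actions.
        return self.actions[nearest].mean(axis=0)

# Tiny example: two nearby demonstrations and one distant outlier.
surrogate = KNNHumanSurrogate(
    states=[[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]],
    actions=[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],
    k=2,
)
predicted_action = surrogate.act([0.4, 0.0])
```

A non-parametric model of this kind needs no training beyond storing the data, which fits the abstract's claim that the surrogate is fit from only a few minutes of teleoperation.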
SPOT: Point Cloud Based Stereo Visual Place Recognition for Similar and Opposing Viewpoints ICRA 2024
Recognizing places from an opposing viewpoint during a return trip is a common experience for human drivers. However, the analogous robotics capability, visual place recognition (VPR) with limited field-of-view cameras under 180-degree rotations, has proven to be challenging to achieve. To address this problem, this paper presents Same Place Opposing Trajectory (SPOT), a technique for opposing viewpoint VPR that relies exclusively on structure estimated through stereo visual odometry (VO). The method extends recent advances in lidar descriptors and utilizes a novel double (similar and opposing) distance matrix sequence matching method. We evaluate SPOT on a publicly available dataset with 6.7-7.6 km routes driven in similar and opposing directions under various lighting conditions. The proposed algorithm demonstrates remarkable improvement over the state-of-the-art, achieving up to 91.7% recall at 100% precision in opposing viewpoint cases, while requiring less storage than all baselines tested and running faster than all but one. Moreover, the proposed method assumes no a priori knowledge of whether the viewpoint is similar or opposing, and also demonstrates competitive performance in similar viewpoint cases.
comment: Expanded version with added appendix. Published in ICRA 2024. Project page: https://umautobots.github.io/spot
Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning
Recent work in reinforcement learning has shown that incorporating structural priors for articulated robots, such as link connectivity, into policy networks improves learning efficiency. However, dynamics properties, despite their fundamental role in determining how forces and motion propagate through the body, remain largely underexplored as an inductive bias for policy learning. To address this gap, we present the Articulated-Body Dynamics Network (ABD-Net), a novel graph neural network architecture grounded in the computational structure of forward dynamics. Specifically, we adapt the inertia propagation mechanism from the Articulated Body Algorithm, systematically aggregating inertial quantities from child to parent links in a tree-structured manner, while replacing physical quantities with learnable parameters. Embedding ABD-Net into the policy actor enables dynamics-informed representations that capture how actions propagate through the body, leading to efficient and robust policy learning. Through experiments with simulated humanoid, quadruped, and hopper robots, our approach demonstrates increased sample efficiency and generalization to dynamics shifts compared to transformer-based and GNN baselines. We further validate the learned policy on real Unitree G1 and Go2 robots, state-of-the-art humanoid and quadruped platforms, generating dynamic, versatile and robust locomotion behaviors through sim-to-real transfer with real-time inference.
comment: Arxiv_r2
Multiagent Systems
Beyond detection: cooperative multi-agent reasoning for rapid onboard EO crisis response
Rapid identification of hazardous events is essential for next-generation Earth Observation (EO) missions supporting disaster response. However, current monitoring pipelines remain largely ground-centric, introducing latency due to downlink limitations, multi-source data fusion constraints, and the computational cost of exhaustive scene analysis. This work proposes a hierarchical multi-agent architecture for onboard EO processing under strict resource and bandwidth constraints. The system enables the exploitation of complementary multimodal observations by coordinating specialized AI agents within an event-driven decision pipeline. AI agents can be deployed across multiple nodes in a distributed setting, such as satellite platforms. An Early Warning agent generates fast hypotheses from onboard observations and selectively activates domain-specific analysis agents, while a Decision agent consolidates the evidence to issue a final alert. The architecture combines vision-language models, traditional remote sensing analysis tools, and role-specialized agents to enable structured reasoning over multimodal observations while minimizing unnecessary computation. A proof-of-concept implementation was executed on the engineering model of an edge-computing platform currently deployed in orbit, using representative satellite data. Experiments on wildfire and flood monitoring scenarios show that the proposed routing-based pipeline significantly reduces computational overhead while maintaining coherent decision outputs, demonstrating the feasibility of distributed agent-based reasoning for future autonomous EO constellations.
comment: Accepted for presentation at the ESA's 4S Symposium 2026 Conference (see https://atpi.eventsair.com/4s-symposium-2026/)
Helix: A Dual-Helix Co-Evolutionary Multi-Agent System for Prompt Optimization and Question Reformulation
Automated prompt optimization (APO) aims to improve large language model performance by refining prompt instructions. However, existing methods are largely constrained by fixed prompt templates, limited search spaces, or single-sided optimization that treats user questions as immutable inputs. In practice, question formulation and prompt design are inherently interdependent: clearer question structures facilitate focused reasoning and task understanding, while effective prompts reveal better ways to organize and restate queries. Ignoring this coupling fundamentally limits the effectiveness and adaptability of current APO approaches. We propose a unified multi-agent system (Helix) that jointly optimizes question reformulation and prompt instructions through a structured three-stage co-evolutionary framework. Helix integrates (1) planner-guided decomposition that breaks optimization into coupled question-prompt objectives, (2) dual-track co-evolution where specialized agents iteratively refine and critique each other to produce complementary improvements, and (3) strategy-driven question generation that instantiates high-quality reformulations for robust inference. Extensive experiments on 12 benchmarks against 6 strong baselines demonstrate the effectiveness of Helix, achieving up to 3.95% performance improvements across tasks with favorable optimization efficiency.
comment: under review
A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism raises the success rate (SR) of proprietary models such as Gemini by approximately 10% absolute on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
comment: 50 pages, 15 figures
GoAgent: Group-of-Agents Communication Topology Generation for LLM-based Multi-Agent Systems
Large language model (LLM)-based multi-agent systems (MAS) have demonstrated exceptional capabilities in solving complex tasks, yet their effectiveness depends heavily on the underlying communication topology that coordinates agent interactions. Within these systems, successful problem-solving often necessitates task-specific group structures to divide and conquer subtasks. However, most existing approaches generate communication topologies in a node-centric manner, leaving group structures to emerge implicitly from local connectivity decisions rather than modeling them explicitly, often leading to suboptimal coordination and unnecessary communication overhead. To address this limitation, we propose GoAgent (Group-of-Agents), a communication topology generation method that explicitly treats collaborative groups as the atomic units of MAS construction. Specifically, GoAgent first enumerates task-relevant candidate groups through an LLM and then autoregressively selects and connects these groups as atomic units to construct the final communication graph, jointly capturing intra-group cohesion and inter-group coordination. To mitigate communication redundancy and noise propagation inherent in expanding topologies, we further introduce a conditional information bottleneck (CIB) objective that compresses inter-group communication, preserving task-relevant signals while filtering out redundant historical noise. Extensive experiments on six benchmarks demonstrate the state-of-the-art performance of GoAgent with 93.84% average accuracy while reducing token consumption by about 17%.
On the existence of fair zero-determinant strategies in the periodic prisoner's dilemma game
Repeated games are a framework for investigating long-term interdependence in multi-agent systems. In repeated games, zero-determinant (ZD) strategies attract much attention in evolutionary game theory, since they can unilaterally control payoffs. In particular, fair ZD strategies unilaterally equalize the payoff of the focal player and the average payoff of the opponents, and they have been found in several games, including social dilemma games. Although the existence condition of ZD strategies in repeated games has been specified, its extension to stochastic games remains largely unclear. Stochastic games are an extension of repeated games in which the environment has a state, and the state changes according to the action profile of the players. Because of the transition of the environmental state, the existence condition of ZD strategies in stochastic games is more complicated than that in repeated games. Here, we investigate the existence condition of fair ZD strategies in the periodic prisoner's dilemma game, which is one of the simplest stochastic games. We show that fair ZD strategies do not necessarily exist in the periodic prisoner's dilemma game, in contrast to the repeated prisoner's dilemma game. Furthermore, we also prove that the Tit-for-Tat strategy, which imitates the opponent's action, is not necessarily a fair ZD strategy in the periodic prisoner's dilemma game, whereas the Tit-for-Tat strategy is always a fair ZD strategy in the repeated prisoner's dilemma game. Our results highlight the difference between ZD strategies in the periodic prisoner's dilemma game and those in the standard repeated prisoner's dilemma game.
comment: 25 pages
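The abstract above states that Tit-for-Tat is always a fair ZD strategy in the standard repeated prisoner's dilemma. The following is a minimal numerical sketch of that baseline claim only, not of the paper's periodic game or its proofs. It assumes an undiscounted game, memory-one strategies, the conventional payoff values (R, S, T, P) = (3, 0, 5, 1), and an arbitrarily chosen opponent strategy; all function and variable names are illustrative.

```python
# Illustrative sketch (not from the paper): Tit-for-Tat equalizes long-run
# payoffs in the standard repeated prisoner's dilemma.
# States are ordered (CC, CD, DC, DD) = (focal action, opponent action).

R, S, T, P = 3.0, 0.0, 5.0, 1.0  # conventional payoff values, assumed here


def transition_matrix(p, q):
    """Markov chain over joint actions for two memory-one strategies.

    p[s] is the focal player's cooperation probability after state s;
    q[s] is the opponent's, indexed by the same focal-perspective state.
    """
    rows = []
    for s in range(4):
        a, b = p[s], q[s]
        rows.append([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)])
    return rows


def stationary(m, iters=20000):
    """Approximate the stationary distribution by power iteration."""
    v = [0.25] * 4
    for _ in range(iters):
        v = [sum(v[i] * m[i][j] for i in range(4)) for j in range(4)]
    return v


def payoffs(v):
    """Long-run average payoffs (focal, opponent) under distribution v."""
    focal = v[0] * R + v[1] * S + v[2] * T + v[3] * P
    other = v[0] * R + v[1] * T + v[2] * S + v[3] * P
    return focal, other


tft = [1.0, 0.0, 1.0, 0.0]                # cooperate iff the opponent just cooperated
opponent_strategy = [0.7, 0.2, 0.6, 0.3]  # arbitrary stochastic memory-one strategy

dist = stationary(transition_matrix(tft, opponent_strategy))
focal_payoff, opponent_payoff = payoffs(dist)
print(focal_payoff, opponent_payoff)
```

The two printed payoffs coincide because Tit-for-Tat copies the opponent's previous move, so in the long run the states CD and DC occur equally often. The paper's result is that this equalization can fail once the payoff state changes periodically.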
Planning Autonomous Vehicle Maneuvering in Work Zones Through Game-Theoretic Trajectory Generation
Work zone navigation remains one of the most challenging manoeuvres for autonomous vehicles (AVs), where constrained geometries and unpredictable traffic patterns create a high-risk environment. Despite extensive research on AV trajectory planning, few studies address the decision-making required to navigate work zones safely. This paper proposes a novel game-theoretic framework for trajectory generation and control to enhance the safety of lane changes in a work zone environment. By modelling the lane change manoeuvre as a non-cooperative game between vehicles, we use a game-theoretic planner to generate trajectories that balance safety, progress, and traffic stability. The simulation results show that the proposed game-theoretic model reduces the frequency of conflicts by 35 percent and decreases the probability of high-risk safety events compared to traditional vehicle behaviour planning models in safety-critical highway work-zone scenarios.
comment: This work has been submitted to the IEEE for possible publication
MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms
Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.
IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning
Although robot-to-robot (R2R) communication improves indoor scene understanding beyond what a single robot can achieve, R2R alone cannot overcome partial observability without substantial exploration overhead or scaling team size. In contrast, many indoor environments already include low-cost Internet of Things (IoT) sensors (e.g., cameras) that provide persistent, building-wide context beyond onboard perception. We therefore introduce IndoorR2X, the first benchmark and simulation framework for Large Language Model (LLM)-driven multi-robot task planning with Robot-to-Everything (R2X) perception and communication in indoor environments. IndoorR2X integrates observations from mobile robots and static IoT devices to construct a global semantic state that supports scalable scene understanding, reduces redundant exploration, and enables high-level coordination through LLM-based planning. IndoorR2X provides configurable simulation environments, sensor layouts, robot teams, and task suites to systematically evaluate high-level semantic coordination strategies. Extensive experiments across diverse settings demonstrate that IoT-augmented world modeling improves multi-robot efficiency and reliability, and we highlight key insights and failure modes for advancing LLM-based collaboration between robot teams and indoor IoT sensors.
Multi-Robot Learning-Informed Task Planning Under Uncertainty ICRA 2026
We want a multi-robot team to complete complex tasks in minimum time where the locations of task-relevant objects are not known. Effective task completion requires reasoning over long horizons about the likely locations of task-relevant objects, how individual actions contribute to overall progress, and how to coordinate team efforts. Planning in this setting is extremely challenging: even when task-relevant information is partially known, coordinating which robot performs which action and when is difficult, and uncertainty introduces a multiplicity of possible outcomes for each action, which further complicates long-horizon decision-making and coordination. To address this, we propose a multi-robot planning abstraction that integrates learning to estimate uncertain aspects of the environment with model-based planning for long-horizon coordination. We demonstrate the efficient multi-stage task planning of our approach for 1, 2, and 3 robot teams over competitive baselines in large ProcTHOR household environments. Additionally, we demonstrate the effectiveness of our approach with a team of two LoCoBot mobile robots in real household settings.
comment: 8 pages, 8 figures. Accepted at ICRA 2026
Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?
Language models are increasingly being trained to "reason" before answering users' queries, outputting hundreds or even thousands of tokens worth of deliberation before their final answer. While the main intention of reasoning is to improve models' ability to arrive at a correct answer, we argue that these models should be assessed for the legibility of their reasoning traces in addition to the correctness of their final answers. In this paper, we evaluate 90k traces from 12 Reasoning Language Models (RLMs) for the quality of their reasoning traces. We introduce the concept of transfer utility, which assesses how useful an RLM's reasoning traces are for guiding a weaker, non-reasoning model toward arriving at the correct answer. We find that the reasoning traces of the highest-performing models rank among the lowest for legibility. Furthermore, we uncover tensions between efficiency-based measurements of legibility (such as trace length) and transfer utility. These tensions establish a legibility Pareto frontier, and we demonstrate that an RLM's ability to output highly legible traces can be a task- and audience-dependent goal. Crucially, we find that reward models used to train RLMs do not intrinsically reward legibility. Together, these metrics and the findings they surface chart a path towards scaffolding reasoning traces for a multi-agent future.
Hetero-Net: An Energy-Efficient Resource Allocation and 3D Placement in Heterogeneous LoRa Networks via Multi-Agent Optimization
The evolution of the Internet of Things (IoT) into multi-layered environments has positioned Low-Power Wide Area Networks (LPWANs), particularly Long Range (LoRa), as the backbone for connectivity across both surface and subterranean landscapes. However, existing LoRa-based network designs often treat ground-based wireless sensor networks (WSNs) and wireless underground sensor networks (WUSNs) as separate systems, resulting in inefficient and non-integrated connectivity across diverse environments. To address this, we propose Hetero-Net, a unified heterogeneous LoRa framework that integrates diverse LoRa end devices with multiple unmanned aerial vehicle (UAV)-mounted LoRa gateways. Our objective is to maximize system energy efficiency through the joint optimization of the spreading factor, transmission power, and three-dimensional (3D) placement of the UAVs. To manage the dynamic and partially observable nature of this system, we model the problem as a partially observable stochastic game (POSG) and address it using a multi-agent proximal policy optimization (MAPPO) framework. An ablation study shows that our proposed MAPPO Hetero-Net significantly outperforms traditional, isolated network designs, achieving energy efficiency improvements of 55.81% and 198.49% over isolated WSN-only and WUSN-only deployments, respectively.
comment: 6 pages, 7 figures
ALARA for Agents: Least-Privilege Context Engineering Through Portable Composable Multi-Agent Teams
Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, yet the frameworks through which these systems operate do not provide a simple, unified mechanism for scalably managing the critical aspects of the agent harness, impacting both the quality of individual human-agent interactions and the capacity for practitioners to coordinate toward common goals through shared agent infrastructure. Agent frameworks have enabled increasingly sophisticated multi-agent systems, but the behavioral specifications that define what these agents can do remain fragmented across prose instruction files, framework-internal configuration, and mechanisms like MCP servers that operate separately from individual agent definitions, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to agent context, we introduce a declarative context-agent-tool (CAT) data layer expressed through interrelated files that scope each agent's tool access and context to the minimum its role requires, and \texttt{npcsh}, a command-line shell for executing it. Because the system parses and enforces these files structurally, modifying an agent's tool list produces a guaranteed behavioral change rather than a suggestion the model may or may not follow. We evaluate 22 locally-hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation, characterizing which model families succeed at which task categories and where they break down across $\sim$2500 total executions.
comment: Submitted to HAXD 2026, 8 pages, 6 figures, framework and benchmark are open source at https://github.com/NPC-Worldwide/npcsh
Bounded Coupled AI Learning Dynamics in Tri-Hierarchical Drone Swarms
Modern autonomous multi-agent systems combine heterogeneous learning mechanisms operating at different timescales. An open question remains: can one formally guarantee that coupled dynamics of such mechanisms stay within the admissible operational regime? This paper studies a tri-hierarchical swarm learning system where three mechanisms act simultaneously: (1) local Hebbian online learning at individual agent level (fast timescale, 10-100 ms); (2) multi-agent reinforcement learning (MARL) for tactical group coordination (medium timescale, 1-10 s); (3) meta-learning (MAML) for strategic adaptation (slow timescale, 10-100 s). Four results are established. The Bounded Total Error Theorem shows that under contractual constraints on learning rates, Lipschitz continuity of inter-level mappings, and weight stabilization, total suboptimality admits a component-wise upper bound uniform in time. The Bounded Representation Drift Theorem gives a worst-case estimate of how Hebbian updates affect coordination-level embeddings during one MARL cycle. The Meta-Level Compatibility Theorem provides sufficient conditions under which strategic adaptation preserves lower-level invariants. The Non-Accumulation Theorem proves that error does not grow unboundedly over time.
comment: 25 pages, 3 tables
When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines
Multi-agent LLM pipelines produce contradictory evidence on whether team diversity improves output quality: heterogeneous Mixture-of-Agents teams outperform single models, yet homogeneous Self-MoA teams consistently win under synthesis-based aggregation. We propose a resolution by identifying the selection bottleneck -- a crossover threshold in aggregation quality that determines whether diversity helps or hurts. Under this model, we obtain a closed-form crossover threshold $s^*$ (Proposition 1) that separates the regimes where diversity helps and hurts. In a targeted experiment spanning 42 tasks across 7 categories ($N=210$), a diverse team with judge-based selection achieves a win rate of 0.810 against a single-model baseline, while a homogeneous team scores 0.512 -- near chance (Glass's $Δ= 2.07$). Judge-based selection outperforms MoA-style synthesis by $Δ_{\mathrm{WR}} = +0.631$ -- the synthesis approach is preferred over the baseline in zero of 42 tasks by the judge panel. A decoupled evaluation with independent judges confirms all directional findings (Spearman $ρ= 0.90$). Exploratory evidence suggests that including a weaker model improves performance while reducing cost ($p < 10^{-4}$, not pre-registered). Our results suggest that selector quality may be a more impactful design lever than generator diversity in single-round generate-then-select pipelines.
comment: 12 pages, 3 figures, 5 tables
Is Your LLM-as-a-Recommender Agent Trustable? LLMs' Recommendation is Easily Hacked by Biases (Preferences)
Large Language Models (LLMs) are increasingly used in practically valuable agentic workflows such as Deep Research, e-commerce recommendation, and job recruitment. In these applications, LLMs need to select optimal solutions from a massive set of candidates, which we term the LLM-as-a-Recommender paradigm. However, the reliability of using LLM agents for recommendations is underexplored. In this work, we introduce the Bias Recommendation Benchmark (BiasRecBench) to highlight the critical vulnerability of such agents to biases in high-value real-world tasks. The benchmark includes three practical domains: paper review, e-commerce, and job recruitment. We construct a Bias Synthesis Pipeline with Calibrated Quality Margins that (1) synthesizes evaluation data by controlling the quality gap between optimal and sub-optimal options, providing a calibrated testbed to elicit vulnerability to biases, and (2) injects contextual biases that are logical and suitable for the option contexts. Extensive experiments on both SOTA (Gemini-{2.5,3}-pro, GPT-4o, DeepSeek-R1) and small-scale LLMs reveal that agents frequently succumb to injected biases despite having sufficient reasoning capabilities to identify the ground truth. These findings expose a significant reliability bottleneck in current agentic workflows, calling for specialized alignment strategies for LLM-as-a-Recommender. The complete code and evaluation datasets will be made publicly available shortly.
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning CVPR2026
This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest-performing agent) and re-staging (e.g., new-clue and matching-block-based perception-action exploration), or to conclude by producing the final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 11 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
comment: Accepted by CVPR2026
ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems
Autonomous LLM-based agents increasingly operate as long-running processes forming densely interconnected multi-agent ecosystems, whose security properties remain largely unexplored. In particular, OpenClaw, an open-source platform with over 40,000 active instances, has stood out recently with its persistent configurations, tool-execution privileges, and cross-platform messaging capabilities. In this work, we present ClawWorm, the first self-replicating worm attack against a production-scale agent framework, achieving a fully autonomous infection cycle initiated by a single message: the worm first hijacks the victim's core configuration to establish persistent presence across session restarts, then executes an arbitrary payload upon each reboot, and finally propagates itself to every newly encountered peer without further attacker intervention. We evaluate the attack on a controlled testbed across four distinct LLM backends, three infection vectors, and three payload types (1,800 total trials). We demonstrate a 64.5% aggregate attack success rate and sustained multi-hop propagation, and we reveal stark divergences in model security postures -- highlighting that while execution-level filtering effectively mitigates dormant payloads, skill supply chains remain universally vulnerable. We analyse the architectural root causes underlying these vulnerabilities and propose defence strategies targeting each identified trust boundary. Code and samples will be released upon completion of responsible disclosure.
Risk-Bounded Multi-Agent Visual Navigation via Iterative Risk Allocation ICAPS '26
Safe navigation is essential for autonomous systems operating in hazardous environments, especially when multiple agents must coordinate using only high-dimensional visual observations. While recent approaches successfully combine Goal-Conditioned RL (GCRL) for graph construction with Conflict-Based Search (CBS) for planning, they typically rely on deleting edges with high risk before running CBS to enforce safety. This binary strategy is overly conservative, precluding feasible missions that require traversing high-risk regions, even when the aggregate risk is acceptable. To address this, we introduce a framework for Risk-Bounded Multi-Agent Path Finding ($Δ$-MAPF), where agents share a user-specified global risk budget ($Δ$). Rather than permanently discarding edges, our framework dynamically distributes per-agent risk budgets ($δ_i$) during search via an Iterative Risk Allocation (IRA) layer that integrates with a standard CBS planner. We investigate two distribution strategies: a greedy surplus-deficit scheme for rapid feasibility repair, and a market-inspired mechanism that treats risk as a priced resource to guide improved allocation. The market-based mechanism yields a tunable trade-off wherein agents exploit available risk to secure shorter, more efficient paths, but revert to longer, safer detours under tighter budgets. Experiments in complex visual environments show that our dynamic allocation framework achieves higher success rates than baselines and effectively leverages the available safety budget to reduce travel time. The project website is available at https://rb-visual-mapf-mers.csail.mit.edu
comment: Published at ICAPS '26
Designing Auctions when Algorithms Learn to Bid
Algorithms increasingly automate bidding in online auctions, raising concerns about tacit bid suppression and revenue shortfalls. Prior work identifies individual mechanisms behind algorithmic bid suppression, but it remains unclear which factors matter most and how they interact, and policy conclusions rest on algorithms unlike those deployed in practice. This paper develops a computational laboratory framework, based on factorial experimental designs and large-scale Monte Carlo simulation, that addresses bid suppression across multiple algorithm classes within a common methodology. Each simulation is treated as a black-box input-output observation; the framework varies inputs and ranks factors by association with outcomes, without explaining algorithms' internal mechanisms. Across six sub-experiments spanning Q-learning, contextual bandits, and budget-constrained pacing, the framework ranks the relative importance of auction format, competitive pressure, learning parameters, and budget constraints on seller revenue. The central finding is that structural market parameters dominate algorithmic design choices. In unconstrained settings, competitive pressure is the strongest predictor of revenue; under budget constraints, budget tightness takes over. The auction-format effect is context-dependent, favouring second-price under learning algorithms but reversing to favour first-price under budget-constrained pacing. Because the optimal format depends on the prevailing bidding technology, no single auction format is universally superior when bidders are algorithms, and applying format recommendations from one algorithm class to another leads to counterproductive design interventions.
Systems and Control (EESS)
Predictor-Feedback Stabilization of Linear Switched Systems with State-Dependent Switching and Input Delay
We develop a predictor-feedback control design for a class of linear systems with state-dependent switching. The main ingredient of our design is a novel construction of an exact predictor state. Such a construction is possible as for a given, state-dependent switching rule, an implementable formula for the predictor state can be derived in a way analogous to the case of nonlinear systems with input delay. We establish uniform exponential stability of the corresponding closed-loop system via a novel construction of multiple Lyapunov functionals, relying on a backstepping transformation that we introduce. We validate our design in simulation considering a switching rule motivated by communication networks.
comment: 6 pages, 3 figures, submitted to European Control Conference 2026 (ECC)
Steady State Distributed Kalman Filter
One of the main challenges in set-based state estimation is the trade-off between accuracy and computational complexity, which becomes particularly critical for systems with time-varying dynamics. Accurate set representations such as polytopes, even when encoded as Constrained Zonotopes (CZs) or Constrained Convex Generators (CCGs), typically lead to a progressive growth of the set description, requiring order reduction procedures that increase the online computational burden. In this paper, we propose a fixed structure and computationally efficient approach for guaranteed state estimation of discrete-time Linear Time-Varying (LTV) systems using CCG formulations. The proposed method expresses the state enclosure explicitly in terms of a fixed number of past inputs and measurements, resulting in a constant-size set description and avoiding the need for online order reduction. Numerical results illustrate the effectiveness and computational advantages of the proposed method.
Computational Complexity Analysis of Interval Methods in Solving Uncertain Nonlinear Systems
This paper analyses the computational complexity of validated interval methods for uncertain nonlinear systems. Interval analysis produces guaranteed enclosures that account for uncertainty and round-off, but its adoption is often limited by computational cost in high dimensions. We develop an algorithm-level worst-case framework that makes the dependence on the initial search volume $\mathrm{Vol}(X_0)$, the target tolerance $\varepsilon$, and the costs of validated primitives explicit (inclusion-function evaluation, Jacobian evaluation, and interval linear algebra). Within this framework, we derive worst-case time and space bounds for interval bisection, subdivision$+$filter, interval constraint propagation, interval Newton, and interval Krawczyk. The bounds quantify the scaling with $\mathrm{Vol}(X_0)$ and $\varepsilon$ for validated steady-state enclosure and highlight dominant cost drivers. We also show that determinant and inverse computation for interval matrices via naive Laplace expansion is factorial in the matrix dimension, motivating specialised interval linear algebra. Finally, interval Newton and interval Krawczyk have comparable leading-order costs; Krawczyk is typically cheaper in practice because it inverts a real midpoint matrix rather than an interval matrix. These results support the practical design of solvers for validated steady-state analysis in applications such as biochemical reaction network modelling, robust parameter estimation, and other uncertainty-aware computations in systems and synthetic biology.
comment: 20 pages, 2 figures
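For readers unfamiliar with the interval bisection primitive named in the abstract above, here is a minimal one-dimensional sketch. It is an illustration under stated assumptions, not the paper's algorithm or its bounds: the test function f(x) = x^2 - 2, the search box [0, 2], the tolerance, and all names are chosen here for the example, and outward rounding is omitted, so the result is not rigorously validated.

```python
# Illustrative sketch: interval bisection enclosing the root of
# f(x) = x^2 - 2 on [0, 2]. Outward rounding is omitted, so this is
# not a rigorously validated implementation.


def f_interval(lo, hi):
    """Interval extension of f(x) = x*x - 2 on [lo, hi], assuming lo >= 0.

    On non-negative boxes x*x is increasing, so the image is exactly
    [lo*lo - 2, hi*hi - 2] (up to floating-point rounding).
    """
    return lo * lo - 2.0, hi * hi - 2.0


def bisect(lo, hi, eps):
    """Return boxes of width <= eps that may contain a root, together with
    the number of inclusion-function evaluations performed."""
    work, kept, evals = [(lo, hi)], [], 0
    while work:
        a, b = work.pop()
        f_lo, f_hi = f_interval(a, b)
        evals += 1
        if f_lo > 0.0 or f_hi < 0.0:
            continue              # 0 is not in f([a, b]): discard the box
        if b - a <= eps:
            kept.append((a, b))   # small enough: keep as an enclosure
            continue
        mid = 0.5 * (a + b)
        work.extend([(a, mid), (mid, b)])
    return kept, evals


boxes, evals = bisect(0.0, 2.0, 1e-6)
print(boxes, evals)
```

In this single-root example the filter discards one half at every level, so the work grows only with the number of halvings needed to reach the tolerance. Without effective filtering, covering an n-dimensional box with pieces of width $\varepsilon$ takes on the order of $\mathrm{Vol}(X_0)/\varepsilon^n$ boxes, which is the kind of dependence on $\mathrm{Vol}(X_0)$ and $\varepsilon$ the abstract refers to.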
Structural Controllability of Large-Scale Hypergraphs
Controlling real-world networked systems, including ecological, biomedical, and engineered networks that exhibit higher-order interactions, remains challenging due to inherent nonlinearities and large system scales. Despite extensive studies on graph controllability, the controllability properties of hypergraphs remain largely underdeveloped. Existing results focus primarily on exact controllability, which is often impractical for large-scale hypergraphs. In this article, we develop a structural controllability framework for hypergraphs by modeling hypergraph dynamics as polynomial dynamical systems. In particular, we extend classical notions of accessibility and dilation from linear graph-based systems to polynomial hypergraph dynamics and establish a hypergraph-based criterion under which the topology guarantees satisfaction of classical Lie-algebraic and Kalman-type rank conditions for almost all parameter choices. We further derive a topology-based lower bound on the minimum number of driver nodes required for structural controllability and leverage this bound to design a scalable driver node selection algorithm combining dilation-aware initialization via maximum matching with greedy accessibility expansion. We demonstrate the effectiveness and scalability of the proposed framework through numerical experiments on hypergraphs with tens to thousands of nodes and higher-order interactions.
comment: 14 pages, 4 figures, 1 table
On the Capacity of Future Lane-Free Urban Infrastructure
In this paper, the potential capacity and spatial efficiency of future autonomous lane-free traffic in urban environments are explored using a combination of analytical and simulation-based approaches. For lane-free roadways, a simple analytical approach is employed, which shows not only that lane-free traffic offers a higher capacity than lane-based traffic for the same street width, but also that the relationship between capacity and street width is continuous under lane-free traffic. To test the potential capacity and properties of lane-free signal-free intersections (automated intersection management), two approaches were simulated and compared, including a novel approach which we call OptWULF. This approach uses a multi-agent conflict-based search approach with a low-level planner that uses a combination of optimization and simple window-based reservation. With these simulations, we confirm the continuous relationship between capacity and street width for intersection scenarios. We also show that OptWULF results in an even utilization of the entire drivable area of the street and intersection area. Furthermore, we show that OptWULF is capable of handling asymmetric demand patterns without any substantial loss in capacity compared to symmetric demand patterns.
comment: 9 pages, 8 figures, submitted to IEEE Transactions on Intelligent Transportation Systems
Learning Adaptive Parameter Policies for Nonlinear Bayesian Filtering
Algorithms for Bayesian state estimation of nonlinear systems inevitably introduce approximation errors. These algorithms depend on parameters that influence the accuracy of the numerical approximations used. The parameters include, for example, the number of particles, scaling parameters, and the number of iterations in iterative computations. Typically, these parameters are fixed or adjusted heuristically, although the approximation accuracy can change over time with the local degree of nonlinearity and uncertainty. The approximation errors introduced at a time step propagate through subsequent updates, affecting the accuracy, consistency, and robustness of future estimates. This paper presents adaptive parameter selection in nonlinear Bayesian filtering as a sequential decision-making problem, where parameters influence not only the immediate estimation outcome but also the future estimates. The decision-making problem is addressed using reinforcement learning to learn adaptive parameter policies for nonlinear Bayesian filters. Experiments with the unscented Kalman filter and stochastic integration filter demonstrate that the learned policies improve both estimate quality and consistency.
comment: Submitted to 29th International Conference on Information Fusion
Complex Frequency as Generalized Eigenvalue
This paper shows that the concept of complex frequency, originally introduced to characterize the dynamics of signals with complex values, constitutes a generalization of eigenvalues when applied to the states of linear time-invariant (LTI) systems. Starting from the definition of geometric frequency, which provides a geometrical interpretation of frequency in electric circuits that admits a natural decomposition into symmetric and antisymmetric components associated with amplitude variation and rotational motion, respectively, we show that complex frequency arises as its restriction to the two-dimensional Euclidean plane. For LTI systems, it is shown that the complex frequencies computed from the system's states subject to a non-isometric transformation coincide with the original system's eigenvalues. This equivalence is demonstrated for diagonalizable systems of any order. The paper provides a unified geometric interpretation of eigenvalues, bridging classical linear system theory with differential geometry of curves. The paper also highlights that this equivalence does not generally hold for nonlinear systems. On the other hand, the geometric frequency of the system can always be defined, providing a geometrical interpretation of the system flow. A variety of examples based on linear and nonlinear circuits illustrate the proposed framework.
A Spectral Perspective on Stochastic Control Barrier Functions
Stochastic control barrier functions (SCBFs) provide a safety-critical control framework for systems subject to stochastic disturbances by bounding the probability of remaining within a safe set. However, synthesizing a valid SCBF that explicitly reflects the true safety probability of the system, which is the most natural measure of safety, remains a challenge. This paper addresses this issue by adopting a spectral perspective, utilizing the linear operator that governs the evolution of the closed-loop system's safety probability. We find that the dominant eigenpair of this Koopman-like operator encodes fundamental safety information of the stochastic system. The dominant eigenfunction is a natural and valid SCBF, with values that explicitly quantify the relative long-term safety of the state, while the dominant eigenvalue indicates the global rate at which the safety probability decays. A practical synthesis algorithm is proposed, termed power-policy iteration, which jointly computes the dominant eigenpair and an optimized backup policy. The method is validated using simulation experiments on safety-critical dynamics models.
comment: 16 pages, 7 figures. This work has been submitted to the IEEE for possible publication
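As a rough finite-state analogue of the spectral idea above, consider a Markov chain restricted to its safe states. The k-step safety probability is then governed by powers of a sub-stochastic matrix `Q`, and plain power iteration recovers its dominant eigenpair. The matrix values and the function name `dominant_eigenpair` are hypothetical, and this sketch omits the policy optimization that the paper's power-policy iteration performs.

```python
import numpy as np


def dominant_eigenpair(Q, iters=2000, tol=1e-12):
    """Power iteration for a nonnegative matrix; the eigenvector is scaled to max entry 1."""
    v = np.ones(Q.shape[0])
    lam = 0.0
    for _ in range(iters):
        w = Q @ v
        lam_new = np.max(w)  # since max(v) == 1, this estimates the eigenvalue
        w = w / lam_new
        converged = abs(lam_new - lam) < tol and np.max(np.abs(w - v)) < tol
        v, lam = w, lam_new
        if converged:
            break
    return lam, v


# Hypothetical transition probabilities among three safe states.
# Each row sums to less than 1; the missing mass is the chance of leaving the safe set.
Q = np.array([
    [0.90, 0.05, 0.00],
    [0.10, 0.80, 0.05],
    [0.00, 0.20, 0.60],
])

lam, v = dominant_eigenpair(Q)
```

In this toy setting, `lam` plays the role of the global decay rate of the safety probability, and larger entries of `v` mark states that are relatively safer in the long run.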
Mixed Integer vs. Continuous Model Predictive Controllers for Binary Thruster Control: A Comparative Study
Binary on/off thrusters are commonly used for spacecraft attitude and position control during proximity operations. However, their discrete nature poses challenges for conventional continuous control methods. The control of these discrete actuators is either explicitly formulated as a mixed-integer optimization problem or handled in a two-layer approach, where a continuous controller's output is converted to binary commands using analog-to-digital modulation techniques such as Delta-Sigma modulation. This paper provides the first systematic comparison between these two paradigms for binary thruster control, contrasting continuous Model Predictive Control (MPC) with Delta-Sigma modulation against direct Mixed-Integer MPC (MIMPC) approaches. Furthermore, we propose a new variant of MPC for binary actuated systems, which is informed using the state of the Delta-Sigma modulator. The two variations of the continuous MPC, along with the MIMPC, are evaluated through extensive simulations using ESA's REACSA platform. Results demonstrate that while all approaches perform similarly in high-thrust regimes, MIMPC achieves superior fuel efficiency in low-thrust conditions. Continuous MPC with modulation shows instabilities at higher thrust levels, while binary-informed MPC, which incorporates modulator dynamics, improves robustness and reduces the efficiency gap to the MIMPC. The simulated and real-system experiments show that MIMPC offers complete stability and fuel efficiency benefits, particularly for resource-constrained missions, while continuous control methods remain attractive for computationally limited applications.
comment: Accepted to CEAS EuroGNC 2026
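The two-layer approach described above relies on a modulator that turns a continuous command into on/off pulses. A minimal sketch of a first-order Delta-Sigma modulator follows, assuming the command is normalized to [0, 1] and a firing threshold of 0.5. The function name and these conventions are illustrative assumptions, not the paper's implementation.

```python
def delta_sigma(u_seq):
    """Convert commands in [0, 1] into binary on/off pulses with a first-order modulator."""
    acc = 0.0  # integrator state: accumulated command minus delivered pulses
    bits = []
    for u in u_seq:
        acc += u
        bit = 1 if acc >= 0.5 else 0  # fire when the accumulated error reaches the threshold
        acc -= bit                    # feed the delivered pulse back into the integrator
        bits.append(bit)
    return bits


pulses = delta_sigma([0.25] * 8)
# pulses == [0, 1, 0, 0, 0, 1, 0, 0]: two pulses in eight steps, matching the 25% command
```

Because the integrator state stays within [-0.5, 0.5), the number of pulses differs from the summed command by at most one half, which is why the time-averaged thrust tracks the continuous command.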
Accurate Open-Loop Control of a Soft Continuum Robot Through Visually Learned Latent Representations
This work addresses open-loop control of a soft continuum robot (SCR) from video-learned latent dynamics. Visual Oscillator Networks (VONs) from previous work are used, which provide mechanistically interpretable 2D oscillator latents through an attention broadcast decoder (ABCD). Open-loop, single-shooting optimal control is performed in latent space to track image-specified waypoints without camera feedback. An interactive SCR live simulator enables design of static, dynamic, and extrapolated targets and maps them to model-specific latent waypoints. On a two-segment pneumatic SCR, Koopman, MLP, and oscillator dynamics, each with and without ABCD, are evaluated on setpoint and dynamic trajectories. ABCD-based models consistently reduce image-space tracking error. The VON and ABCD-based Koopman models attain the lowest MSEs. Using an ablation study, we demonstrate that several architecture choices and training settings contribute to the open-loop control performance. Simulation stress tests further confirm static holding, stable extrapolated equilibria, and plausible relaxation to the rest state. To the best of our knowledge, this is the first demonstration that interpretable, video-learned latent dynamics enable reliable long-horizon open-loop control of an SCR.
Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis
Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.
comment: Submitted to IEEE Transactions on Automatic Control
On the existence of fair zero-determinant strategies in the periodic prisoner's dilemma game
Repeated games are a framework for investigating long-term interdependence of multi-agent systems. In repeated games, zero-determinant (ZD) strategies attract much attention in evolutionary game theory, since they can unilaterally control payoffs. In particular, fair ZD strategies unilaterally equalize the payoff of the focal player and the average payoff of the opponents, and they were found in several games including the social dilemma games. Although the existence condition of ZD strategies in repeated games was specified, its extension to stochastic games remains largely unclear. Stochastic games are an extension of repeated games, where a state of an environment exists, and the state changes to another one according to an action profile of players. Because of the transition of an environmental state, the existence condition of ZD strategies in stochastic games is more complicated than that in repeated games. Here, we investigate the existence condition of fair ZD strategies in the periodic prisoner's dilemma game, which is one of the simplest stochastic games. We show that fair ZD strategies do not necessarily exist in the periodic prisoner's dilemma game, in contrast to the repeated prisoner's dilemma game. Furthermore, we also prove that the Tit-for-Tat strategy, which imitates the opponent's action, is not necessarily a fair ZD strategy in the periodic prisoner's dilemma game, whereas the Tit-for-Tat strategy is always a fair ZD strategy in the repeated prisoner's dilemma game. Our results highlight the difference between ZD strategies in the periodic prisoner's dilemma game and those in the standard repeated prisoner's dilemma game.
comment: 25 pages
ContractionPPO: Certified Reinforcement Learning via Differentiable Contraction Layers
Legged locomotion in unstructured environments demands not only high-performance control policies but also formal guarantees to ensure robustness under perturbations. Control methods often require carefully designed reference trajectories, which are challenging to construct in high-dimensional, contact-rich systems such as quadruped robots. In contrast, Reinforcement Learning (RL) directly learns policies that implicitly generate motion, and uniquely benefits from access to privileged information, such as full state and dynamics during training, that is not available at deployment. We present ContractionPPO, a framework for certified robust planning and control of legged robots by augmenting Proximal Policy Optimization (PPO) RL with a state-dependent contraction metric layer. This approach enables the policy to maximize performance while simultaneously producing a contraction metric that certifies incremental exponential stability of the simulated closed-loop system. The metric is parameterized as a Lipschitz neural network and trained jointly with the policy, either in parallel or as an auxiliary head of the PPO backbone. While the contraction metric is not deployed during real-world execution, we derive upper bounds on the worst-case contraction rate and show that these bounds ensure the learned contraction metric generalizes from simulation to real-world deployment. Our hardware experiments on quadruped locomotion demonstrate that ContractionPPO enables robust, certifiably stable control even under strong external perturbations.
comment: Accepted to RA-L journal
Grid-following and Grid-forming Switching Control for Grid-connected Inverters Considering Small-signal Security Region
In high-penetration renewable power systems with complex and highly variable operating scenarios, grid-connected inverters (GCIs) may transition between different control modes to adapt to diverse grid conditions. Among these, the switching between grid-following (GFL) and grid-forming (GFM) control modes is particularly critical. Nevertheless, safe and robust GFL-GFM switching control strategies for GCIs remain largely unexplored. To overcome this challenge, this paper establishes a full-order small-signal state-space model for the GFL-GFM switched system, precisely reflecting all internal circuit and control dynamics. Subsequently, the small-signal security region (SSSR) of the switched system is defined and characterized, followed by an in-depth investigation into the multi-parameter impacts on the SSSRs and internal stability margin distributions (ISMDs). Furthermore, a novel comprehensive stability index (CSI) is proposed by integrating the stability margin, parameter sensitivity, and boundary distance. Based on this CSI, a multi-objective adaptive GFL-GFM switching control strategy is designed to guarantee the dynamic security and robustness of the system. Finally, the proposed SSSR analysis method for the GFL-GFM switched system and the designed CSI-based switching control mechanism are validated through electromagnetic transient (EMT) simulations.
comment: 10 pages, 11 figures
PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management
Battery life remains a critical challenge for mobile devices, yet existing power management mechanisms rely on static rules or coarse-grained heuristics that ignore user activities and personal preferences. We present PowerLens, a system that tames the reasoning power of Large Language Models (LLMs) for safe and personalized mobile power management on Android devices. The key idea is that LLMs' commonsense reasoning can bridge the semantic gap between user activities and system parameters, enabling zero-shot, context-aware policy generation that adapts to individual preferences through implicit feedback. PowerLens employs a multi-agent architecture that recognizes user context from UI semantics and generates holistic power policies across 18 device parameters. A PDL-based constraint framework verifies every action before execution, while a two-tier memory system learns individualized preferences from implicit user overrides through confidence-based distillation, requiring no explicit configuration and converging within 3--5 days. Extensive experiments on a rooted Android device show that PowerLens achieves 81.7% action accuracy and 38.8% energy saving over stock Android, outperforming rule-based and LLM-based baselines, with high user satisfaction, fast preference convergence, and strong safety guarantees, with the system itself consuming only 0.5% of daily battery capacity.
Direct Digital-to-Physical Synthesis: From mmWave Transmitter to Qubit Control
The increasing demand for high-speed wireless connectivity and scalable quantum information processing has driven parallel advancements in millimeter-wave (MMW) communication transmitters and cryogenic qubit controllers. Despite serving different applications, both systems rely on the precise generation of radio frequency (RF) waveforms with stringent requirements on spectral purity, timing, and amplitude control. Recent architectures replace conventional methods by embedding digital signal generation and processing directly into the RF path, transforming digital bits into physical waveforms for either electromagnetic transmission or quantum state control. This article presents a unified analysis of direct-digital modulation techniques across both domains, showing the synergy and similarities between them. The article also focuses on four core architectures: Cartesian I/Q, Polar, RF Digital-to-Analog Converter (DAC), and harmonic/subharmonic modulation across both domains. We analyze their respective trade-offs in energy efficiency, signal integrity, waveform synthesis, and error mitigation, and highlight how architectural innovations in one domain can accelerate progress in the other.
Verifiable Error Bounds for Physics-Informed Neural Network Solutions of Lyapunov and Hamilton-Jacobi-Bellman Equations
Many core problems in nonlinear systems analysis and control can be recast as solving partial differential equations (PDEs) such as Lyapunov and Hamilton-Jacobi-Bellman (HJB) equations. Physics-informed neural networks (PINNs) have emerged as a promising mesh-free approach for approximating their solutions, but in most existing works there is no rigorous guarantee that a small PDE residual implies a small solution error. This paper develops verifiable error bounds for approximate solutions of Lyapunov and HJB equations, with particular emphasis on PINN-based approximations. For both the Lyapunov and HJB PDEs, we show that a verifiable residual bound yields relative error bounds with respect to the true solutions as well as computable a posteriori estimates in terms of the approximate solutions. For the HJB equation, this also yields certified upper and lower bounds on the optimal value function on compact sublevel sets and quantifies the optimality gap of the induced feedback policy. We further show that one-sided residual bounds already imply that the approximation itself defines a valid Lyapunov or control Lyapunov function. We illustrate the results with numerical examples.
MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms
Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.
Robust Linear Quadratic Optimal Control of Cementitious Material Extrusion
Extrusion-based 3D printing of cementitious materials enables fabrication of complex structures; however, it is highly sensitive to disturbances, material property variations, and process uncertainties that decrease flow stability and dimensional fidelity. To address these challenges, this study proposes a robust linear quadratic optimal control framework for regulating material extrusion in cementitious direct ink writing systems. The printer is modeled using two coupled subsystems: an actuation system representing nozzle flow dynamics and a printing system describing the printed strand flow on the build plate. A hybrid control architecture combining sliding mode control for disturbance rejection with linear quadratic optimal feedback for energy-efficient tracking is developed to ensure robustness and optimality. In simulation case studies, the control architecture guarantees acceptable convergence of nozzle and strand flow tracking errors under bounded disturbances.
Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case
Engineering system design -- whether mechatronic, control, or embedded -- often proceeds in an ad hoc manner, with requirements left implicit and traceability from intent to parameters largely absent. Existing specification-driven and systematic design methods mostly target software, and AI-assisted tools tend to enter the workflow at solution generation rather than at problem framing. Human--AI collaboration in the design of physical systems remains underexplored. This paper presents Design-OS, a lightweight, specification-driven workflow for engineering system design organized in five stages: concept definition, literature survey, conceptual design, requirements definition, and design definition. Specifications serve as the shared contract between human designers and AI agents; each stage produces structured artifacts that maintain traceability and support agent-augmented execution. We position Design-OS relative to requirements-driven design, systematic design frameworks, and AI-assisted design pipelines, and demonstrate it on a control systems design case using two rotary inverted pendulum platforms -- an open-source SimpleFOC reaction wheel and a commercial Quanser Furuta pendulum -- showing how the same specification-driven workflow accommodates fundamentally different implementations. A blank template and the full design-case artifacts are shared in a public repository to support reproducibility and reuse. The workflow makes the design process visible and auditable, and extends specification-driven orchestration of AI from software to physical engineering system design.
comment: 2 figures, 11 pages, Submitted to ASME IDETC 2026 - DAC-09
A Controller Synthesis Framework for Weakly-Hard Control Systems
Deadline misses are more common in real-world systems than one may expect. The weakly-hard task model has become a standard abstraction to describe and analyze how often these misses occur, and has been especially used in control applications. Most existing control approaches check whether a controller manages to stabilize the system it controls when its implementation occasionally misses deadlines. However, they usually do not incorporate deadline-overrun knowledge during the controller synthesis process. In this paper, we present a framework that explicitly integrates weakly-hard constraints into the control design. Our method supports various overrun handling strategies and guarantees stability and performance under weakly-hard constraints. We validate the synthesized controllers on a Furuta pendulum, a representative control benchmark. The results show that constraint-aware controllers significantly outperform traditional designs, demonstrating the benefits of proactive and informed synthesis for overrun-aware real-time control.
comment: accepted for publication at RTAS 2026
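For readers unfamiliar with the weakly-hard task model, one common form of the constraint bounds the number of deadline misses in any window of consecutive jobs. The sketch below checks a recorded miss pattern against an "at most m misses in any k consecutive jobs" constraint. The function name and the sequence encoding are illustrative assumptions and are not taken from the paper.

```python
def satisfies_weakly_hard(misses, m, k):
    """Check that every window of k consecutive jobs contains at most m deadline misses.

    misses: sequence where a truthy entry means the job missed its deadline.
    """
    flags = [1 if x else 0 for x in misses]
    if len(flags) < k:
        return sum(flags) <= m  # fewer than k jobs recorded: treat them as one window
    window = sum(flags[:k])
    if window > m:
        return False
    for i in range(k, len(flags)):
        window += flags[i] - flags[i - k]  # slide the window forward by one job
        if window > m:
            return False
    return True


ok = satisfies_weakly_hard([0, 1, 0, 0, 1, 0, 0], m=1, k=3)
bad = satisfies_weakly_hard([0, 1, 1, 0], m=1, k=3)
```

A controller synthesized with such a constraint in mind only needs to guarantee stability for miss patterns that pass this check, rather than for arbitrary sequences of overruns.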
Distributed State Estimation for Discrete-time LTI Systems: the Design Trilemma and a Novel Framework
With the advancement of IoT technologies and the rapid expansion of cyber-physical systems, there is increasing interest in distributed state estimation, where multiple sensors collaboratively monitor large-scale dynamic systems. Compared with its continuous-time counterpart, a discrete-time distributed observer faces greater challenges, as it cannot exploit high-gain mechanisms or instantaneous communication. Existing approaches depend on three tightly coupled factors: (i) system observability, (ii) communication frequency and dimension of the exchanged information, and (iii) network connectivity. However, the interdependence among these factors remains underexplored. This paper identifies a fundamental trilemma among these factors and introduces a general design framework that balances them through an iterative semidefinite programming approach. As such, the proposed method mitigates the restrictive assumptions present in existing works. The effectiveness and generality of the proposed approach are demonstrated through a simulation example.
An Agentic Multi-Agent Architecture for Cybersecurity Risk Management
Getting a real cybersecurity risk assessment for a small organization is expensive -- a NIST CSF-aligned engagement runs $15,000 on the low end, takes weeks, and depends on practitioners who are genuinely scarce. Most small companies skip it entirely. We built a six-agent AI system where each agent handles one analytical stage: profiling the organization, mapping assets, analyzing threats, evaluating controls, scoring risks, and generating recommendations. Agents share a persistent context that grows as the assessment proceeds, so later agents build on what earlier ones concluded -- the mechanism that distinguishes this from standard sequential agent pipelines. We tested it on a 15-person HIPAA-covered healthcare company and compared outputs to independent assessments by three CISSP practitioners -- the system agreed with them 85% of the time on severity classifications, covered 92% of identified risks, and finished in under 15 minutes. We then ran 30 repeated single-agent assessments across five synthetic but sector-realistic organizational profiles in healthcare, fintech, manufacturing, retail, and SaaS, comparing a general-purpose Mistral-7B against a domain fine-tuned model. Both completed every run. The fine-tuned model flagged threats the baseline could not see at all: PHI exposure in healthcare, OT/IIoT vulnerabilities in manufacturing, platform-specific risks in retail. The full multi-agent pipeline, however, failed every one of 30 attempts on a Tesla T4 with its 4,096-token default context window -- context capacity, not model quality, turned out to be the binding constraint.
comment: 15 pages, 1 figure, 2 tables. Submitted to AICTC 2026 (Springer LNCS)
Grid-Constrained Smart Charging of Large EV Fleets: Comparative Study of Sequential DP and a Full Fleet Solver
This paper presents a comparative optimization framework for smart charging of electrified vehicle fleets. Using heuristic sequential dynamic programming (SeqDP), the framework minimizes electricity costs while adhering to constraints related to the power grid, charging infrastructure, vehicle availability, and simple considerations of battery aging. Based on real-world operational data, the model incorporates discrete energy states, time-varying tariffs, and state-of-charge (SoC) targets to deliver a scalable and cost-effective solution. The classical DP approach suffers from exponential computational complexity as the problem size increases. This becomes particularly problematic when conducting monthly-scale analyses aimed at minimizing peak power demand across all vehicles. The extended time horizon, coupled with multi-state decision-making, renders exact optimization impractical at larger scales. To address this, a heuristic method is employed to enable systematic aggregation and tractable computation for the Non-Linear Programming (NLP) problem. Rather than seeking a globally optimal solution, this study focuses on a time-efficient smart charging strategy that aims to minimize energy cost while flattening the overall power profile. In this context, a sequential heuristic DP approach is proposed. Its performance is evaluated against a full-fleet solver using Gurobi, a widely used commercial solver in both academia and industry. The proposed algorithm achieves a reduction of the overall cost and peak power by more than 90% compared to uncontrolled schedules. Its relative cost remains within 9% of the optimal values obtained from the full-fleet solver, and its relative peak-power deviation stays below 15% for larger fleets.
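To illustrate the building block behind DP-based charging with discrete energy states and time-varying tariffs, here is a minimal single-vehicle sketch. It is not the paper's SeqDP heuristic: grid limits, vehicle availability, battery aging, and peak-power terms are all omitted, and the function name, integer energy units, and example prices are hypothetical.

```python
def cheapest_charging(prices, target, max_step):
    """Minimum cost to reach `target` energy units, charging at most `max_step` units per slot.

    prices: cost per energy unit in each time slot.
    """
    inf = float("inf")
    cost = [0.0] + [inf] * target  # cost[e] = cheapest way to hold e units so far
    for price in prices:
        new_cost = cost[:]  # option: charge nothing in this slot
        for e in range(target + 1):
            if cost[e] == inf:
                continue
            for d in range(1, max_step + 1):
                if e + d > target:
                    break
                candidate = cost[e] + price * d
                if candidate < new_cost[e + d]:
                    new_cost[e + d] = candidate
        cost = new_cost
    return cost[target]


best = cheapest_charging(prices=[3, 1, 2], target=3, max_step=2)
# best == 4.0: charge two units in the cheapest slot and one unit in the next cheapest
```

The state space here is only the energy level of one vehicle. Tracking all vehicles jointly multiplies these state spaces together, which is the exponential growth the abstract attributes to the classical DP approach.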
Online Feedback Optimization of Energy Storage to Smooth Data Center Grid Impacts
The growing electricity demand of AI data centers introduces significant voltage variability in power networks, affecting not only their own operation but also the experience of all users sharing the network. To smooth data center impacts on power networks, we develop an online feedback optimization approach that controls distributed battery energy storage systems to mitigate voltage issues induced by data center operations. The controller adjusts the active and reactive power setpoints of distributed battery systems in response to voltage measurements, with a two-fold objective: managing voltage to minimize the magnitude of constraint violations and smoothing voltage profiles. Control performance is evaluated in a high-fidelity simulation environment that integrates a three-phase distribution feeder and a detailed battery system model, and benchmarked against a local control approach with similar objectives but without optimality guarantees and constraint enforcement. We show that the proposed controller delivers consistent voltage regulation in the long term, while the local control approach pursues the objectives more aggressively but quickly hits the storage limits.
comment: 8 pages, 6 figures
Sustainable Load Balancing for Wireless Networks With Renewable Energy Sources
Future wireless networks powered by renewable energy sources and storage systems (e.g., batteries) require energy-aware mechanisms to ensure stability in critical and high-demand scenarios. These include large-scale user gatherings, especially during evening hours when solar generation is unavailable, and days with poor wind conditions that limit the effectiveness of wind-based energy harvesting. Maintaining network performance under such constraints, while preserving stored energy, remains a key challenge. This work proposes an enhanced Proactive-Reactive Load Balancing algorithm that integrates energy conditions into mobility management. By leveraging standardized mobility events, the algorithm optimizes traffic distribution and energy utilization (avoiding complete drainage of stored energy), thereby preventing service degradation. Simulations show improved energy sustainability and network performance under congestion and limited solar availability.
Performance Guarantees for Data-Driven Sequential Decision-Making
The solutions to many sequential decision-making problems are characterized by dynamic programming and Bellman's principle of optimality. However, due to the inherent complexity of solving Bellman's equation exactly, there has been significant interest in developing various approximate dynamic programming (ADP) schemes to obtain near-optimal solutions. A fundamental question that arises is: how close are the objective values produced by ADP schemes relative to the true optimal objective values? In this paper, we develop a general framework that provides performance guarantees for ADP schemes in the form of ratio bounds. Specifically, we show that the objective value under an ADP scheme is at least a computable fraction of the optimal value. We further demonstrate the applicability of our theoretical framework through two applications: data-driven robot path planning and multi-agent sensor coverage.
High-Speed, All-Terrain Autonomy: Ensuring Safety at the Limits of Mobility
A novel local trajectory planner, capable of controlling an autonomous off-road vehicle on rugged terrain at high speed, is presented. Autonomous vehicles are currently unable to safely operate off-road at high speed, as current approaches either fail to predict and mitigate rollovers induced by rough terrain or are not real-time feasible. To address this challenge, a novel model predictive control (MPC) formulation is developed for local trajectory planning. A new dynamics model for off-road vehicles on rough, non-planar terrain is derived and used for prediction. Extreme mobility, including tire liftoff without rollover, is safely enabled through a new energy-based constraint. The formulation is analytically shown to mitigate rollover types ignored by many state-of-the-art methods, and real-time feasibility is achieved through parallelized GPGPU computation. The planner's ability to provide safe, extreme trajectories is studied through both simulated trials and full-scale physical experiments. The results demonstrate fewer rollovers and more successes compared to a state-of-the-art baseline across several challenging scenarios that push the vehicle to its mobility limits.
comment: 19 pages, 16 figures, submitted to IEEE Transactions on Robotics
A Control Architecture for Fast Frequency Regulation with Increasing Penetration of Inverter Based Resources
This paper addresses frequency regulation under operational constraints in interconnected power systems with high penetration of inverter-based renewable generation. A two-layer control architecture is proposed that combines optimized droop and Virtual Synchronous Machine (VSM) primary control with a Model Predictive Control (MPC) secondary layer operating at realistic control-room update rates. Unlike recently proposed approaches, the proposed framework integrates MPC within existing grid control structures, enabling constraint-aware coordination. A reduced-order frequency response model is systematically derived from a high-fidelity grid model using Hankel singular values, and a reduced-order Kalman-Bucy observer enables state and disturbance estimation using only measurable outputs. Validation using representative data from the Kingdom of Saudi Arabia demonstrates effective frequency regulation under realistic operating conditions.
comment: Under Review in IEEE Transactions on Sustainable Energy
Flow-based Polynomial Chaos Expansion for Uncertainty Quantification in Power System Dynamic Simulation
The large-scale integration of renewable energy sources introduces significant operational uncertainty into power systems. Although Polynomial Chaos Expansion (PCE) provides an efficient tool for uncertainty quantification (UQ) in power system dynamics, its accuracy depends critically on the faithful representation of input uncertainty, an assumption that is often violated in practice due to correlated, non-Gaussian, and otherwise complex data distributions. In contrast to purely data-driven surrogates that often overlook rigorous input distribution modelling, this paper introduces flow-based PCE, a unified framework that couples expressive input modelling with efficient uncertainty propagation. Specifically, normalising flows are employed to learn an invertible transport map from a simple base distribution to the empirical joint distribution of uncertain inputs, and this map is then integrated directly into the PCE construction. In addition, the Map Smoothness Index (MSI) is introduced as a new metric to quantify the quality of the learned map, and smoother transformations are shown to yield more accurate PCE surrogates. The proposed flow-based PCE framework is validated on benchmark dynamic models, including the IEEE 14-bus system and the Great Britain transmission system, under a range of uncertainty scenarios.
Performance Analysis of LEO-Terrestrial Systems in Presence of Doppler Effect
In this paper, we present a novel stochastic geometry-based approach to analyze the effect of residual Doppler shift on orthogonal frequency-division multiple access (OFDMA) systems in low earth orbit (LEO) satellite-terrestrial networks. Focusing on multiuser systems employing common Doppler compensation, we analytically formulate the coverage probability by explicitly capturing the loss of OFDMA subcarrier orthogonality caused by geometry-induced residual Doppler through inter-carrier interference. The analysis accounts for the spatial distribution of ground terminals within the serving satellite's cell and is validated through extensive Monte-Carlo simulations for both S-band and Ka-band settings. The results demonstrate the high accuracy of both the Doppler shift approximation and the derived coverage probability expression, while also highlighting the significant impact of residual Doppler shift, even after compensation, emphasizing the necessity of considering this effect in the design of future satellite networks.
comment: This work has been submitted to IEEE Wireless Communications Letters
Verifiable Error Bounds for Physics-Informed Neural KKL Observers
This paper proposes a computable state-estimation error bound for learning-based Kazantzis--Kravaris/Luenberger (KKL) observers. Recent work learns the KKL transformation map with a physics-informed neural network (PINN) and a corresponding left-inverse map with a conventional neural network. However, no computable state-estimation error bounds are currently available for this approach. We derive a state-estimation error bound that depends only on quantities that can be certified over a prescribed region using neural network verification. We further extend the result to bounded additive measurement noise and demonstrate the guarantees on nonlinear benchmark systems.
comment: 6 pages, 4 figures
Activate the Dual Cones: A Tight Reformulation of Conic ACOPF Constraints
By exploiting the observed tightness of dual rotated second-order cone (RSOC) constraints, this paper transforms the dual of a conic ACOPF relaxation into an equivalent, non-conic problem where dual constraints are implicitly enforced through eliminated dual RSOC variables. To accomplish this, we apply the RSOC-based Jabr relaxation of ACOPF, pose its dual, and then show that all dual RSOC constraints must be tight (i.e., active) at optimality. We then construct a reduced dual maximization problem with only non-negativity constraints, avoiding the explicit RSOC inequality constraints. Numerical experiments confirm that the tight formulation recovers the same dual objective values as a mature conic solver (e.g., MOSEK via PowerModels) on various PGLib benchmark test systems (ranging from 3- to 1354-buses). The proposed formulation has useful performance benefits, compared with its conic counterpart, and it allows us to define a bounding function which provides a guaranteed lower bound on system cost. While this paper focuses on demonstrating the correctness and validity of the proposed structural simplification, it lays the groundwork for future GPU-accelerated first-order optimization methods which can exploit the unconstrained nature of the proposed formulation.
Meta-Learning for Repeated Bayesian Persuasion
Classical Bayesian persuasion studies how a sender influences receivers through carefully designed signaling policies within a single strategic interaction. In many real-world environments, such interactions are repeated across multiple games, creating opportunities to exploit structural similarity across tasks. In this work, we introduce Meta-Persuasion algorithms, establishing the first line of theoretical results for both full-feedback and bandit-feedback settings in the Online Bayesian Persuasion (OBP) and Markov Persuasion Process (MPP) frameworks. We show that our proposed meta-persuasion algorithms achieve provably sharper regret rates under natural notions of task similarity, improving upon the best-known convergence rates for both OBP and MPP. At the same time, they recover the standard single-game guarantees when the sequence of games is picked arbitrarily. Finally, we complement our theoretical analysis with numerical experiments that highlight our regret improvements and the benefits of meta-learning in repeated persuasion environments.
comment: 40 pages
A Unified Family-optimal Solution to Covariance Intersection Problems with Semidefinite Programming
Covariance intersection (CI) methods provide a principled approach to fusing estimates with unknown cross-correlations by minimizing a worst-case measure of uncertainty that is consistent with the available information. This paper introduces a generalized CI framework, called overlapping covariance intersection (OCI), which unifies several existing CI formulations within a single optimization-based framework. This unification enables the characterization of family-optimal solutions for multiple CI variants, including standard CI and split covariance intersection (SCI), as solutions to a semidefinite program, for which efficient off-the-shelf solvers are available. When specialized to the corresponding settings, the proposed family-optimal solutions recover the state-of-the-art family-optimal solutions previously reported for CI and SCI. The resulting formulation facilitates the systematic design and real-time implementation of CI-based fusion methods in large-scale distributed estimation problems, such as cooperative localization.
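As a point of reference for the abstract above, the following is a minimal sketch of standard two-estimate CI only, not the OCI framework or the semidefinite program proposed in the paper. It uses the textbook fusion rule P^{-1} = w A^{-1} + (1 - w) B^{-1} and picks the weight w by a simple grid search on the trace of P. The function name `covariance_intersection`, the grid size, and the example numbers are all illustrative assumptions.

```python
import numpy as np


def covariance_intersection(a, A, b, B, num_weights=101):
    """Fuse estimates (a, A) and (b, B) with standard CI.

    Hypothetical helper: the weight w in [0, 1] is chosen by grid search
    to minimize trace(P); a real implementation would use a proper solver.
    """
    A_inv = np.linalg.inv(A)
    B_inv = np.linalg.inv(B)
    best = None
    for w in np.linspace(0.0, 1.0, num_weights):
        # Convex combination of the two information matrices.
        P = np.linalg.inv(w * A_inv + (1.0 - w) * B_inv)
        cost = np.trace(P)
        if best is None or cost < best[0]:
            x = P @ (w * A_inv @ a + (1.0 - w) * B_inv @ b)
            best = (cost, x, P, w)
    _, x, P, w = best
    return x, P, w


# Two estimates that are each accurate along a different axis.
a = np.array([1.0, 0.0])
A = np.diag([1.0, 4.0])
b = np.array([0.0, 1.0])
B = np.diag([4.0, 1.0])

x_fused, P_fused, w_opt = covariance_intersection(a, A, b, B)
print("weight:", w_opt)
print("fused mean:", x_fused)
print("fused covariance:\n", P_fused)
```

The grid search keeps the sketch dependency-free; the fused covariance is consistent for any w in [0, 1], so the search only affects how tight the bound is, not its validity.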
Distributed Safety Critical Control among Uncontrollable Agents using Reconstructed Control Barrier Functions
This paper investigates distributed safety-critical control for multi-agent systems (MASs) in the presence of uncontrollable agents with uncertain behaviors. To ensure system safety, the control barrier function (CBF) is employed in this paper. However, a key challenge is that the CBF constraints are coupled when MASs perform collaborative tasks, which depend on information from multiple agents and impede the design of a fully distributed safe control scheme. To overcome this, a novel reconstructed CBF approach is proposed. In this method, the coupled CBF is reconstructed by leveraging state estimates of other agents obtained from a distributed adaptive observer. Furthermore, a prescribed performance adaptive parameter is designed to modify this reconstruction, ensuring that satisfying the reconstructed CBF constraint is sufficient to meet the original coupled one. Based on the reconstructed CBF, we design a safety-critical quadratic programming (QP) controller and prove that the proposed distributed control scheme rigorously guarantees the safety of the MAS, even in uncertain dynamic environments involving uncontrollable agents. The effectiveness of the proposed method is illustrated through a simulation.
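To make the CBF-based QP idea concrete, here is a minimal single-agent sketch. It is not the paper's reconstructed CBF or its distributed observer. It assumes a single-integrator agent (x_dot = u), one known circular obstacle, and the barrier h(x) = ||x - c||^2 - r^2. With a single affine constraint, the QP "minimize ||u - u_nom||^2 subject to grad_h . u >= -alpha * h" has the closed-form projection used below. All names (`cbf_qp_filter`, `alpha`, the example values) are hypothetical.

```python
import numpy as np


def cbf_qp_filter(x, u_nom, obstacle, radius, alpha=1.0):
    """Return the input closest to u_nom that satisfies the CBF condition.

    Assumes x_dot = u and x != obstacle (so the barrier gradient is nonzero).
    """
    diff = x - obstacle
    h = diff @ diff - radius**2          # h >= 0 means "outside the obstacle"
    grad_h = 2.0 * diff
    # CBF condition: grad_h @ u >= -alpha * h.
    violation = -alpha * h - grad_h @ u_nom
    if violation <= 0.0:
        return u_nom                      # nominal input is already safe
    # Project u_nom onto the half-space boundary (closed-form QP solution).
    return u_nom + (violation / (grad_h @ grad_h)) * grad_h


x = np.array([-2.0, 0.0])
obstacle = np.array([0.0, 0.0])

# Nominal input drives straight at the obstacle, so the filter slows it down.
u_safe = cbf_qp_filter(x, np.array([1.0, 0.0]), obstacle, radius=1.0)
# Nominal input moves away from the obstacle, so it passes through unchanged.
u_away = cbf_qp_filter(x, np.array([-1.0, 0.0]), obstacle, radius=1.0)
print(u_safe, u_away)
```

In the coupled multi-agent setting described in the abstract, the constraint would depend on other agents' states, which is exactly what prevents this simple per-agent projection from being applied directly.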
End-to-end guarantees for indirect data-driven control of bilinear systems with finite stochastic data
In this paper we propose an end-to-end algorithm for indirect data-driven control for bilinear systems with stability guarantees. We consider the case where the collected i.i.d. data is affected by probabilistic noise with possibly unbounded support and leverage tools from statistical learning theory to derive finite sample identification error bounds. To this end, we solve the bilinear identification problem by solving a set of linear and affine identification problems, by a particular choice of a control input during the data collection phase. We provide a priori as well as data-dependent finite sample identification error bounds on the individual matrices as well as ellipsoidal bounds, both of which are structurally suitable for control. Further, we integrate the structure of the derived identification error bounds in a robust controller design to obtain an exponentially stable closed-loop. By means of an extensive numerical study we showcase the interplay between the controller design and the derived identification error bounds. Moreover, we note appealing connections of our results to indirect data-driven control of general nonlinear systems through Koopman operator theory and discuss how our results may be applied in this setup.
comment: Accepted for publication in Automatica
PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis
This paper introduces PowerDAG, an agentic AI system for automating complex distribution-grid analysis. We address the reliability challenges of state-of-the-art agentic systems in automating complex engineering workflows by introducing two innovative active mechanisms: adaptive retrieval, which uses a similarity-decay cutoff algorithm to dynamically select the most relevant annotated exemplars as context, and just-in-time (JIT) supervision, which actively intercepts and corrects tool-usage violations during execution. On a benchmark of unseen distribution grid analysis queries, PowerDAG achieves a 100% success rate with GPT-5.2 and 94.4-96.7% with smaller open-source models, outperforming base ReAct (41-88%), LangChain (30-90%), and CrewAI (9-41%) baselines by margins of 6-50 percentage points.
Virtual Sensing for Solder Layer Degradation and Temperature Monitoring in IGBT Modules
Monitoring the degradation state of Insulated Gate Bipolar Transistor (IGBT) modules is essential for ensuring the reliability and longevity of power electronic systems, especially in safety-critical and high-performance applications. However, direct measurement of key degradation indicators - such as junction temperature, solder fatigue or delamination - remains challenging due to the physical inaccessibility of internal components and the harsh environment. In this context, machine learning-based virtual sensing offers a promising alternative by bridging the gap from feasible sensor placement to the relevant but inaccessible locations. This paper explores the feasibility of estimating the degradation state of solder layers, and the corresponding full temperature maps based on a limited number of physical sensors. Based on synthetic data of a specific degradation mode, we obtain a high accuracy in the estimation of the degraded solder area (1.17% mean absolute error), and are able to reproduce the surface temperature of the IGBT with a maximum relative error of 4.56% (corresponding to an average relative error of 0.37%).
comment: Andrea Urgolo and Monika Stipsitz contributed equally to this work
On Policy Stochasticity in Mutual Information Optimal Control of Linear Systems
In recent years, mutual information optimal control has been proposed as an extension of maximum entropy optimal control. Both approaches introduce regularization terms to render the policy stochastic, and it is important to theoretically clarify the relationship between the temperature parameter (i.e., the coefficient of the regularization term) and the stochasticity of the policy. Unlike in maximum entropy optimal control, this relationship remains unexplored in mutual information optimal control. In this paper, we investigate this relationship for a mutual information optimal control problem (MIOCP) of discrete-time linear systems. After extending the result of a previous study of the MIOCP, we establish the existence of an optimal policy of the MIOCP, and then derive the respective conditions on the temperature parameter under which the optimal policy becomes stochastic and deterministic. Furthermore, we also derive the respective conditions on the temperature parameter under which the policy obtained by an alternating optimization algorithm becomes stochastic and deterministic. The validity of the theoretical results is demonstrated through numerical experiments.
comment: 18 pages. Revised potentially misleading phrasing from v1. The main arguments and discussions remain unchanged
Estimation of Cell-to-Cell Variation and State of Health for Battery Modules with Parallel-Connected Cells
Estimating cell-to-cell variation (CtCV) and state of health (SoH) for battery modules with parallel-connected cells is challenging when only module-level signals are measurable and individual cell behaviors remain unobserved. Although progress has been made in SoH estimation, CtCV estimation remains unresolved in the literature. This paper proposes a unified framework that accurately estimates both CtCV and SoH for modules using only module-level information extracted from incremental capacity analysis (ICA) and differential voltage analysis (DVA). With the proposed framework, CtCV and SoH estimations can be decoupled into two separate tasks, allowing each to be solved with dedicated algorithms without mutual interference and providing greater design flexibility. The framework also exhibits strong versatility in accommodating different CtCV metrics, highlighting its general-purpose nature. Experimental validation on modules with three parallel-connected cells demonstrates that the proposed framework can systematically select optimal module-level features for CtCV and SoH estimations, deliver accurate CtCV and SoH estimates with high confidence and low computational complexity, remain effective across different C-rates, and be suitable for onboard implementation.
comment: Corrected some typos in the reference section
Optimization via a Control-Centric Framework
Optimization plays a central role in intelligent systems and cyber-physical technologies, where speed and reliability of convergence directly impact performance. In control theory, optimization-centric methods are standard: controllers are designed by repeatedly solving optimization problems, as in linear quadratic regulation, $H_\infty$ control, and model predictive control. In contrast, this paper develops a control-centric framework for optimization itself, where algorithms are constructed directly from Lyapunov stability principles rather than being proposed first and analyzed afterward. A key element is the stationarity vector, which encodes first-order optimality conditions and enables Lyapunov-based convergence analysis. By pairing a Lyapunov function with a selectable decay law, we obtain continuous-time dynamics with guaranteed exponential, finite-time, fixed-time, or prescribed-time convergence. Within this framework, we introduce three feedback realizations of increasing restrictiveness: the Hessian-gradient, Newton, and gradient dynamics. Each realization shapes the decay of the stationarity vector to achieve the desired rate. These constructions unify unconstrained optimization, extend naturally to constrained problems via Lyapunov-consistent primal-dual dynamics, and broaden the results for minimax and generalized Nash equilibrium seeking problems beyond exponential stability. The framework provides systematic design tools for optimization algorithms in control and game-theoretic problems.
comment: This work has been submitted to the IEEE for possible publication. 12 pages, 3 figures
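As a hedged illustration of the Lyapunov-plus-decay-law idea in the abstract above, consider the textbook unconstrained case. The choice of the gradient as the stationarity vector, the symbols $g$, $V$, $\alpha$, and the assumption that the Hessian is invertible along the trajectory are ours, not taken from the paper.

```latex
% Assumed setup: f twice differentiable, Hessian invertible, gain \alpha > 0.
\begin{align*}
  g(x) &= \nabla f(x), \qquad V(x) = \tfrac{1}{2}\,\lVert g(x) \rVert^{2}, \\
  \dot{x} &= -\alpha\,\bigl[\nabla^{2} f(x)\bigr]^{-1} g(x)
    \quad \text{(Newton-type dynamics)}, \\
  \dot{g} &= \nabla^{2} f(x)\,\dot{x} = -\alpha\, g, \\
  \dot{V} &= g^{\top}\dot{g} = -\alpha\,\lVert g \rVert^{2} = -2\alpha V
    \;\;\Longrightarrow\;\; V(t) = V(0)\,e^{-2\alpha t}.
\end{align*}
```

Here the exponential decay law $\dot{V} = -2\alpha V$ is obtained by construction; replacing it with a different decay law is, in spirit, how finite-time or fixed-time rates would be targeted.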
A Hybrid Systems Model of Feedback Optimization for Linear Systems: Convergence and Robustness
Feedback optimization algorithms compute inputs to a system using real-time output measurements, which helps mitigate the effects of disturbances. However, existing work often models both system dynamics and computations in either discrete or continuous time, which may not accurately model some applications. In this work, we model linear system dynamics in continuous time, and we model the computations of inputs in discrete time. Therefore, we present a novel hybrid systems model of feedback optimization. We first establish the well-posedness of this hybrid model and establish completeness of solutions while ruling out Zeno behavior. Then we show the state of the system converges exponentially fast to a ball of known radius about a desired goal state. Next we analytically show that this system is robust to perturbations in (i) the values of measured outputs, (ii) the matrices that model the linear time-invariant system, and (iii) the times at which inputs are applied to the system. Simulation results confirm that this approach successfully mitigates the effects of disturbances.
comment: 16 Pages, 2 Figures, 1 Table, submitted to American Control Conference 2026
Schrödinger Bridge Over A Compact Connected Lie Group
This work studies the Schrödinger bridge problem for the kinematic equation on a compact connected Lie group. The objective is to steer a controlled diffusion between given initial and terminal densities supported over the Lie group while minimizing the control effort. We develop a coordinate-free formulation of this stochastic optimal control problem that respects the underlying geometric structure of the Lie group, thereby avoiding limitations associated with local parameterizations or embeddings in Euclidean spaces. We establish the existence and uniqueness of the solution to the corresponding Schrödinger system. Our results are constructive in that they derive a geometric controller that optimally interpolates probability densities supported over the Lie group. To illustrate the results, we provide numerical examples on $\mathsf{SO}(2)$ and $\mathsf{SO}(3)$. The codes and animations are publicly available at https://github.com/gradslab/SbpLieGroups.git .
A Converse Control Lyapunov Theorem for Joint Safety and Stability
We show that the existence of a strictly compatible pair of control Lyapunov and control barrier functions is equivalent to the existence of a single smooth Lyapunov function that certifies both asymptotic stability and safety. This characterization complements existing literature on converse Lyapunov functions by establishing a partial differential equation (PDE) characterization with prescribed boundary conditions on the safe set, ensuring that the safe set is exactly certified by this Lyapunov function. The result also implies that if a safety and stability specification cannot be certified by a single Lyapunov function, then any pair of control Lyapunov and control barrier functions necessarily leads to a conflict and cannot be satisfied simultaneously in a robust sense.
comment: This version is to appear in the Proceedings of the 2026 American Control Conference (ACC)
Saddle Point Evasion via Curvature-Regularized Gradient Dynamics
Nonconvex optimization underlies many modern machine learning and control tasks, where saddle points pose the dominant obstacle to reliable convergence in high-dimensional settings. Escaping these saddle points deterministically and at a controllable rate remains an open challenge: gradient descent is blind to curvature, stochastic perturbation methods lack deterministic guarantees, and Newton-type approaches suffer from Hessian singularity. We present Curvature-Regularized Gradient Dynamics (CRGD), which augments the objective with a smooth penalty on the most negative Hessian eigenvalue, yielding an augmented cost that serves as an optimization Lyapunov function with user-selectable convergence rates to second-order stationary points. Numerical experiments on a nonconvex matrix factorization example confirm that CRGD escapes saddle points across all tested configurations, with escape time that decreases with the eigenvalue gap, in contrast to gradient descent, whose escape time grows inversely with the gap.
comment: This work has been submitted to the IEEE for possible publication. 6 pages, 3 figures
The FABRIC Strategy for Verifying Neural Feedback Systems
Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neural feedback systems, i.e., dynamical systems controlled by neural networks, and a number of directions have been proposed and studied. In contrast, far less attention has been given to backward reachability analysis for these systems, in part because of the limited scalability of known techniques. In this work, we begin to address this gap by introducing new algorithms for computing both over- and underapproximations of backward reachable sets for nonlinear neural feedback systems. We also describe and implement an integration of these backward reachability techniques with existing ones for forward analysis. We call the resulting algorithm Forward and Backward Reachability Integration for Certification (FaBRIC). We evaluate our algorithms on a representative set of benchmarks and show that they significantly outperform the prior state of the art.
Robotics
A Passive Elastic-Folding Mechanism for Stackable Airdrop Sensors ICRA 2026
Air-dispersed sensor networks deployed from aerial robotic systems (e.g., UAVs) provide a low-cost approach to wide-area environmental monitoring. However, existing methods often rely on active actuators for mid-air shape or trajectory control, increasing both power consumption and system cost. Here, we introduce a passive elastic-folding hinge mechanism that transforms sensors from a flat, stackable form into a three-dimensional structure upon release. Hinges are fabricated by laminating commercial sheet materials with rigid printed circuit boards (PCBs) and programming fold angles through a single oven-heating step, enabling scalable production without specialized equipment. Our geometric model links laminate geometry, hinge mechanics, and resulting fold angle, providing a predictive design methodology for target configurations. Laboratory tests confirmed fold angles between 10 degrees and 100 degrees, with a standard deviation of 4 degrees and high repeatability. Field trials further demonstrated reliable data collection and LoRa transmission during dispersion, while the Horizontal Wind Model (HWM)-based trajectory simulations indicated strong potential for wide-area sensing exceeding 10 km.
comment: 8 pages, 8 figures, The 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors
Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module utilizing CoTracker3 and VGGT. This pipeline supports high visual diversity and physical fidelity without manual intervention. To evaluate the generated data, we train imitation learning policies on synthesized trajectories encompassing diverse object and environment variations. Extensive evaluations on tabletop manipulation tasks using the Piper robotic arm demonstrate that our policies robustly generalize to unseen objects in simulation and achieve effective sim-to-real transfer, successfully manipulating novel real-world objects.
comment: 8 pages, 6 figures
"You've got a friend in me": Co-Designing a Peer Social Robot for Young Newcomers' Language and Cultural Learning
Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and added them in an integrated prototype that uses short story-based activities, multi-modal scaffolding (speech, facial feedback, gesture), and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-the-loop SARs supporting language socialization in community settings and outline directions for child-centered evaluation in authentic programs.
ViTac-Tracing: Visual-Tactile Imitation Learning of Deformable Object Tracing ICRA2026
Deformable objects often appear in unstructured configurations. Tracing deformable objects helps bring them into extended states and facilitates downstream manipulation tasks. Due to the requirements for object-specific modeling or sim-to-real transfer, existing tracing methods either lack generalizability across different categories of deformable objects or struggle to complete tasks reliably in the real world. To address this, we propose a novel visual-tactile imitation learning method to achieve one-dimensional (1D) and two-dimensional (2D) deformable object tracing with a unified model. Our method is designed from both local and global perspectives based on visual and tactile sensing. Locally, we introduce a weighted loss that emphasizes actions maintaining contact near the center of the tactile image, improving fine-grained adjustment. Globally, we propose a tracing task loss that helps the policy regulate task progression. On the hardware side, to compensate for the limited features extracted from visual information, we integrate tactile sensing into a low-cost teleoperation system considering both the teleoperator and the robot. Extensive ablation and comparative experiments on diverse 1D and 2D deformable objects demonstrate the effectiveness of our approach, achieving an average success rate of 80% on seen objects and 65% on unseen objects.
comment: The paper has been accepted by ICRA2026
Empathetic Motion Generation for Humanoid Educational Robots via Reasoning-Guided Vision--Language--Motion Diffusion Architecture
This article proposes a reasoning-guided vision-language-motion diffusion framework (RG-VLMD) for generating instruction-aware co-speech gestures for humanoid robots in educational scenarios. The system integrates multi-modal affective estimation, pedagogical reasoning, and teaching-act-conditioned motion synthesis to enable adaptive and semantically consistent robot behavior. A gated mixture-of-experts model predicts Valence/Arousal from input text, visual, and acoustic features, which are then mapped to discrete teaching-act categories through an affect-driven policy. These signals condition a diffusion-based motion generator using clip-level intent and frame-level instructional schedules via additive latent restriction with auxiliary action-group supervision. Compared to a baseline diffusion model, our proposed method produces more structured and distinctive motion patterns, as verified by motion statistics and pairwise distance analysis. Generated motion sequences remain physically plausible and can be retargeted to a NAO robot for real-time execution. The results reveal that reasoning-guided instructional conditioning improves gesture controllability and pedagogical expressiveness in educational human-robot interaction.
ROFT-VINS: Robust Feature Tracking-based Visual-Inertial State Estimation for Harsh Environment
SLAM (Simultaneous Localization and Mapping) and Odometry are important systems for estimating the position of mobile devices, such as robots and cars, utilizing one or more sensors. Particularly in camera-based SLAM or Odometry, effectively tracking visual features is important as it significantly impacts system performance. In this paper, we propose a method that leverages deep learning to robustly track visual features in monocular camera images. This method operates reliably even in textureless environments and situations with rapid lighting changes. Additionally, we evaluate the performance of our proposed method by integrating it into VINS-Fusion (Monocular-Inertial), a commonly used Visual-Inertial Odometry (VIO) system.
comment: 6 pages, published at ICCAS 2024
CSSDF-Net: Safe Motion Planning Based on Neural Implicit Representations of Configuration Space Distance Field
High-dimensional manipulator operation in unstructured environments requires a differentiable, scene-agnostic distance query mechanism to guide safe motion generation. Existing geometric collision checkers are typically non-differentiable, while workspace-based implicit distance models are hindered by the highly nonlinear workspace-configuration mapping and often suffer from poor convergence; moreover, self-collision and environment collision are commonly handled as separate constraints. We propose Configuration-Space Signed Distance Field-Net (CSSDF-Net), which learns a continuous signed distance field directly in configuration space to provide joint-space distance and gradient queries under a unified geometric notion of safety. To enable zero-shot generalization without environment-specific retraining, we introduce a spatial-hashing-based data generation pipeline that encodes robot-centric geometric priors and supports efficient retrieval of risk configurations for arbitrary obstacle point sets. The learned distance field is integrated into safety-constrained trajectory optimization and receding-horizon MPC, enabling both offline planning and online reactive avoidance. Experiments on a planar arm and a 7-DoF manipulator demonstrate stable gradients, effective collision avoidance in static and dynamic scenes, and practical inference latency for large-scale point-cloud queries, supporting deployment in previously unseen environments.
REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation
Zero-shot object-goal navigation (ZSON) requires navigating unknown environments to find a target object without task-specific training. Prior hierarchical training-free solutions invest in scene understanding (belief) and high-level decision-making (policy), yet overlook the design of the option, i.e., a subgoal candidate proposed from evolving belief and presented to policy for selection. In practice, options are reduced to isolated waypoints scored independently: single destinations hide the value gathered along the journey; an unstructured collection obscures the relationships among candidates. Our insight is that the option space should be a tree of paths. Full paths expose en-route information gain that destination-only scoring systematically neglects; a tree of shared segments enables coarse-to-fine LLM reasoning that dismisses or pursues entire branches before examining individual leaves, compressing the combinatorial path space into an efficient hierarchy. We instantiate this insight in REST (Receding Horizon Explorative Steiner Tree), a training-free framework that (1) builds an explicit open-vocabulary 3D map from online RGB-D streams; (2) grows an agent-centric tree of safe and informative paths as the option space via sampling-based planning; and (3) textualizes each branch into a spatial narrative and selects the next-best path through chain-of-thought LLM reasoning. Across the Gibson, HM3D, and HSSD benchmarks, REST consistently ranks among the top methods in success rate while achieving the best or second-best path efficiency, demonstrating a favorable efficiency-success balance.
Benchmarking Visual Feature Representations for LiDAR-Inertial-Visual Odometry Under Challenging Conditions
Accurate localization in autonomous driving is critical for successful missions, including environmental mapping and survivor searches. In visually challenging environments, including low-light conditions, overexposure, illumination changes, and high parallax, the performance of conventional visual odometry methods degrades significantly, undermining robust robotic navigation. Researchers have recently proposed LiDAR-inertial-visual odometry (LIVO) frameworks that integrate LiDAR, IMU, and camera sensors to address these challenges. This paper extends the FAST-LIVO2-based framework by introducing a hybrid approach that integrates direct photometric methods with descriptor-based feature matching. For the descriptor-based feature matching, this work considers four pairings: ORB with the Hamming distance, SuperPoint with SuperGlue, SuperPoint with LightGlue, and XFeat with the mutual nearest neighbor. The proposed configurations are benchmarked by accuracy, computational cost, and feature tracking stability, enabling a quantitative comparison of the adaptability and applicability of visual descriptors. The experimental results reveal that the proposed hybrid approach outperforms the conventional sparse-direct method. Although the sparse-direct method often fails to converge in regions where photometric inconsistency arises due to illumination changes, the proposed approach maintains robust performance under the same conditions. Furthermore, the hybrid approach with learning-based descriptors enables robust and reliable visual state estimation across challenging environments.
comment: 14 pages, published in IEEE Access 2026
TiBCLaG: A Trigger-induced Bistable Compliant Laparoscopic Grasper
Industrial laparoscopic graspers use multi-link rigid mechanisms manufactured to tight tolerances, resulting in high manufacturing and assembly costs. This work presents the design and proof-of-concept validation of a monolithic, fully compliant, bistable, laparoscopic grasper that eliminates the need for multiple rigid links, thereby reducing part count. The device integrates a compliant trigger and a compliant gripper end-effector, coupled via a control push-rod, to achieve stable grasping without continuous user input. The trigger mechanism is synthesized using a Two-Element Beam Constraint Model as a design framework to control the deformation and stiffness of V-beam-like elements. This technique enables elastic energy storage while preventing snap-through instability. The end-effector is designed as a compliant gripper to achieve adaptive grasping through elastic deformation. Jaws' opening-and-closing performance is demonstrated using nonlinear finite element analysis. The laparoscopic design presented here is fabricated using fused deposition 3D printing. The fabricated prototype demonstrates reliable bistable actuation, confirming the feasibility of such compliant laparoscopic grasper architectures.
comment: 17 pages, 13 figures
Inductance-Based Force Self-Sensing in Fiber-Reinforced Pneumatic Twisted-and-Coiled Actuators
Fiber-reinforced pneumatic twisted-and-coiled actuators (FR-PTCAs) offer high power density and compliance, but their strong hysteresis and lack of intrinsic proprioception limit effective closed-loop control. This paper presents a self-sensing FR-PTCA integrated with a conductive nickel wire that enables intrinsic force estimation and indirect displacement inference via inductance feedback. Experimental characterization reveals that the actuator exhibits a deterministic, low-hysteresis inductance-force relationship at constant pressures, in contrast to the strongly hysteretic inductance-length behavior. Leveraging this property, this paper develops a parametric self-sensing model and a nonlinear hybrid observer that integrates an Extended Kalman Filter (EKF) with constrained optimization to resolve the ambiguity in the inductance-force mapping and estimate actuator states. Experimental results demonstrate that the proposed approach achieves force estimation accuracy comparable to that of external load cells and maintains robust performance under varying load conditions.
HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage
This paper transfers three statistical methods from particle physics to multirotor propeller fault detection: the likelihood ratio test (LRT) for binary detection, the CLs modified frequentist method for false alarm rate control, and sequential neural posterior estimation (SNPE) for quantitative fault characterization. Operating on spectral features tied to rotor harmonic physics, the system returns three outputs: binary detection, controlled false alarm rates, and calibrated posteriors over fault severity and motor location. On UAV-FD, a hexarotor dataset of 18 real flights with 5% and 10% blade damage, leave-one-flight-out cross-validation gives AUC 0.862 +/- 0.007 (95% CI: 0.849--0.876), outperforming CUSUM (0.708 +/- 0.010), autoencoder (0.753 +/- 0.009), and LSTM autoencoder (0.551). At 5% false alarm rate the system detects 93% of significant and 81% of subtle blade damage. On PADRE, a quadrotor platform, AUC reaches 0.986 after refitting only the generative models. SNPE gives a full posterior over fault severity (90% credible interval coverage 92--100%, MAE 0.012), so the output includes uncertainty rather than just a point estimate or fault flag. Per-flight sequential detection achieves 100% fault detection with 94% overall accuracy.
comment: 12 Pages, 8 Figures
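As a rough illustration of the CLs construction named in the abstract above, the sketch below computes CLs = CL_{s+b} / CL_b for a one-dimensional Gaussian test statistic. The Gaussian model, the function names, and the convention that larger values of the statistic look more fault-like are assumptions chosen for illustration; none of them is taken from the paper.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def cls_value(x_obs, fault_mean, healthy_mean=0.0, std=1.0):
    """CLs for a toy statistic (hypothetical model, not the paper's).

    CL_{s+b}: probability of a value at most x_obs if a fault is present.
    CL_b:     probability of a value at most x_obs if the rotor is healthy.
    Note: CL_b approaches zero for very negative x_obs, so a real
    implementation would need to guard the division.
    """
    cl_sb = normal_cdf(x_obs, fault_mean, std)
    cl_b = normal_cdf(x_obs, healthy_mean, std)
    return cl_sb / cl_b


# An observation that sits at the healthy mean versus one at the fault mean.
healthy_like = cls_value(x_obs=0.0, fault_mean=3.0)
fault_like = cls_value(x_obs=3.0, fault_mean=3.0)

alpha = 0.05
print("fault hypothesis excluded (healthy-like):", healthy_like < alpha)
print("fault hypothesis excluded (fault-like):", fault_like < alpha)
```

Dividing by CL_b is what distinguishes CLs from a plain p-value: it makes the test reluctant to exclude the fault hypothesis when the observation is also unlikely under the healthy model.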
Scaling Sim-to-Real Reinforcement Learning for Robot VLAs with Generative 3D Worlds
The strong performance of large vision-language models (VLMs) trained with reinforcement learning (RL) has motivated similar approaches for fine-tuning vision-language-action (VLA) models in robotics. Many recent works fine-tune VLAs directly in the real world to avoid addressing the sim-to-real gap. While real-world RL circumvents sim-to-real issues, it inherently limits the generality of the resulting VLA, as scaling scene and object diversity in the physical world is prohibitively difficult. This leads to the paradoxical outcome of transforming a broadly pretrained model into an overfitted, scene-specific policy. Training in simulation can instead provide access to diverse scenes, but designing those scenes is also costly. In this work, we show that VLAs can be RL fine-tuned without sacrificing generality and with reduced labor by leveraging 3D world generative models. Using these models together with a language-driven scene designer, we generate hundreds of diverse interactive scenes containing unique objects and backgrounds, enabling scalable and highly parallel policy learning. Starting from a pretrained imitation baseline, our approach increases simulation success from 9.7% to 79.8% while achieving a 1.25$\times$ speedup in task completion time. We further demonstrate successful sim-to-real transfer enabled by the quality of the generated digital twins together with domain randomization, improving real-world success from 21.7% to 75% and achieving a 1.13$\times$ speedup. Finally, we highlight the benefits of leveraging the effectively unlimited data from 3D world generative models through an ablation study showing that increasing scene diversity directly improves zero-shot generalization.
Robotic Agentic Platform for Intelligent Electric Vehicle Disassembly
Electric vehicles (EVs) create an urgent need for scalable battery recycling, yet disassembly of EV battery packs remains largely manual due to high design variability. We present our Robotic Agentic Platform for Intelligent Disassembly (RAPID), designed to investigate perception-driven manipulation, flexible automation, and AI-assisted robot programming in realistic recycling scenarios. The system integrates a gantry-mounted industrial manipulator, RGB-D perception, and an automated nut-running tool for fastener removal on a full-scale EV battery pack. An open-vocabulary object detection pipeline achieves 0.9757 mAP50, enabling reliable identification of screws, nuts, busbars, and other components. We experimentally evaluate (n=204) three one-shot fastener removal strategies: taught-in poses (97% success rate, 24 min duration), one-shot vision execution (57%, 29 min), and visual servoing (83%, 36 min), comparing success rate and disassembly time for the battery's top cover fasteners. To support flexible interaction, we introduce agentic AI specifications for robotic disassembly tasks, allowing LLM agents to translate high-level instructions into robot actions through structured tool interfaces and ROS services. We evaluate SmolAgents with GPT-4o-mini and Qwen 3.5 9B/4B on edge hardware. Tool-based interfaces achieve 100% task completion, while automatic ROS service discovery shows 43.3% failure rates, highlighting the need for structured robot APIs for reliable LLM-driven control. This open-source platform enables systematic investigation of human-robot collaboration, agentic robot programming, and increasingly autonomous disassembly workflows, providing a practical foundation for research toward scalable robotic battery recycling.
Computationally Efficient Density-Driven Optimal Control via Analytical KKT Reduction and Contractive MPC
Efficient coordination for collective spatial distribution is a fundamental challenge in multi-agent systems. Prior research on Density-Driven Optimal Control (D2OC) established a framework to match agent trajectories to a desired spatial distribution. However, implementing this as a predictive controller requires solving a large-scale Karush-Kuhn-Tucker (KKT) system, whose computational complexity grows cubically with the prediction horizon. To resolve this, we propose an analytical structural reduction that transforms the T-horizon KKT system into a condensed quadratic program (QP). This formulation achieves O(T) linear scalability, significantly reducing the online computational burden compared to conventional O(T^3) approaches. Furthermore, to ensure rigorous convergence in dynamic environments, we incorporate a contractive Lyapunov constraint and prove the Input-to-State Stability (ISS) of the closed-loop system against reference propagation drift. Numerical simulations verify that the proposed method facilitates rapid density coverage with substantial computational speed-up, enabling long-horizon predictive control for large-scale multi-agent swarms.
MemoAct: Atkinson-Shiffrin-Inspired Memory-Augmented Visuomotor Policy for Robotic Manipulation
Memory-augmented robotic policies are essential in handling memory-dependent tasks. However, existing approaches typically rely on simple observation window extensions, struggling to simultaneously achieve precise task state tracking and robust long-horizon retention. To overcome these challenges, inspired by the Atkinson-Shiffrin memory model, we propose MemoAct, a hierarchical memory-based policy that leverages distinct memory tiers to tackle specific bottlenecks. Specifically, lossless short-term memory ensures precise task state tracking, while compressed long-term memory enables robust long-horizon retention. To enrich the evaluation landscape, we construct MemoryRTBench based on RoboTwin 2.0, specifically tailored to assess policy capabilities in task state tracking and long-horizon retention. Extensive experiments across simulated and real-world scenarios demonstrate that MemoAct achieves superior performance compared to both existing Markovian baselines and history-aware policies. The project page is available at https://tlf-tlf.github.io/MemoActPage/.
Fundamental Limits for Sensor-Based Control via the Gibbs Variational Principle
Fundamental limits on the performance of feedback controllers are essential for benchmarking algorithms, guiding sensor selection, and certifying task feasibility -- yet few general-purpose tools exist for computing them. Existing information-theoretic approaches overestimate the information a sensor must provide by evaluating it against the uncontrolled system, producing bounds that degrade precisely when feedback is most valuable. We derive a lower bound on the minimum expected cost of any causal feedback controller under partial observations by applying the Gibbs variational principle to the joint path measure over states and observations. The bound applies to nonlinear, nonholonomic, and hybrid dynamics with unbounded costs and admits a self-consistent refinement: any good controller concentrates the state, which limits the information the sensor can extract, which tightens the bound. The resulting fixed-point equation has a unique solution computable by bisection, and we provide conditions under which the free energy minimization is provably convex, yielding a certifiably correct numerical bound. On a nonlinear Dubins car tracking problem, the self-consistent bound captures most of the optimal cost across sensor noise levels, while the open-loop variant is vacuous at low noise.
comment: 6 pages, 1 figure
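For reference, the Gibbs variational principle that the abstract above builds on is usually stated as follows. The symbols P (a reference probability measure), Q (any measure absolutely continuous with respect to P), and f (a cost function with finite exponential moment) are generic; how the paper instantiates them on the joint path measure over states and observations is not specified in the abstract.

```latex
% Gibbs variational principle (standard form, generic symbols)
-\log \mathbb{E}_{P}\!\left[ e^{-f} \right]
  \;=\; \inf_{Q \ll P} \Big\{ \mathbb{E}_{Q}[f] + D_{\mathrm{KL}}(Q \,\|\, P) \Big\},
\qquad
\frac{dQ^{\star}}{dP} \;=\; \frac{e^{-f}}{\mathbb{E}_{P}\!\left[ e^{-f} \right]} .
```

Read as an inequality, the left-hand side lower-bounds the expected cost under any Q once the Kullback-Leibler term is accounted for, which is the general mechanism by which an information budget can be turned into a cost bound.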
Efficient and Versatile Quadrupedal Skating: Optimal Co-design via Reinforcement Learning and Bayesian Optimization
In this paper, we present a hardware-control co-design approach that enables efficient and versatile roller skating on quadrupedal robots equipped with passive wheels. Passive-wheel skating reduces leg inertia and improves energy efficiency, particularly at high speeds. However, the absence of direct wheel actuation tightly couples mechanical design and control. To unlock the full potential of this modality, we formulate a bilevel optimization framework: an upper-level Bayesian Optimization searches the mechanical design space, while a lower-level Reinforcement Learning trains a motor control policy for each candidate design. The resulting design-policy pairs not only outperform human-engineered baselines, but also exhibit versatile behaviors such as hockey stop (rapid braking by turning sideways to maximize friction) and self-aligning motion (automatic reorientation to improve energy efficiency in the direction of travel), offering the first system-level study of dynamic skating motion on quadrupedal robots.
Graph-of-Constraints Model Predictive Control for Reactive Multi-agent Task and Motion Planning ICRA 2026
Sequences of interdependent geometric constraints are central to many multi-agent Task and Motion Planning (TAMP) problems. However, existing methods for handling such constraint sequences struggle with partially ordered tasks and dynamic agent assignments. They typically assume static assignments and cannot adapt when disturbances alter task allocations. To overcome these limitations, we introduce Graph-of-Constraints Model Predictive Control (GoC-MPC), a generalized sequence-of-constraints framework integrated with MPC. GoC-MPC naturally supports partially ordered tasks, dynamic agent coordination, and disturbance recovery. By defining constraints over tracked 3D keypoints, our method robustly solves diverse multi-agent manipulation tasks, coordinating agents and adapting online from visual observations alone, without relying on training data or environment models. Experiments demonstrate that GoC-MPC achieves higher success rates, significantly faster TAMP computation, and shorter overall paths compared to recent baselines, establishing it as an efficient and robust solution for multi-agent manipulation under real-world disturbances. Our supplementary video and code can be found at https://sites.google.com/view/goc-mpc/home.
comment: 8 main content pages, 4 main content figures, camera ready version submitted to IEEE International Conference on Robotics and Automation (ICRA 2026)
RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach
Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
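The sketch below illustrates two points from the abstract above. First, it shows one common way a Q-ensemble can penalize overconfident estimates: subtracting a multiple of the ensemble spread from the ensemble mean. This rule, the function name, and the `beta` parameter are assumptions for illustration, not necessarily the penalty RE-SAC uses. Second, it checks that the reported 62% reduction is consistent with the two MAE values quoted in the abstract.

```python
import statistics


def pessimistic_q(q_estimates, beta=1.0):
    """Ensemble mean minus beta times the ensemble spread.

    Hypothetical penalty rule: large disagreement between ensemble
    members is read as a sign of sparse data and lowers the estimate.
    """
    mean_q = statistics.fmean(q_estimates)
    spread = statistics.pstdev(q_estimates)
    return mean_q - beta * spread


# Members agree: no penalty. Members disagree: the estimate is pulled down.
agreeing = pessimistic_q([10.0, 10.0, 10.0])
disagreeing = pessimistic_q([4.0, 10.0, 16.0])

# Consistency check on the figures quoted in the abstract.
mae_re_sac, mae_reference = 1647.0, 4343.0
relative_reduction = 1.0 - mae_re_sac / mae_reference
print(f"relative MAE reduction: {relative_reduction:.1%}")
```

Both ensembles have the same mean, so the difference between `agreeing` and `disagreeing` comes entirely from the disagreement term.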
Contact Status Recognition and Slip Detection with a Bio-inspired Tactile Hand
Stable and reliable grasping is critical to robotic manipulation, especially for fragile and glazed objects, where the grasp force requires precise control: excessive force may damage the object, while insufficient force leads to slipping and dropping. Although it is often assumed that the object to be manipulated is grasped firmly in advance, slip detection and timely prevention are necessary for a robot in unstructured, general-purpose environments. In this work, we address this issue by utilizing multimodal tactile feedback from a five-fingered bio-inspired hand. Motivated by human hands, the tactile sensing elements are distributed and embedded in the soft skin of the robotic hand, forming 24 tactile channels in total. Unlike the threshold method widely employed in existing work, we first convert the slip detection problem into contact status recognition in combination with a binning technique, and then detect the slip onset time from the recognition results. After the 24-channel tactile signals pass through a discrete wavelet transform, 17 features are extracted from different time and frequency bands. With the optimal 120 features employed for status recognition, the test accuracy reached 96.39% across three sliding speeds and six kinds of materials. When applied to four unseen materials, a high accuracy of 91.95% was still achieved, which further validates the generalization of our proposed method. Finally, the performance of slip detection is verified based on the trained model of contact status recognition.
comment: 7 pages, 9 figures
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
comment: 31 pages, 12 figures
Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models ICLR
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M--7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA libero_goal: 94% $\to$ 10% under wrong prompts vs. libero_object: 60--100% regardless). In all three multi-pathway architectures ($π_{0.5}$, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics ($2\times$ greater behavioral displacement from expert injection), and subspace injection confirms these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation reveals sensitivity spanning 28--92% zero-effect rates independent of representation width. We release Action Atlas (https://action-atlas.com) for interactive exploration of VLA representations across all six models.
comment: Accepted to Multimodal Intelligence Workshop @ ICLR
NavTrust: Benchmarking Trustworthiness for Embodied Navigation
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
comment: Project Website: https://navtrust.github.io
OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation
Contact-rich manipulation tasks, such as wiping and assembly, require accurate perception of contact forces, friction changes, and state transitions that cannot be reliably inferred from vision alone. Despite growing interest in visuo-tactile manipulation, progress is constrained by two persistent limitations: existing datasets are small in scale and narrow in task coverage, and current methods treat tactile signals as passive observations rather than using them to model contact dynamics or enable closed-loop control explicitly. In this paper, we present OmniViTac, a large-scale visuo-tactile-action dataset comprising 21,000+ trajectories across 86 tasks and 100+ objects, organized into six physics-grounded interaction patterns. Building on this dataset, we propose OmniVTA, a world-model-based visuo-tactile manipulation framework that integrates four tightly coupled modules: a self-supervised tactile encoder, a two-stream visuo-tactile world model for predicting short-horizon contact evolution, a contact-aware fusion policy for action generation, and a 60 Hz reflexive controller that corrects deviations between predicted and observed tactile signals in a closed loop. Real-robot experiments across all six interaction categories show that OmniVTA outperforms existing methods and generalizes well to unseen objects and geometric configurations, confirming the value of combining predictive contact modeling with high-frequency tactile feedback for contact-rich manipulation. All data, models, and code will be made publicly available on the project website at https://mrsecant.github.io/OmniVTA.
comment: TARS Robotics Project Page: https://mrsecant.github.io/OmniVTA
FASTER: Rethinking Real-Time Flow VLAs FAST
Real-time execution is crucial for deploying Vision-Language-Action (VLA) models in the physical world. Existing asynchronous inference methods primarily optimize trajectory smoothness, but neglect the critical latency in reacting to environmental changes. By rethinking the notion of reaction in action chunking policies, this paper presents a systematic analysis of the factors governing reaction time. We show that reaction time follows a uniform distribution determined jointly by the Time to First Action (TTFA) and the execution horizon. Moreover, we reveal that the standard practice of applying a constant schedule in flow-based VLAs can be inefficient and forces the system to complete all sampling steps before any movement can start, forming the bottleneck in reaction latency. To overcome this issue, we propose Fast Action Sampling for ImmediaTE Reaction (FASTER). By introducing a Horizon-Aware Schedule, FASTER adaptively prioritizes near-term actions during flow sampling, compressing the denoising of the immediate reaction by tenfold (e.g., in $π_{0.5}$ and X-VLA) into a single step, while preserving the quality of long-horizon trajectory. Coupled with a streaming client-server pipeline, FASTER substantially reduces the effective reaction latency on real robots, especially when deployed on consumer-grade GPUs. Real-world experiments, including a highly dynamic table tennis task, prove that FASTER unlocks unprecedented real-time responsiveness for generalist policies, enabling rapid generation of accurate and smooth trajectories.
comment: Project page: https://innovator-zero.github.io/FASTER
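The abstract above states that reaction time is uniformly distributed and determined by the Time to First Action (TTFA) and the execution horizon. The sketch below works through one simple model consistent with that statement. It assumes that an environmental change arrives at a uniformly random moment within an execution window, that the policy only observes it at the start of the next inference call, and that the first new action appears one TTFA later. These bounds and the function names are assumptions; the paper's exact formulation may differ.

```python
def reaction_time_bounds(ttfa_s, horizon_s):
    """Best-case and worst-case reaction time under the assumed model.

    Best case: the change arrives just before inference starts (wait = 0).
    Worst case: it arrives just after, so a full horizon elapses first.
    """
    return ttfa_s, ttfa_s + horizon_s


def expected_reaction_time(ttfa_s, horizon_s):
    """Mean of a uniform distribution is the midpoint of its bounds."""
    lower, upper = reaction_time_bounds(ttfa_s, horizon_s)
    return 0.5 * (lower + upper)


# Hypothetical numbers: 125 ms to the first action, 500 ms execution horizon.
baseline = expected_reaction_time(ttfa_s=0.125, horizon_s=0.5)

# Cutting TTFA shifts the whole distribution; the horizon term is unchanged.
faster_first_action = expected_reaction_time(ttfa_s=0.0125, horizon_s=0.5)
print(baseline, faster_first_action)
```

Under this assumed model, reducing TTFA lowers every reaction time by the same amount, while the spread of the distribution is set by the execution horizon alone.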
Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden layer activations of the VLA. SAEs learn a sparse dictionary whose features act as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse toward VLA generalizability. We propose a metric to categorize features according to whether they represent generalizable transferable primitives or episode-specific memorization. We validate these findings through steering experiments on the LIBERO benchmark. We show that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning and can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn generalizable features across tasks and scenes. We observe that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization. In contrast, training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. We provide an open-source codebase and user-friendly interface for activation collection, SAE training, and feature steering. Our project page is located at http://drvla.github.io
comment: 25 pages, 12 figures
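As a rough illustration of the SAE encoding and feature steering described in the abstract above, the following minimal sketch uses a generic ReLU sparse autoencoder with an L1 sparsity penalty. The architecture, the dimensions, the random weights, and the steering rule (adding a scaled decoder direction to the activation) are all assumptions for illustration; the abstract does not specify the authors' design.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: map an activation vector to non-negative feature codes."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: reconstruct the activation from feature codes."""
    return f @ W_dec + b_dec

def steer(x, W_dec, feature_idx, alpha):
    """Add alpha times one feature's decoder direction to the activation (assumed steering rule)."""
    return x + alpha * W_dec[feature_idx]

# Hypothetical sizes and untrained random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, n_features = 8, 32
W_enc = 0.1 * rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = 0.1 * rng.normal(size=(n_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)              # stand-in for a hidden-layer activation
f = sae_encode(x, W_enc, b_enc)           # feature codes
x_hat = sae_decode(f, W_dec, b_dec)       # reconstruction
l1_coeff = 1e-3                           # assumed sparsity weight
loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
x_steered = steer(x, W_dec, feature_idx=3, alpha=2.0)
```

In a trained SAE the L1 term would push most entries of `f` to zero; the random weights here only show the data flow.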
ADMM-Based Distributed MPC with Control Barrier Functions for Safe Multi-Robot Quadrupedal Locomotion
This paper proposes a fully decentralized model predictive control (MPC) framework with control barrier function (CBF) constraints for safety-critical trajectory planning in multi-robot legged systems. The incorporation of CBF constraints introduces explicit inter-agent coupling, which prevents direct decomposition of the resulting optimal control problems. To address this challenge, we reformulate the centralized safety-critical MPC problem using a structured distributed optimization framework based on the alternating direction method of multipliers (ADMM). By introducing a novel node-edge splitting formulation with consensus constraints, the proposed approach decomposes the global problem into independent node-local and edge-local quadratic programs that can be solved in parallel using only neighbor-to-neighbor communication. This enables fully decentralized trajectory optimization with symmetric computational load across agents while preserving safety and dynamic feasibility. The proposed framework is integrated into a hierarchical locomotion control architecture for quadrupedal robots, combining high-level distributed trajectory planning, mid-level nonlinear MPC enforcing single rigid body dynamics, and low-level whole-body control enforcing full-order robot dynamics. The effectiveness of the proposed approach is demonstrated through hardware experiments on two Unitree Go2 quadrupedal robots and numerical simulations involving up to four robots navigating uncertain environments with rough terrain and external disturbances. The results show that the proposed distributed formulation achieves performance comparable to centralized MPC while reducing the average per-cycle planning time by up to 51% in the four-agent case, enabling efficient real-time decentralized implementation.
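To give a feel for the ADMM mechanics mentioned in the abstract above, here is a minimal sketch of textbook global-consensus ADMM (scaled dual form) on a toy scalar problem. It is not the paper's node-edge splitting: the local costs `0.5 * (x - a_i)**2`, the averaging step, and all names are illustrative assumptions, and the averaging here is global rather than neighbor-to-neighbor.

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=100):
    """Minimize sum_i 0.5 * (x_i - a_i)^2 subject to x_i = z for all i."""
    a = np.asarray(a, dtype=float)
    x = np.zeros_like(a)   # local copies, one per agent
    u = np.zeros_like(a)   # scaled dual variables
    z = 0.0                # consensus variable
    for _ in range(iters):
        # Local step: each agent solves its own small quadratic problem in closed form.
        x = (a + rho * (z - u)) / (1.0 + rho)
        # Consensus step: average of local copies plus duals.
        z = float(np.mean(x + u))
        # Dual step: accumulate the consensus violation.
        u = u + x - z
    return x, z

x, z = consensus_admm([1.0, 2.0, 6.0])
```

For this toy problem the consensus value converges to the mean of the `a_i`, which is the minimizer of the summed cost.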
Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these grounded outputs to produce metrically consistent, actionable decisions in 3D space. We evaluate MAPG on the HM-EQA benchmark and show consistent performance improvements over strong baselines. Furthermore, we introduce a new benchmark, MAPG-Bench, specifically designed to evaluate metric-semantic goal grounding, addressing a gap in existing language grounding evaluations. We also present a real-world robot demonstration showing that MAPG transfers beyond simulation when a structured scene representation is available.
comment: Equal contribution: Swagat Padhan and Lakshya Jain, 9 pages, 6 figures, paper website: https://lakshya-asu.github.io/Meanings-Measurements-Multi-Agent-Probabilistic-Grounding/
GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning
Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.
comment: Project page at https://vulab-ai.github.io/GSMem/
Introducing M: A Modular, Modifiable Social Robot
We present M, an open-source, low-cost social robot platform designed to reduce platform friction that slows social robotics research by making robots easier to reproduce, modify, and deploy in real-world settings. M combines a modular mechanical design, multimodal sensing, and expressive yet mechanically simple actuation architecture with a ROS2-native software package that cleanly separates perception, expression control, and data management. The platform includes a simulation environment with interface equivalence to hardware to support rapid sim-to-real transfer of interaction behaviors. We demonstrate extensibility through additional sensing/actuation modules and provide example interaction templates for storytelling and two-way conversational coaching. Finally, we report real-world use in participatory design and week-long in-home deployments, showing how M can serve as a practical foundation for longitudinal, reproducible social robotics research.
From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models
Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of "efficiency" in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.
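The embodied metrics named in the abstract above (completion time, cumulative joint rotation, jerk) can be estimated from a logged joint trajectory. The sketch below uses common finite-difference definitions; the exact formulas, the function name `embodied_metrics`, and the sample trajectory are assumptions, since the abstract does not define them.

```python
import numpy as np

def embodied_metrics(q, dt):
    """q: array of shape (T, n_joints) with joint positions sampled every dt seconds."""
    q = np.asarray(q, dtype=float)
    completion_time = (len(q) - 1) * dt
    # Total absolute joint travel, summed over time steps and joints.
    cumulative_rotation = float(np.abs(np.diff(q, axis=0)).sum())
    # Third finite difference approximates jerk (third time derivative of position).
    jerk = np.diff(q, n=3, axis=0) / dt**3
    mean_squared_jerk = float(np.mean(jerk**2)) if jerk.size else 0.0
    return {
        "completion_time": completion_time,
        "cumulative_rotation": cumulative_rotation,
        "mean_squared_jerk": mean_squared_jerk,
    }

# Hypothetical constant-velocity trajectory for two joints, sampled at 10 Hz.
t = np.arange(5) * 0.1
q = t[:, None] * np.array([1.0, -2.0])
metrics = embodied_metrics(q, dt=0.1)
```

A constant-velocity trajectory has near-zero jerk, so two policies with the same success rate can still be separated by `mean_squared_jerk` or `cumulative_rotation`.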
Tendon-Actuated Robots with a Tapered, Flexible Polymer Backbone: Design, Fabrication, and Modeling
This paper presents the design, modeling, and fabrication of 3D-printed, tendon-actuated continuum robots featuring a flexible, tapered backbone constructed from thermoplastic polyurethane (TPU). Our scalable design incorporates an integrated electronics base housing that enables direct tendon tension control and sensing via actuators and compression load cells. Unlike many continuum robots that are single-purpose and costly, the proposed design prioritizes customizability, rapid assembly, and low cost while enabling high curvature and enhanced distal compliance through geometric tapering, thereby supporting a broad range of compliant robotic inspection and manipulation tasks. We develop a generalized forward kinetostatic model of the tapered backbone based on Cosserat rod theory using a Newtonian approach, extending existing tendon-actuated Cosserat rod formulations to explicitly account for spatially varying backbone cross-sectional geometry. The model captures the graded stiffness profile induced by the tapering and enables systematic exploration of the configuration space as a function of the geometric design parameters. Specifically, we analyze how the backbone taper angle influences the robot's configuration space and manipulability. The model is validated against motion capture data, achieving centimeter-level shape prediction accuracy after calibrating Young's modulus via a line search that minimizes modeling error. We further demonstrate teleoperated grasping using an endoscopic gripper routed along the continuum robot, mounted on a 6-DoF robotic arm. Parameterized iLogic/CAD scripts are provided for rapid geometry generation and scaling. The presented framework establishes a simple, rapid, and reproducible pathway from parametric design to controlled tendon actuation for tapered, tendon-driven continuum robots manufactured using fused deposition modeling 3D printers.
Articulated-Body Dynamics Network: Dynamics-Grounded Prior for Robot Learning
Recent work in reinforcement learning has shown that incorporating structural priors for articulated robots, such as link connectivity, into policy networks improves learning efficiency. However, dynamics properties, despite their fundamental role in determining how forces and motion propagate through the body, remain largely underexplored as an inductive bias for policy learning. To address this gap, we present the Articulated-Body Dynamics Network (ABD-Net), a novel graph neural network architecture grounded in the computational structure of forward dynamics. Specifically, we adapt the inertia propagation mechanism from the Articulated Body Algorithm, systematically aggregating inertial quantities from child to parent links in a tree-structured manner, while replacing physical quantities with learnable parameters. Embedding ABD-Net into the policy actor enables dynamics-informed representations that capture how actions propagate through the body, leading to efficient and robust policy learning. Through experiments with simulated humanoid, quadruped, and hopper robots, our approach demonstrates increased sample efficiency and generalization to dynamics shifts compared to transformer-based and GNN baselines. We further validate the learned policy on real Unitree G1 and Go2 robots, state-of-the-art humanoid and quadruped platforms, generating dynamic, versatile and robust locomotion behaviors through sim-to-real transfer with real-time inference.
comment: Arxiv_r1
DROID-SLAM in the Wild CVPR 2026
We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.
comment: CVPR 2026, Project Page: https://moyangli00.github.io/droid-w/
CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem
Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO's key components, and real-world tests on a mobile robot platform demonstrate its practical applicability.
comment: 9 pages, 3 figures
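The two competing objectives named in the abstract above, total travel cost and makespan, can be computed for a candidate multi-agent solution as in the following minimal sketch. The single shared depot, the closed tours, and the weighted-sum scalarization by a preference vector are illustrative assumptions; the abstract does not state how CAMO scalarizes preferences.

```python
import math

def tour_length(depot, targets, order):
    """Length of a closed tour: depot -> targets in `order` -> depot."""
    if not order:
        return 0.0
    stops = [depot] + [targets[i] for i in order] + [depot]
    return sum(math.dist(p, q) for p, q in zip(stops, stops[1:]))

def momtsp_objectives(depot, targets, tours):
    """Return (total travel cost, makespan) for one tour per agent."""
    lengths = [tour_length(depot, targets, order) for order in tours]
    return sum(lengths), max(lengths)

def weighted_sum(objectives, preference):
    """Scalarize the objectives with a preference vector (assumed scalarization)."""
    return sum(w * f for w, f in zip(preference, objectives))

# Hypothetical instance: two agents, two targets, shared depot at the origin.
depot = (0.0, 0.0)
targets = [(3.0, 4.0), (0.0, 2.0)]
tours = [[0], [1]]  # agent 0 visits target 0, agent 1 visits target 1
total_cost, makespan = momtsp_objectives(depot, targets, tours)
score = weighted_sum((total_cost, makespan), (0.5, 0.5))
```

Sweeping the preference vector and keeping the non-dominated `(total_cost, makespan)` pairs yields an approximation of the Pareto front.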
Fire as a Service: Augmenting Robot Simulators with Thermally and Visually Accurate Fire Dynamics
Most existing robot simulators prioritize rigid-body dynamics and photorealistic rendering, but largely neglect the thermally and optically complex phenomena that characterize real-world fire environments. For robots envisioned as future firefighters, this limitation hinders both reliable capability evaluation and the generation of representative training data prior to deployment in hazardous scenarios. To address these challenges, we introduce Fire as a Service (FaaS), a novel, asynchronous co-simulation framework that augments existing robot simulators with high-fidelity and computationally efficient fire simulations. Our pipeline enables robots to experience accurate, multi-species thermodynamic heat transfer and visually consistent volumetric smoke without disrupting high-frequency rigid-body control loops. We demonstrate that our framework can be integrated with diverse robot simulators to generate physically accurate fire behavior, benchmark thermal hazards encountered by robotic platforms, and collect realistic multimodal perceptual data. Crucially, its real-time performance supports human-in-the-loop teleoperation, enabling the successful training of reactive, multimodal policies via Behavioral Cloning. By adding fire dynamics to robot simulations, FaaS provides a scalable pathway toward safer, more reliable deployment of robots in fire scenarios.
ATG-MoE: Autoregressive trajectory generation with mixture-of-experts for assembly skill learning
Flexible manufacturing requires robot systems that can adapt to constantly changing tasks, objects, and environments. However, traditional robot programming is labor-intensive and inflexible, while existing learning-based assembly methods often suffer from weak positional generalization, complex multi-stage designs, and limited multi-skill integration capability. To address these issues, this paper proposes ATG-MoE, an end-to-end autoregressive trajectory generation method with mixture of experts for assembly skill learning from demonstration. The proposed method establishes a closed-loop mapping from multi-modal inputs, including RGB-D observations, natural language instructions, and robot proprioception to manipulation trajectories. It integrates multi-modal feature fusion for scene and task understanding, autoregressive sequence modeling for temporally coherent trajectory generation, and a mixture-of-experts architecture for unified multi-skill learning. In contrast to conventional methods that separate visual perception and control or train different skills independently, ATG-MoE directly incorporates visual information into trajectory generation and supports efficient multi-skill integration within a single model. We train and evaluate the proposed method on eight representative assembly skills from a pressure-reducing valve assembly task. Experimental results show that ATG-MoE achieves strong overall performance in simulation, with an average grasp success rate of 96.3% and an average overall success rate of 91.8%, while also demonstrating strong generalization and effective multi-skill integration. Real-world experiments further verify its practicality for multi-skill industrial assembly. The project page can be found at https://hwh23.github.io/ATG-MoE
comment: 32 pages, 13 figures
MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction
We introduce MERGE, a system for situational grounding of actors, objects, and events in dynamic human-robot group interactions. Effective collaboration in such settings requires consistent situational awareness, built on persistent representations of people and objects and an episodic abstraction of events. MERGE achieves this by uniquely identifying physical instances of actors (humans or robots) and objects and structuring them into actor-action-object relations, ensuring temporal consistency across interactions. Central to MERGE is the integration of Vision-Language Models (VLMs) guided with a perception pipeline: a lightweight streaming module continuously processes visual input to detect changes and selectively invokes the VLM only when necessary. This decoupled design preserves the reasoning power and zero-shot generalization of VLMs while improving efficiency, avoiding both the high monetary cost and the latency of frame-by-frame captioning that leads to fragmented and delayed outputs. To address the absence of suitable benchmarks for multi-actor collaboration, we introduce the GROUND dataset, which offers fine-grained situational annotations of multi-person and human-robot interactions. On this dataset, our approach improves the average grounding score by a factor of 2 compared to the performance of VLM-only baselines - including GPT-4o, GPT-5 and Gemini 2.5 Flash - while also reducing run-time by a factor of 4. The code and data are available at www.github.com/HRI-EU/merge.
PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors
Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet effective design: (i) a parametric gait generator that supplies stable reference trajectories derived from motion capture without adversarial training, (ii) a GRU-based state estimator that infers terrain geometry directly from egocentric depth images via self-supervised heightmap reconstruction, and (iii) terrain-adaptive footstep rewards that guide foot placement toward traversable regions. Through systematic analysis of depth image resolution trade-offs, we identify configurations that maximize terrain fidelity under real-time constraints, substantially reducing perceptual overhead without degrading traversal performance. Comprehensive experiments across terrains of varying difficulty, including stairs, boxes, and gaps, demonstrate that each component yields complementary and essential performance gains, with the full framework achieving a 100% traversal success rate. We will open-source the complete PRIOR framework, including the training pipeline, parametric gait generator, and evaluation benchmarks, to serve as a reproducible foundation for humanoid locomotion research on Isaac Lab.
comment: https://prior-iros2026.github.io/
Lightweight Model Predictive Control for Spacecraft Rendezvous Attitude Synchronization
This work introduces two lightweight model predictive control (MPC) approaches for attitude tracking with reaction wheels during spacecraft rendezvous synchronization. Both approaches are based on a novel attitude deviation formulation, which enables the use of inherently linear constraints on angular velocity. We develop a single-loop and a dual-loop MPC; the latter embeds a stabilizing feedback controller within the inner loop, yielding a linear time-invariant system. Both controllers are implemented with CasADi - including automatic code generation - evaluated across various solvers, and validated within the Basilisk astrodynamics simulation framework. The experimental results demonstrate improved tracking accuracy alongside reductions in computational effort and memory consumption. Finally, embedded delivery to an ARM Cortex-M7 - representative of commercial off-the-shelf devices used in New Space platforms - confirms the real-time feasibility of these approaches and highlights their suitability for onboard attitude control in resource-constrained spacecraft rendezvous missions.
comment: Accepted at European Control Conference (ECC 2026)
Safety-Guaranteed Imitation Learning from Nonlinear Model Predictive Control for Spacecraft Close Proximity Operations
This paper presents a safety-guaranteed, runtime-efficient imitation learning framework for spacecraft close proximity control. We leverage Control Barrier Functions (CBFs) for safety certificates and Control Lyapunov Functions (CLFs) for stability as unified design principles across data generation, training, and deployment. First, a nonlinear Model Predictive Control (NMPC) expert enforces CBF constraints to provide safe reference trajectories. Second, we train a neural policy with a novel CBF-CLF-informed loss and DAgger-like rollouts with curriculum weighting, promoting data-efficiency and reducing future safety filter interventions. Third, at deployment a lightweight one-step CBF-CLF quadratic program minimally adjusts the learned control input to satisfy hard safety constraints while encouraging stability. We validate the approach for ESA-compliant close proximity operations, including fly-around with a spherical keep-out zone and final approach inside a conical approach corridor, using the Basilisk high-fidelity simulator with nonlinear dynamics and perturbations. Numerical experiments indicate stable convergence to decision points and strict adherence to safety under the filter, with task performance comparable to the NMPC expert while significantly reducing online computation. A runtime analysis demonstrates real-time feasibility on a commercial off-the-shelf processor, supporting onboard deployment for safety-critical on-orbit servicing.
comment: Accepted at European Control Conference (ECC 2026)
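In the special case of a single affine CBF constraint and no CLF term, a minimum-deviation safety filter of the kind described in the abstract above has a well-known closed-form solution, sketched below. The single-integrator dynamics, the spherical keep-out barrier `h(x) = ||x - c||^2 - r^2`, the gain `alpha`, and all names are assumptions for illustration; the paper's CBF-CLF quadratic program is more general.

```python
import numpy as np

def cbf_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a @ u + b >= 0 in closed form."""
    slack = a @ u_nom + b
    if slack >= 0.0:
        return u_nom.copy()              # nominal input already satisfies the constraint
    return u_nom - (slack / (a @ a)) * a  # project onto the constraint boundary

# Assumed toy model: x_dot = u, keep-out sphere of radius r centered at c.
x = np.array([2.0, 0.0])
c = np.array([0.0, 0.0])
r, alpha = 1.0, 1.0
h = np.dot(x - c, x - c) - r**2           # barrier value, positive when outside the sphere
a = 2.0 * (x - c)                         # gradient of h; for x_dot = u this is L_g h
b = alpha * h                             # L_f h = 0 for this model, so b = alpha * h

u_nom = np.array([-2.0, 0.5])             # stand-in for the learned policy output
u_safe = cbf_filter(u_nom, a, b)
```

The filter leaves the input untouched when the constraint already holds and otherwise changes only the component along `a`, which is what "minimally adjusts" means in this simplified setting.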
Unlabeled Multi-Robot Motion Planning with Improved Separation Trade-offs
We study unlabeled multi-robot motion planning for unit-disk robots in a polygonal environment. Although the problem is hard in general, polynomial-time solutions exist under appropriate separation assumptions on start and target positions. Banyassady et al. (SoCG'22) guarantee feasibility in simple polygons under start--start and target--target distances of at least $4$, and start--target distances of at least $3$, but without optimality guarantees. Solovey et al. (RSS'15) provide a near-optimal solution in general polygonal domains, under stricter conditions: start/target positions must have pairwise distance at least $4$, and at least $\sqrt{5}\approx2.236$ from obstacles. This raises the question of whether polynomial-time algorithms can be obtained in even more densely packed environments. In this paper we present a generalized algorithm that achieves different trade-offs on the robots-separation and obstacles-separation bounds, all significantly improving upon the state of the art. Specifically, we obtain polynomial-time constant-approximation algorithms to minimize the total path length when (i) the robots-separation is $2\tfrac{2}{3}$ and the obstacles-separation is $1\tfrac{2}{3}$, or (ii) the robots-separation is $\approx3.291$ and the obstacles-separation $\approx1.354$. Additionally, we introduce a different strategy yielding a polynomial-time solution when the robots-separation is only $2$, and the obstacles-separation is $3$. Finally, we show that without any robots-separation assumption, obstacles-separation of at least $1.5$ may be necessary for a solution to exist.
Real-Time Optical Communication Using Event-Based Vision with Moving Transmitters IROS 2026
In multi-robot systems, traditional radio frequency (RF) communication struggles with contention and jamming. Optical communication offers a strong alternative. However, conventional frame-based cameras suffer from limited frame rates, motion blur, and reduced robustness under high dynamic range lighting. Event cameras support microsecond temporal resolution and high dynamic range, making them extremely sensitive to scene changes under fast relative motion with an optical transmitter. Leveraging these strengths, we develop a complete optical communication system capable of tracking moving transmitters and decoding messages in real time. Our system achieves over $95\%$ decoding accuracy for text transmission during motion by implementing a Geometry-Aware Unscented Kalman Filter (GA-UKF), achieving 7x faster processing speed compared to the previous state-of-the-art method, while maintaining equivalent tracking accuracy at transmitting frequencies $\geq$ 1 kHz.
comment: 8 pages, 7 Figures, Submitted to IROS 2026 - Under Review
Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification
Robotic path planning problems are often NP-hard, and practical solutions typically rely on approximation algorithms with provable performance guarantees for general cases. While designing such algorithms is challenging, formally proving their approximation optimality is even more demanding, which requires domain-specific geometric insights and multi-step mathematical reasoning over complex operational constraints. Recent Large Language Models (LLMs) have demonstrated strong performance on mathematical reasoning benchmarks, yet their ability to assist with research-level optimality proofs in robotic path planning remains under-explored. In this work, we introduce the first benchmark for evaluating LLMs on approximation-ratio proofs of robotic path planning algorithms. The benchmark consists of 34 research-grade proof tasks spanning diverse planning problem types and complexity levels, each requiring structured reasoning over algorithm descriptions, problem constraints, and theoretical guarantees. Our evaluation of state-of-the-art proprietary and open-source LLMs reveals that even the strongest models struggle to produce fully valid proofs without external domain knowledge. However, providing LLMs with task-specific in-context lemmas substantially improves reasoning quality, a factor that is more effective than generic chain-of-thought prompting or supplying the ground-truth approximation ratio as posterior knowledge. We further provide fine-grained error analysis to characterize common logical failures and hallucinations, and demonstrate how each error type can be mitigated through targeted context augmentation.
Exact and Approximate Convex Reformulation of Linear Stochastic Optimal Control with Chance Constraints
In this paper, we present an equivalent convex optimization formulation for discrete-time stochastic linear systems subject to linear chance constraints, alongside a tight convex relaxation for quadratic chance constraints. By lifting the state vector to encode moment information explicitly, the formulation captures linear chance constraints on states and controls across multiple time steps exactly, without conservatism, yielding strict improvements in both feasibility and optimality. For quadratic chance constraints, we derive convex approximations that are provably less conservative than existing methods. We validate the framework on minimum-snap trajectory generation for a quadrotor, demonstrating that the proposed approach remains feasible at noise levels an order of magnitude beyond the operating range of prior formulations.
comment: Under Review
A Closed-Form CLF-CBF Controller for Whole-Body Continuum Soft Robot Collision Avoidance
Safe operation is essential for deploying robots in human-centered 3D environments. Soft continuum manipulators provide passive safety through mechanical compliance, but still require active control to achieve reliable collision avoidance. Existing approaches, such as sampling-based planning, are often computationally expensive and lack formal safety guarantees, which limits their use for real-time whole-body avoidance. This paper presents a closed-form Control Lyapunov Function--Control Barrier Function (CLF--CBF) controller for real-time 3D obstacle avoidance in soft continuum manipulators without online optimization. By analytically embedding safety constraints into the control input, the proposed method ensures stability and safety under the stated modeling assumptions, while avoiding feasibility issues commonly encountered in online optimization-based methods. The resulting controller is up to $10\times$ faster than standard CLF--CBF quadratic-programming approaches and up to $100\times$ faster than traditional sampling-based planners. Simulation and hardware experiments on a tendon-driven soft manipulator demonstrate accurate 3D trajectory tracking and robust obstacle avoidance in cluttered environments. These results show that the proposed framework provides a scalable and provably safe control strategy for soft robots operating in dynamic, safety-critical settings.
Speculative Policy Orchestration: A Latency-Resilient Framework for Cloud-Robotic Manipulation
Cloud robotics enables robots to offload high-dimensional motion planning and reasoning to remote servers. However, for continuous manipulation tasks requiring high-frequency control, network latency and jitter can severely destabilize the system, causing command starvation and unsafe physical execution. To address this, we propose Speculative Policy Orchestration (SPO), a latency-resilient cloud-edge framework. SPO utilizes a cloud-hosted world model to pre-compute and stream future kinematic waypoints to a local edge buffer, decoupling execution frequency from network round-trip time. To mitigate unsafe execution caused by predictive drift, the edge node employs an $ε$-tube verifier that strictly bounds kinematic execution errors. The framework is coupled with an Adaptive Horizon Scaling mechanism that dynamically expands or shrinks the speculative pre-fetch depth based on real-time tracking error. We evaluate SPO on continuous RLBench manipulation tasks under emulated network delays. Results show that even when deployed with learned models of modest accuracy, SPO reduces network-induced idle time by over 60% compared to blocking remote inference. Furthermore, SPO discards approximately 60% fewer cloud predictions than static caching baselines. Ultimately, SPO enables fluid, real-time cloud-robotic control while maintaining bounded physical safety.
comment: 9 pages, 7 figures, conference submission
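The edge-side loop described in the SPO abstract above (speculative buffer, $ε$-tube check, adaptive pre-fetch depth) can be sketched as follows. This is a hypothetical illustration: the infinity-norm tube, the halve/double horizon rule, and all names (`tracking_error`, `next_horizon`, `run_edge_step`) are assumptions, not the paper's implementation.

```python
from collections import deque


def tracking_error(predicted, measured):
    """Infinity-norm distance between predicted and measured states."""
    return max(abs(p - m) for p, m in zip(predicted, measured))


def next_horizon(horizon, error, eps, h_min=1, h_max=16):
    """Assumed rule: shrink the pre-fetch depth on a tube violation,
    grow it when tracking is comfortably inside the tube."""
    if error > eps:
        return max(h_min, horizon // 2)
    if error < 0.5 * eps:
        return min(h_max, horizon * 2)
    return horizon


def run_edge_step(buffer, measured, eps, horizon):
    """Pop one speculative (predicted_state, command) pair and verify it.

    Returns (command, new_horizon). The command is None when the buffer is
    empty or when the prediction falls outside the eps-tube, in which case
    the remaining speculative waypoints are discarded.
    """
    if not buffer:
        return None, horizon  # starvation: caller must wait for the cloud
    predicted, command = buffer.popleft()
    err = tracking_error(predicted, measured)
    if err > eps:
        buffer.clear()
        return None, next_horizon(horizon, err, eps)
    return command, next_horizon(horizon, err, eps)


buffer = deque([((0.0, 0.0), "a"), ((1.0, 1.0), "b")])
first = run_edge_step(buffer, (0.01, 0.0), eps=0.1, horizon=4)
second = run_edge_step(buffer, (1.5, 1.0), eps=0.1, horizon=first[1])
```

In this sketch a rejected prediction flushes the whole buffer, so the robot falls back to blocking on the cloud until fresh waypoints arrive.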
SOFTMAP: Sim2Real Soft Robot Forward Modeling via Topological Mesh Alignment and Physics Prior
While soft robot manipulators offer compelling advantages over rigid counterparts, including inherent compliance, safe human-robot interaction, and the ability to conform to complex geometries, accurate forward modeling from low-dimensional actuation commands remains an open challenge due to nonlinear material phenomena such as hysteresis and manufacturing variability. We present SOFTMAP, a sim-to-real learning framework for real-time 3D forward modeling of tendon-actuated soft finger manipulators. SOFTMAP combines four components: (1) As-Rigid-As-Possible (ARAP)-based topological alignment that projects simulated and real point clouds into a shared, topologically consistent vertex space; (2) a lightweight MLP forward model pretrained on simulation data to map servo commands to full 3D finger geometry; (3) a residual correction network trained on a small set of real observations to predict per-vertex displacement fields that compensate for sim-to-real discrepancies; and (4) a closed-form linear actuation calibration layer enabling real-time inference at 30 FPS. We evaluate SOFTMAP on both simulated and physical hardware, achieving state-of-the-art shape prediction accuracy with a Chamfer distance of 0.389 mm in simulation and 3.786 mm on hardware, millimeter-level fingertip trajectory tracking across multiple target paths, and a 36.5% improvement in teleoperation task success over the baseline. Our results show that SOFTMAP provides a data-efficient approach for 3D forward modeling and control of soft manipulators.
VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models
Video action models are an appealing foundation for Vision--Language--Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing that can be amplified by downstream policies. We propose VAMPO, a post-training framework that directly improves visual dynamics in video action models through policy optimization. Our key idea is to formulate multi-step denoising as a sequential decision process and optimize the denoising policy with rewards defined over expert visual dynamics in latent space. To make this optimization practical, we introduce an Euler Hybrid sampler that injects stochasticity only at the first denoising step, enabling tractable low-variance policy-gradient estimation while preserving the coherence of the remaining denoising trajectory. We further combine this design with GRPO and a verifiable non-adversarial reward. Across diverse simulated and real-world manipulation tasks, VAMPO improves task-relevant visual dynamics, leading to better downstream action generation and stronger generalization. The homepage is https://vampo-robot.github.io/VAMPO/.
Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning
Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
comment: 14 pages, 6 figures, under review
From Vocal Instructions to Household Tasks: The Inria TIAGo++ in the euROBIN Service Robots Coopetition
This paper describes the Inria team's integrated robotics system used in the 1st euROBIN coopetition, during which service robots performed voice-activated household tasks in a kitchen setting. The team developed a modified TIAGo++ platform that leverages a whole-body control stack for autonomous and teleoperated modes, and an LLM-based pipeline for instruction understanding and task planning. The key contributions (open-sourced) are the integration of these components and the design of custom teleoperation devices, addressing practical challenges in the deployment of service robots.
TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
Recent Vision-Language-Action models show potential to generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to Unitree G1 with a whole-body controller trained via a heuristic-enhanced harmonized online DAgger to lift low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection on the target humanoid domain. Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action space understanding and zero-shot skill transfer capabilities. For more details, please refer to our project page: https://jiachengliu3.github.io/TrajBooster/.
Accelerated Multi-Modal Motion Planning Using Context-Conditioned Diffusion Models ICRA 2026
Classical methods in robot motion planning, such as sampling-based and optimization-based methods, often struggle with scalability towards higher-dimensional state spaces and complex environments. Diffusion models, known for their capability to learn complex, high-dimensional and multi-modal data distributions, provide a promising alternative when applied to motion planning problems and have already shown interesting results. However, most of the current approaches train their model for a single environment, limiting their generalization to environments not seen during training. The techniques that do train a model for multiple environments rely on a specific camera to provide the model with the necessary environmental information and therefore always require that sensor. To effectively adapt to diverse scenarios without the need for retraining, this research proposes Context-Aware Motion Planning Diffusion (CAMPD). CAMPD leverages a classifier-free denoising probabilistic diffusion model, conditioned on sensor-agnostic contextual information. An attention mechanism, integrated in the well-known U-Net architecture, conditions the model on an arbitrary number of contextual parameters. CAMPD is evaluated on a 7-DoF robot manipulator and benchmarked against state-of-the-art approaches on real-world tasks, showing its ability to generalize to unseen environments and generate high-quality, multi-modal trajectories, at a fraction of the time required by existing methods.
comment: Accepted for publication at the 2026 IEEE International Conference on Robotics & Automation (ICRA 2026)
RoboForge: Physically Optimized Text-guided Whole-Body Locomotion for Humanoids
While generative models have become effective at producing human-like motions from text, transferring these motions to humanoid robots for physical execution remains challenging. Existing pipelines are often limited by retargeting, where kinematic quality is undermined by physical infeasibility, contact-transition errors, and the high cost of real-world dynamical data. We present a unified latent-driven framework that bridges natural language and whole-body humanoid locomotion through a retarget-free, physics-optimized pipeline. Rather than treating generation and control as separate stages, our key insight is to couple them bidirectionally under physical constraints. We introduce a Physical Plausibility Optimization (PP-Opt) module as the coupling interface. In the forward direction, PP-Opt refines a teacher-student distillation policy with a plausibility-centric reward to suppress artifacts such as floating, skating, and penetration. In the backward direction, it converts reward-optimized simulation rollouts into high-quality explicit motion data, which is used to fine-tune the motion generator toward a more physically plausible latent distribution. This bidirectional design forms a self-improving cycle: the generator learns a physically grounded latent space, while the controller learns to execute latent-conditioned behaviors with dynamical integrity. Extensive experiments on the Unitree G1 humanoid show that our bidirectional optimization improves tracking accuracy and success rates. Across IsaacLab and MuJoCo, the implicit latent-driven pipeline consistently outperforms conventional explicit retargeting baselines in both precision and stability. By coupling diffusion-based motion generation with physical plausibility optimization, our framework provides a practical path toward deployable text-guided humanoid intelligence.
comment: 10 pages, 5 figures
TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and insufficient real-world interaction. While online reinforcement learning (RL) has shown promise in improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution of supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative RL framework designed to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between real and simulated environments. During the SFT warm-up stage, we introduce an exploration space expansion strategy using digital twins to broaden the support of the data trajectory distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy to further accelerate online RL. Specifically, TwinRL performs efficient and parallel online RL in the digital twin prior to deployment, effectively bridging the gap between offline and online training stages. Subsequently, we exploit efficient digital twin sampling to identify failure-prone yet informative configurations, which are used to guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches 100% success in both in-distribution regions covered by real-world demonstrations and out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods and requiring only about 20 minutes on average across four tasks.
FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis
Due to the deformability of garments, generating a large amount of high-quality data for robotic garment manipulation tasks is highly challenging. In this paper, we present a synthetic garment dataset that can be used for robotic garment folding. We begin by constructing geometric garment templates based on keypoints and applying generative models to generate realistic texture patterns. Leveraging these keypoint annotations, we generate folding demonstrations in simulation and train folding policies via closed-loop imitation learning. To improve robustness, we propose KG-DAgger, which uses a keypoint-based strategy to generate demonstration data for recovering from failures. KG-DAgger significantly improves the model performance, boosting the real-world success rate by 25\%. After training with 15K trajectories (about 2M image-action pairs), the model achieves a 75\% success rate in the real world. Experiments in both simulation and real-world settings validate the effectiveness of our proposed framework.
comment: Project: https://pku-epic.github.io/FoldNet/
Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models
Assembly hinges on reliably forming connections between parts; yet most robotic approaches plan assembly sequences and part poses while treating connectors as an afterthought. Connections represent the foundational physical constraints of assembly execution; while task planning sequences operations, the precise establishment of these constraints ultimately determines assembly success. In this paper, we treat connections as explicit, primary entities in assembly representation, directly encoding connector types, specifications, and locations for every assembly step. Drawing inspiration from how humans learn assembly tasks through step-by-step instruction manuals, we present Manual2Skill++, a vision-language framework that automatically extracts structured connection information from assembly manuals. We encode assembly tasks as hierarchical graphs where nodes represent parts and sub-assemblies, and edges explicitly model connection relationships between components. A large-scale vision-language model parses symbolic diagrams and annotations in manuals to instantiate these graphs, leveraging the rich connection knowledge embedded in human-designed instructions. We curate a dataset containing over 20 assembly tasks with diverse connector types to validate our representation extraction approach, and evaluate the complete task understanding-to-execution pipeline across four complex assembly scenarios in simulation, spanning furniture, toys, and manufacturing components with real-world correspondence. More detailed information can be found at https://nus-lins-lab.github.io/Manual2SkillPP/
AdaptPNP: Integrating Prehensile and Non-Prehensile Skills for Adaptive Robotic Manipulation
Non-prehensile (NP) manipulation, in which robots alter object states without forming stable grasps (for example, pushing, poking, or sliding), significantly broadens robotic manipulation capabilities when grasping is infeasible or insufficient. However, enabling a unified framework that generalizes across different tasks, objects, and environments while seamlessly integrating non-prehensile and prehensile (P) actions remains challenging: robots must determine when to invoke NP skills, select the appropriate primitive for each context, and compose P and NP strategies into robust, multi-step plans. We introduce AdaptPNP, a vision-language model (VLM)-empowered task and motion planning framework that systematically selects and combines P and NP skills to accomplish diverse manipulation objectives. Our approach leverages a VLM to interpret visual scene observations and textual task descriptions, generating a high-level plan skeleton that prescribes the sequence and coordination of P and NP actions. A digital-twin based object-centric intermediate layer predicts desired object poses, enabling proactive mental rehearsal of manipulation sequences. Finally, a control module synthesizes low-level robot commands, with continuous execution feedback enabling online task plan refinement and adaptive replanning through the VLM. We evaluate AdaptPNP across representative P&NP hybrid manipulation tasks in both simulation and real-world environments. These results underscore the potential of hybrid P&NP manipulation as a crucial step toward general-purpose, human-level robotic manipulation capabilities. Project Website: https://adaptpnp.github.io/
U-ARM: Ultra low-cost general teleoperation interface for robot manipulation
We propose U-Arm, a low-cost and rapidly adaptable leader-follower teleoperation framework designed to interface with most commercially available robotic arms. Our system supports teleoperation through three structurally distinct 3D-printed leader arms that share consistent control logic, enabling seamless compatibility with diverse commercial robot configurations. Compared with previous open-source leader-follower interfaces, we further optimized both the mechanical design and servo selection, achieving a bill of materials (BOM) cost of only \$50.5 for the 6-DoF leader arm and \$56.8 for the 7-DoF version. To enhance usability, we mitigate the common challenge in controlling redundant degrees of freedom by mechanical and control optimizations. Experimental results demonstrate that U-Arm achieves 39\% higher data collection efficiency and comparable task success rates across multiple manipulation scenarios compared with Joycon, another low-cost teleoperation interface. We have open-sourced all CAD models of the three configurations and also provided simulation support for validating teleoperation workflows. We also open-sourced real-world manipulation data collected with U-Arm. The project website is https://github.com/MINT-SJTU/LeRobot-Anything-U-Arm.
Aegis: Automated Error Generation and Attribution for Multi-Agent Systems
Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi-agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using an LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging fine-grained labels and the structured arrangement of positive-negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/Aegis-Website/.
RhoMorph: Rhombus-shaped Deformable Modular Robots for Stable, Medium-Independent Reconfiguration Motion
In this paper, we present RhoMorph, a novel deformable planar lattice modular self-reconfigurable robot (MSRR) with a rhombus-shaped module. Each module consists of a parallelogram skeleton with a single centrally mounted actuator that enables folding and unfolding along its diagonal. The core design philosophy is to achieve essential MSRR functionalities such as morphing, docking, and locomotion with minimal control complexity. This enables a continuous and stable reconfiguration process that is independent of the surrounding medium, allowing the system to reliably form various configurations in diverse environments. To leverage the unique kinematics of RhoMorph, we introduce morphpivoting, a novel motion primitive for reconfiguration that differs from advanced MSRR systems, and propose a strategy for its continuous execution. Finally, a series of physical experiments validate the module's stable reconfiguration ability, as well as its positional and docking accuracy.
Whole-Body Safe Control of Robotic Systems with Koopman Neural Dynamics
Controlling robots with strongly nonlinear, high-dimensional dynamics remains challenging, as direct nonlinear optimization with safety constraints is often intractable in real time. The Koopman operator offers a way to represent nonlinear systems linearly in a lifted space, enabling the use of efficient linear control. We propose a data-driven framework that learns a Koopman embedding and operator from data, and integrates the resulting linear model with the Safe Set Algorithm (SSA). This allows the tracking and safety constraints to be solved in a single quadratic program (QP), ensuring feasibility and optimality without a separate safety filter. We validate the method on a Kinova Gen3 manipulator and a Go2 quadruped, showing accurate tracking and obstacle avoidance.
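The lifted linear model and single-QP formulation mentioned in the abstract above can be written out as a short worked sketch. The symbols here are assumptions for illustration: $\psi$ is the learned embedding, $(A, B, C)$ are the learned lifted dynamics and decoder, and $L(z_k)\,u \le s(z_k)$ stands in for a safety constraint that is affine in the input; the paper's exact cost and constraint may differ.

```latex
\begin{align}
  z_k &= \psi(x_k), \qquad z_{k+1} = A z_k + B u_k, \qquad x_k \approx C z_k, \\
  u_k^\star &= \arg\min_{u} \; \bigl\lVert C\,(A z_k + B u) - x^{\mathrm{ref}}_{k+1} \bigr\rVert_2^2
  \quad \text{s.t.} \quad L(z_k)\, u \le s(z_k).
\end{align}
```

Because the lifted dynamics are linear in $u$, the tracking cost is quadratic and the constraint is affine, so both can sit in the same quadratic program.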
From Optimizable to Interactable: Mixed Digital Twin-Empowered Testing of Vehicle-Infrastructure Cooperation Systems
Sufficient testing under corner cases is critical for the long-term operation of vehicle-infrastructure cooperation systems (VICS). However, existing corner-case generation methods are primarily AI-driven, and VICS testing under corner cases is typically limited to simulation. In this paper, we introduce an L5 "Interactable" level to the VICS digital twin (VICS-DT) taxonomy, extending beyond the conventional L4 "Optimizable" level. We further propose an L5-level VICS testing framework, IMPACT (Interactive Mixed-digital-twin Paradigm for Advanced Cooperative vehicle-infrastructure Testing). By enabling direct human interactions with VICS entities, IMPACT incorporates highly uncertain and unpredictable human behaviors into the testing loop, naturally generating high-quality corner cases that complement AI-based methods. Furthermore, the mixedDT-enabled "Physical-Virtual Action Interaction" facilitates safe VICS testing under corner cases, incorporating real-world environments and entities rather than purely in simulation. Finally, we implement IMPACT on the I-VIT (Interactive Vehicle-Infrastructure Testbed), and experiments demonstrate its effectiveness. The experimental videos are available at our project website: https://dongjh20.github.io/IMPACT.
Fast Confidence-Aware Human Prediction via Hardware-accelerated Bayesian Inference for Safe Robot Navigation
As robots increasingly integrate into everyday environments, ensuring their safe navigation around humans becomes imperative. Efficient and safe motion planning requires robots to account for human behavior, particularly in constrained spaces such as grocery stores or care homes, where interactions with multiple individuals are common. Prior research has employed Bayesian frameworks to model human rationality based on navigational intent, enabling the prediction of probabilistic trajectories for planning purposes. In this work, we present a simple yet novel approach for confidence-aware prediction that treats future predictions as particles. This framework is highly parallelized and accelerated on a graphics processing unit (GPU). As a result, this enables longer-term predictions at a frequency of 125 Hz and can be easily extended for multi-human predictions. Compared to existing methods, our implementation supports finer prediction time steps, yielding more granular trajectory forecasts. This enhanced resolution allows motion planners to respond effectively to subtle changes in human behavior. We validate our approach through real-world experiments, demonstrating a robot safely navigating among multiple humans with diverse navigational goals. Our results highlight the method's potential for robust and efficient human-robot coexistence in dynamic environments.
comment: Update the paper
Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies
Deploying foundation models in embodied edge systems is fundamentally a systems problem, not just a problem of model compression. Real-time control must operate within strict size, weight, and power constraints, where memory traffic, compute latency, timing variability, and safety margins interact directly. The Deployment Gauntlet organizes these constraints into eight coupled barriers that determine whether embodied foundation models can run reliably in practice. Across representative edge workloads, autoregressive Vision-Language-Action policies are constrained primarily by memory bandwidth, whereas diffusion-based controllers are limited more by compute latency and sustained execution cost. Reliable deployment therefore depends on system-level co-design across memory, scheduling, communication, and model architecture, including decompositions that separate fast control from slower semantic reasoning.
Agentic Vehicles for Human-Centered Mobility: Definition, Prospects, and System Implications
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as systems that perceive their environment and execute pre-programmed tasks independently of external input, consistent with the SAE levels of automated driving. Yet recent research and real-world deployments have begun to showcase vehicles that exhibit behaviors outside the scope of this definition. These include natural language interaction with humans, goal adaptation, contextual reasoning, external tool use, and the handling of unforeseen ethical dilemmas, enabled in part by multimodal large language models (LLMs). These developments highlight not only a gap between technical autonomy and the broader cognitive and social capacities required for human-centered mobility, but also the emergence of a form of vehicle intelligence that currently lacks a clear designation. To address this gap, the paper introduces the concept of agentic vehicles (AgVs): vehicles that exhibit agency, the capacity for goal-driven reasoning, strategic adaptation, self-reflection, and purposeful engagement with complex environments. We conclude by outlining key challenges in the development and governance of AgVs and their potential role in shaping future agentic transportation systems that align with user and societal needs.
Path Integral Particle Filtering for Hybrid Systems via Saltation Matrices
We present an optimal-control-based particle filtering method for state estimation in hybrid systems that undergo intermittent contact with their environments. We follow the path integral filtering framework that exploits the duality between the smoothing problem and optimal control. We leverage saltation matrices to map out the uncertainty propagation during contact events for hybrid systems. The resulting path integral optimal control problem allows for a state estimation algorithm robust to outlier effects, flexible to non-Gaussian noise distributions, that also handles the challenging contact dynamics in hybrid systems. This work offers a computationally efficient and reliable estimation algorithm for hybrid systems with stochastic dynamics. We also present extensive experimental results demonstrating that our approach consistently outperforms strong baselines across multiple settings.
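For readers unfamiliar with saltation matrices, the first-order uncertainty map across a contact event referred to in the abstract above is commonly written as follows in the hybrid-systems literature. The notation is an assumption for illustration: $g$ is the guard, $R$ the reset map, and $F^{-}$, $F^{+}$ the vector fields just before and just after the event; the paper may use different symbols.

```latex
\begin{align}
  \Xi &= D_x R + \frac{\bigl(F^{+} - D_x R\, F^{-} - D_t R\bigr)\, D_x g}{D_t g + D_x g\, F^{-}}, \\
  \Sigma^{+} &= \Xi\, \Sigma^{-}\, \Xi^{\top}.
\end{align}
```

The second line is the first-order covariance update through the event, which plays the role that the ordinary reset Jacobian $D_x R$ alone would play if the timing of the event did not depend on the state.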
HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation
Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.
AI-driven Dispensing of Coral Reseeding Devices for Broad-scale Restoration of the Great Barrier Reef
Coral reefs are on the brink of collapse, with climate change, ocean acidification, and pollution leading to a projected 70-90% loss of coral species within the next decade. Reef restoration is crucial, but its success hinges on introducing automation to upscale efforts. In this work, we present a highly configurable AI pipeline for the real-time deployment of coral reseeding devices. The pipeline consists of three core components: (i) the image labeling scheme, designed to address data availability and reduce the cost of expert labeling; (ii) the classifier which performs automated analysis of underwater imagery, at the image or patch-level, while also enabling quantitative coral coverage estimation; and (iii) the decision-making module that determines whether deployment should occur based on the classifier's analysis. By reducing reliance on manual experts, our proposed pipeline increases operational range and efficiency of reef restoration. We validate the proposed pipeline at five sites across the Great Barrier Reef, benchmarking its performance against annotations from expert marine scientists. The pipeline achieves 77.8% deployment accuracy, 89.1% accuracy for sub-image patch classification, and real-time model inference at 5.5 frames per second on a Jetson Orin. To address the limited availability of labeled data in this domain and encourage further research, we publicly release a comprehensive, annotated dataset of substrate imagery from the surveyed sites.
comment: 8 pages, 5 figures
2-D Directed Formation Control Based on Bipolar Coordinates
This work proposes a novel 2-D formation control scheme for acyclic triangulated directed graphs (a class of minimally acyclic persistent graphs) based on bipolar coordinates with (almost) global convergence to the desired shape. Prescribed performance control is employed to devise a decentralized control law that avoids singularities and introduces robustness against external disturbances while ensuring predefined transient and steady-state performance for the closed-loop system. Furthermore, it is shown that the proposed formation control scheme can handle formation maneuvering, scaling, and orientation specifications simultaneously. Additionally, the proposed control law is implementable in agents' arbitrarily oriented local coordinate frames using only low-cost onboard vision sensors, which are favorable for practical applications. Finally, a formation maneuvering simulation study verifies the proposed approach.
comment: 16 pages, 10 figures; minor typos corrected; no change in results
UDON: Uncertainty-weighted Distributed Optimization for Multi-Robot Neural Implicit Mapping under Extreme Communication Constraints ICRA 2026
Multi-robot mapping with neural implicit representations enables the compact reconstruction of complex environments. However, it demands robustness against communication challenges like packet loss and limited bandwidth. While prior works have introduced various mechanisms to mitigate communication disruptions, performance degradation still occurs under extremely low communication success rates. This paper presents UDON, a real-time multi-agent neural implicit mapping framework that introduces a novel uncertainty-weighted distributed optimization to achieve high-quality mapping under severe communication deterioration. The uncertainty weighting prioritizes more reliable portions of the map, while the distributed optimization isolates and penalizes mapping disagreement between individual pairs of communicating agents. We conduct extensive experiments on standard benchmark datasets and real-world robot hardware. We demonstrate that UDON significantly outperforms existing baselines, maintaining high-fidelity reconstructions and consistent scene representations even under extreme communication degradation (as low as 1% success rate).
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA 2026)
Direct Data-Driven Predictive Control for a Three-dimensional Cable-Driven Soft Robotic Arm
Soft robots offer significant advantages in safety and adaptability, yet achieving precise and dynamic control remains a major challenge due to their inherently complex and nonlinear dynamics. Recently, Data-enabled Predictive Control (DeePC) has emerged as a promising model-free approach that bypasses explicit system identification by directly leveraging input-output data. While DeePC has shown success in other domains, its application to soft robots remains underexplored, particularly for three-dimensional (3D) soft robotic systems. This paper addresses this gap by developing and experimentally validating an effective DeePC framework on a 3D, cable-driven soft arm. Specifically, we design and fabricate a soft robotic arm with a thick tubing backbone for stability, a dense silicone body with large cavities for strength and flexibility, and rigid endcaps for secure termination. Using this platform, we implement DeePC with singular value decomposition (SVD)-based dimension reduction for two key control tasks: fixed-point regulation and trajectory tracking in 3D space. Comparative experiments with a baseline model-based controller demonstrate DeePC's superior accuracy, robustness, and adaptability, highlighting its potential as a practical solution for dynamic control of soft robots.
Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions
Task and Motion Planning combines high-level task sequencing (what to do) with low-level motion planning (how to do it) to generate feasible, collision-free execution plans. However, in many real-world domains, such as automated warehouses, tasks are predefined, shifting the challenge to if, when, and how to execute them safely and efficiently under resource, time and motion constraints. In this paper, we formalize this as the Scheduling and Motion Planning problem for multi-object navigation in shared workspaces. We propose a novel solution framework that interleaves off-the-shelf schedulers and motion planners in an incremental learning loop. The scheduler generates candidate plans, while the motion planner checks feasibility and returns symbolic feedback, i.e., spatial conflicts and timing adjustments, to guide the scheduler towards motion-feasible solutions. We validate our proposal on logistics and job-shop scheduling benchmarks augmented with motion tasks, using state-of-the-art schedulers and sampling-based motion planners. Our results show the effectiveness of our framework in generating valid plans under complex temporal and spatial constraints, where synchronized motion is critical.
PathSpace: Rapid continuous map approximation for efficient SLAM using B-Splines in constrained environments
Simultaneous Localization and Mapping (SLAM) plays a crucial role in enabling autonomous vehicles to navigate previously unknown environments. Semantic SLAM mostly extends visual SLAM, leveraging the higher density information available to reason about the environment in a more human-like manner. This allows for better decision making by exploiting prior structural knowledge of the environment, usually in the form of labels. Current semantic SLAM techniques still mostly rely on a dense geometric representation of the environment, limiting their ability to apply constraints based on context. We propose PathSpace, a novel semantic SLAM framework that uses continuous B-splines to represent the environment in a compact manner, while also maintaining and reasoning through the continuous probability density functions required for probabilistic reasoning. This system applies the multiple strengths of B-splines in the context of SLAM to interpolate and fit otherwise discrete sparse environments. We test this framework in the context of autonomous racing, where we exploit pre-specified track characteristics to produce significantly reduced representations at comparable levels of accuracy to traditional landmark based methods and demonstrate its potential in limiting the resources used by a system with minimal accuracy loss.
Distributional Uncertainty and Adaptive Decision-Making in System Co-design
Complex engineered systems require coordinated design choices across heterogeneous components under multiple conflicting objectives and uncertain specifications. Monotone co-design provides a compositional framework for such problems by modeling each subsystem as a design problem: a feasible relation between provided functionalities and required resources in partially ordered sets. Existing uncertain co-design models rely on interval bounds, which support worst-case reasoning but cannot represent probabilistic risk or multi-stage adaptive decisions. We develop a distributional extension of co-design that models uncertain design outcomes as distributions over design problems and supports adaptive decision processes through Markov-kernel re-parameterizations. Using quasi-measurable and quasi-universal spaces, we show that the standard co-design interconnection operations remain compositional under this richer notion of uncertainty. We further introduce queries and observations that extract probabilistic design trade-offs, including feasibility probabilities, confidence bounds, and distributions of minimal required resources. A task-driven unmanned aerial vehicle case study illustrates how the framework captures risk-sensitive and information-dependent design choices that interval-based models cannot express.
Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress ICRA 2026
Most robot manipulation focuses on changing the kinematic state of objects: picking, placing, opening, or rotating them. However, a wide range of real-world manipulation tasks involve a different class of object state change, such as mashing, spreading, or slicing, where the object's physical and visual state evolve progressively without necessarily changing its position. We present SPARTA, the first unified framework for the family of object state change manipulation tasks. Our key insight is that these tasks share a common structural pattern: they involve spatially-progressing, object-centric changes that can be represented as regions transitioning from an actionable to a transformed state. Building on this insight, SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions for specific object state change tasks, to generate a) structured policy observations that strip away appearance variability, and b) dense rewards that capture incremental progress over time. These are leveraged in two SPARTA policy variants: reinforcement learning for fine-grained control without demonstrations or simulation; and greedy control for fast, lightweight deployment. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects, achieving significant improvements in training time and accuracy over sparse rewards and visual goal-conditioned baselines. Our results highlight progress-aware visual representations as a versatile foundation for the broader family of object state manipulation tasks. Project website: https://vision.cs.utexas.edu/projects/sparta-robot
comment: Accepted at ICRA 2026
World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation
Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine policies to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be combined to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets and refining policies entirely within a frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines.
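The abstract does not spell out the two-hot action encoding, so the following is only a minimal sketch of the generic two-hot idea, not the World4RL implementation. It assumes a scalar action component and a sorted list of bin centers, and spreads the value over the two nearest bins so that the weighted sum of bin centers recovers it. The names `two_hot`, `decode`, and `bins` are hypothetical.

```python
from bisect import bisect_right


def two_hot(value, bins):
    """Encode a scalar as weights on its two neighbouring bin centers.

    `bins` must be sorted in strictly increasing order (assumption).
    Values outside the bin range are clamped to the nearest edge.
    """
    if len(bins) < 2:
        raise ValueError("need at least two bins")
    x = min(max(value, bins[0]), bins[-1])  # clamp into the covered range
    hi = min(bisect_right(bins, x), len(bins) - 1)  # upper neighbour index
    lo = hi - 1  # lower neighbour index
    w_hi = (x - bins[lo]) / (bins[hi] - bins[lo])  # linear interpolation weight
    weights = [0.0] * len(bins)
    weights[lo] = 1.0 - w_hi
    weights[hi] = w_hi
    return weights


def decode(weights, bins):
    """Recover the scalar as the weighted sum of bin centers."""
    return sum(w * b for w, b in zip(weights, bins))


if __name__ == "__main__":
    bins = [-1.0, -0.5, 0.0, 0.5, 1.0]  # illustrative bin centers
    code = two_hot(0.25, bins)
    print(code, decode(code, bins))
```

Because the weights are non-negative and sum to one, the encoding can serve as a target for a categorical (cross-entropy) loss while still representing continuous values within the bin range.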
Adaptive Relative Pose Estimation Framework with Dual Noise Tuning for Safe Approaching Maneuvers
Accurate and robust relative pose estimation is crucial for enabling challenging Active Debris Removal (ADR) missions targeting tumbling derelict satellites such as ESA's ENVISAT. This work presents a complete pipeline integrating advanced computer vision techniques with adaptive nonlinear filtering to address this challenge. A Convolutional Neural Network (CNN), enhanced with image preprocessing, detects structural markers (corners) from chaser imagery, whose 2D coordinates are converted to 3D measurements using camera modeling. These measurements are fused within an Unscented Kalman Filter (UKF) framework, selected for its ability to handle nonlinear relative dynamics, to estimate the full relative pose. Key contributions include the integrated system architecture and a dual adaptive strategy within the UKF: dynamic tuning of the measurement noise covariance compensates for varying CNN measurement uncertainty, while adaptive tuning of the process noise covariance, utilizing measurement residual analysis, accounts for unmodeled dynamics or maneuvers online. This dual adaptation enhances robustness against both measurement imperfections and dynamic model uncertainties. The performance of the proposed adaptive integrated system is evaluated through high-fidelity simulations using a realistic ENVISAT model, comparing estimates against ground truth under various conditions, including measurement outages. This comprehensive approach offers an enhanced solution for robust onboard relative navigation, significantly advancing the capabilities required for safe proximity operations during ADR missions.
Feasibility Analysis and Constraint Selection in Optimization-Based Controllers
Control synthesis under constraints is at the forefront of research on autonomous systems, in part due to its broad application from low-level control to high-level planning, where computing control inputs is typically cast as a constrained optimization problem. Assessing feasibility of the constraints and selecting among subsets of feasible constraints is a challenging yet crucial problem. In this work, we provide a novel theoretical analysis that yields necessary and sufficient conditions for feasibility assessment of linear constraints and based on this analysis, we develop novel methods for feasible constraint selection in the context of control of autonomous systems. Through a series of simulations, we demonstrate that our algorithms achieve performance comparable to state-of-the-art methods while offering improved computational efficiency. Importantly, our analysis provides a novel theoretical framework for assessing, analyzing and handling constraint infeasibility.
comment: 13 pages, 4 figures, submitted to IEEE Transactions on Automatic Control
CageDroneRF: A Large-Scale RF Benchmark and Toolkit for Drone Perception
We present CageDroneRF (CDRF), a large-scale benchmark for Radio-Frequency (RF) drone detection and identification built from real-world captures and systematically generated synthetic variants. CDRF addresses the scarcity and limited diversity of existing RF datasets by coupling extensive raw recordings with a principled augmentation pipeline that (i) precisely controls Signal-to-Noise Ratio (SNR), (ii) injects interfering emitters, and (iii) applies frequency shifts with label-consistent bounding-box recomputation for detection. The dataset spans a wide range of contemporary drone models, many of which are unavailable in current public datasets, and diverse acquisition conditions, derived from data collected at the Rowan University campus and within a controlled RF-cage facility. CDRF is released with interoperable open-source tools for data generation, preprocessing, augmentation, and evaluation that also operate on existing public benchmarks. It enables standardized benchmarking for classification, open-set recognition, and object detection, supporting rigorous comparisons and reproducible pipelines. By releasing this comprehensive benchmark and tooling, we aim to accelerate progress toward robust, generalizable RF perception models.
EgoSpot: Egocentric Multimodal Control for Hands-Free Mobile Manipulation
We propose a novel hands-free control framework for the Boston Dynamics Spot robot using the Microsoft HoloLens 2 mixed-reality headset. Enabling accessible robot control is critical for allowing individuals with physical disabilities to benefit from robotic assistance in daily activities, teleoperation, and remote interaction tasks. However, most existing robot control interfaces rely on manual input devices such as joysticks or handheld controllers, which can be difficult or impossible for users with limited motor capabilities. To address this limitation, we develop an intuitive multimodal control system that leverages egocentric sensing from a wearable device. Our system integrates multiple control signals, including eye gaze, head gestures, and voice commands, to enable hands-free interaction. These signals are fused to support real-time control of both robot locomotion and arm manipulation. Experimental results show that our approach achieves performance comparable to traditional joystick-based control in terms of task completion time and user experience, while significantly improving accessibility and naturalness of interaction. Our results highlight the potential of egocentric multimodal interfaces to make mobile manipulation robots more inclusive and usable for a broader population. A demonstration of the system is available on our project webpage.
Uncertainty-Aware Multi-Robot Task Allocation With Strongly Coupled Inter-Robot Rewards
Allocating tasks to heterogeneous robot teams in environments with uncertain task requirements is a fundamentally challenging problem. Redundantly assigning multiple robots to such tasks is overly conservative, while purely reactive strategies risk costly delays in task completion when the uncertain capabilities become necessary. This paper introduces an auction-based task allocation algorithm that explicitly models uncertain task requirements, leveraging a novel strongly coupled formulation to allocate tasks such that robots with potentially required capabilities are naturally positioned near uncertain tasks. This approach enables robots to remain productive on nearby tasks while simultaneously mitigating large delays in completion time when their capabilities are required. Through a set of simulated disaster relief missions with task deadline constraints, we demonstrate that the proposed approach yields up to a 15% increase in expected mission value compared to redundancy-based methods. Furthermore, we propose a novel framework to approximate uncertainty arising from unmodeled changes in task requirements by leveraging the natural delay between encountering unexpected environmental conditions and confirming whether additional capabilities are required to complete a task. We show that our approach achieves up to an 18% increase in expected mission value using this framework compared to reactive methods that do not leverage this delay.
comment: 9 pages
Multi-Robot Coordination for Planning under Context Uncertainty
Real-world robots often operate in settings where objective priorities depend on the underlying context of operation. When the underlying context is unknown a priori, multiple robots may have to coordinate to gather informative observations to infer the context, since acting based on an incorrect context can lead to misaligned and unsafe behavior. Once the underlying true context is inferred, the robots optimize their task-specific objectives in the preference order induced by the context. We formalize this problem as a Multi-Robot Context-Uncertain Stochastic Shortest Path (MR-CUSSP), which captures context-relevant information at landmark states through joint observations. Our two-stage solution approach is composed of: (1) CIMOP (Coordinated Inference for Multi-Objective Planning) to compute plans that guide robots toward informative landmarks to efficiently infer the true context, and (2) LCBS (Lexicographic Conflict-Based Search) for collision-free multi-robot path planning with lexicographic objective preferences, induced by the context. We evaluate the algorithms using three simulated domains and demonstrate their practical applicability using five mobile robots in the salp domain setup.
comment: 8 pages, 6 figures
Multiagent Systems
Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably
AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents' advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that 'reasonably reasoning' agents, i.e., agents capable of forming beliefs about others' strategies from previous observation and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner's dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.
Computationally Efficient Density-Driven Optimal Control via Analytical KKT Reduction and Contractive MPC
Efficient coordination for collective spatial distribution is a fundamental challenge in multi-agent systems. Prior research on Density-Driven Optimal Control (D2OC) established a framework to match agent trajectories to a desired spatial distribution. However, implementing this as a predictive controller requires solving a large-scale Karush-Kuhn-Tucker (KKT) system, whose computational complexity grows cubically with the prediction horizon. To resolve this, we propose an analytical structural reduction that transforms the T-horizon KKT system into a condensed quadratic program (QP). This formulation achieves O(T) linear scalability, significantly reducing the online computational burden compared to conventional O(T^3) approaches. Furthermore, to ensure rigorous convergence in dynamic environments, we incorporate a contractive Lyapunov constraint and prove the Input-to-State Stability (ISS) of the closed-loop system against reference propagation drift. Numerical simulations verify that the proposed method facilitates rapid density coverage with substantial computational speed-up, enabling long-horizon predictive control for large-scale multi-agent swarms.
Interleaved Information Structures in Dynamic Games: A General Framework with Application to the Linear-Quadratic Case
A fundamental problem in noncooperative dynamic game theory is the computation of Nash equilibria under different information structures, which specify the information available to each agent during decision-making. Prior work has extensively studied equilibrium solutions for two canonical information structures: feedback, where agents observe the current state at each time, and open-loop, where agents only observe the initial state. However, these paradigms are often too restrictive to capture realistic settings exhibiting interleaved information structures, in which each agent observes only a subset of other agents at every timestep. To date, there is no systematic framework for modeling and solving dynamic games under arbitrary interleaved information structures. To this end, we make two main contributions. First, we introduce a method to model deterministic dynamic games with arbitrary interleaved information structures as Mathematical Program Networks (MPNs), where the network structure encodes the informational dependencies between agents. Second, for linear-quadratic (LQ) dynamic games, we leverage the MPN formulation to develop a systematic procedure for deriving Riccati-like equations that characterize Nash equilibria. Finally, we illustrate our approach through an example involving three agents exhibiting a cyclic information structure.
comment: 6 pages, 3 figures
Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization
Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and an interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
Evolutionarily Stable Stackelberg Equilibrium
We present a new solution concept called evolutionarily stable Stackelberg equilibrium (SESS). We study the Stackelberg evolutionary game setting in which there is a single leading player and a symmetric population of followers. The leader selects an optimal mixed strategy, anticipating that the follower population plays an evolutionarily stable strategy (ESS) in the induced subgame and may satisfy additional ecological conditions. We consider both leader-optimal and follower-optimal selection among ESSs, which arise as special cases of our framework. Prior approaches to Stackelberg evolutionary games either define the follower response via evolutionary dynamics or assume rational best-response behavior, without explicitly enforcing stability against invasion by mutations. We present algorithms for computing SESS in discrete and continuous games, and validate the latter empirically. Our model applies naturally to biological settings; for example, in cancer treatment the leader represents the physician and the followers correspond to competing cancer cell phenotypes.
Optimal Path Planning in Hostile Environments ICAPS-2026
Coordinating agents through hazardous environments, such as aid-delivering drones navigating conflict zones or field robots traversing deployment areas filled with obstacles, poses fundamental planning challenges. We introduce and analyze the computational complexity of a new multi-agent path planning problem that captures this setting. A group of identical agents begins at a common start location and must navigate a graph-based environment to reach a common target. The graph contains hazards that eliminate agents upon contact but then enter a known cooldown period before reactivating. In this discrete-time, fully-observable, deterministic setting, the planning task is to compute a movement schedule that maximizes the number of agents reaching the target. We first prove that, despite the exponentially large space of feasible plans, optimal plans require only polynomially-many steps, establishing membership in NP. We then show that the problem is NP-hard even when the environment graph is a tree. On the positive side, we present a polynomial-time algorithm for graphs consisting of vertex-disjoint paths from start to target. Our results establish a rich computational landscape for this problem, identifying both intractable and tractable fragments.
comment: Accepted for publication at ICAPS-2026 (25 pages, 6 figures)
I Can't Believe It's Corrupt: Evaluating Corruption in Multi-Agent Governance Systems
Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority. We present evidence that integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption. We evaluate multi-agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments. While we advance this position, the core contribution is empirical: among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity, with large differences across regimes and model-governance pairings. Lightweight safeguards can reduce risk in some settings but do not consistently prevent severe failures. These results imply that institutional design is a precondition for safe delegation: before real authority is assigned to LLM agents, systems should undergo stress testing under governance-like constraints with enforceable rules, auditable logs, and human oversight on high-impact actions.
comment: Short Paper, Preprint
TrustFlow: Topic-Aware Vector Reputation Propagation for Multi-Agent Ecosystems
We introduce TrustFlow, a reputation propagation algorithm that assigns each software agent a multi-dimensional reputation vector rather than a scalar score. Reputation is propagated through an interaction graph via topic-gated transfer operators that modulate each edge by its content embedding, with convergence to a unique fixed point guaranteed by the contraction mapping theorem. We develop a family of Lipschitz-1 transfer operators and composable information-theoretic gates that achieve up to 98% multi-label Precision@5 on dense graphs and 78% on sparse ones. On a benchmark of 50 agents across 8 domains, TrustFlow resists sybil attacks, reputation laundering, and vote rings with at most 4 percentage-point precision impact. Unlike PageRank and Topic-Sensitive PageRank, TrustFlow produces vector reputation that is directly queryable by dot product in the same embedding space as user queries.
comment: 14 pages, 3 figures, demo at https://robutler.ai
Reason-to-Transmit: Deliberative Adaptive Communication for Cooperative Perception
Cooperative perception among autonomous agents overcomes the limitations of single-agent sensing, but bandwidth constraints in vehicle-to-everything (V2X) networks require efficient communication policies. Existing approaches rely on reactive mechanisms, such as confidence maps, learned gating, or sparse masks, to decide what to transmit, without reasoning about why a message benefits the receiver. We introduce Reason-to-Transmit (R2T), a framework that equips each agent with a lightweight transformer-based module that reasons over local scene context, estimated neighbor information gaps, and bandwidth budget to make per-region transmission decisions. Trained end-to-end with a bandwidth-aware objective, R2T is evaluated against nine baselines in a multi-agent bird's-eye-view perception environment. Any communication improves performance by about 58% AP over no communication. At low bandwidth, all selective methods perform similarly, but R2T shows clear gains under high occlusion, where information asymmetry is greatest, approaching oracle performance. All methods degrade gracefully under packet drops up to 50%, showing robustness to communication failures. These results indicate that while fusion design dominates performance, deliberative communication provides additional gains in challenging scenarios. R2T introduces a reasoning-based approach to communication, enabling more efficient and context-aware information sharing in cooperative perception.
On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.
comment: We discover and theoretically explain why and when a single global parameter merging in decentralized learning can recover the performance of federated learning, even in highly heterogeneous and communication-constrained environments
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration AAAI-26
While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a "language model graph" that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
comment: Accepted at the AAAI-26 Workshop on LLM-based Multi-Agent Systems: Towards Responsible, Reliable, and Scalable Agentic Systems (LaMAS 2026) as an oral presentation
StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models AAAI 2026
Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.
comment: Accepted by AAAI 2026. Project: https://storyboxproject.github.io
Aegis: Automated Error Generation and Attribution for Multi-Agent Systems
Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi-agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using an LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging fine-grained labels and the structured arrangement of positive-negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/Aegis-Website/.
Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
Large-scale networked multi-agent systems increasingly underpin critical infrastructure, yet their collective behavior can drift toward undesirable emergent norms such as collusion, resource hoarding, and implicit unfairness. We present the Adaptive Accountability Framework (AAF), an end-to-end runtime layer that (i) records cryptographically verifiable interaction provenance, (ii) detects distributional change points in streaming traces, (iii) attributes responsibility via a causal influence graph, and (iv) applies cost-bounded interventions (reward shaping and targeted policy patching) to steer the system back toward compliant behavior. We establish a bounded-compromise guarantee: if the expected cost of intervention exceeds an adversary's expected payoff, the long-run fraction of compromised interactions converges to a value strictly below one. We evaluate AAF in a large-scale factorial simulation suite (87,480 runs across two tasks; up to 100 agents plus a 500-agent scaling sweep; full and partial observability; Byzantine rates up to 10%; 10 seeds per regime). Across 324 regimes, AAF lowers the executed compromise ratio relative to a Proximal Policy Optimization baseline in 96% of regimes (median relative reduction 11.9%) while preserving social welfare (median change 0.4%). Under adversarial injections, AAF detects norm violations with a median delay of 71 steps (interquartile range 39-177) and achieves a mean top-ranked attribution accuracy of 0.97 at 10% Byzantine rate.
The Coordination Gap: Multi-Agent Alternation Metrics for Temporal Fairness in Repeated Games
Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind (they cannot distinguish structured alternation from monopolistic or random access patterns) and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation, a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
comment: 41 pages, 5 figures, 4 tables, 1 supplementary pdf. Submitted to Social Choice & Welfare
2-D Directed Formation Control Based on Bipolar Coordinates
This work proposes a novel 2-D formation control scheme for acyclic triangulated directed graphs (a class of minimally acyclic persistent graphs) based on bipolar coordinates with (almost) global convergence to the desired shape. Prescribed performance control is employed to devise a decentralized control law that avoids singularities and introduces robustness against external disturbances while ensuring predefined transient and steady-state performance for the closed-loop system. Furthermore, it is shown that the proposed formation control scheme can handle formation maneuvering, scaling, and orientation specifications simultaneously. Additionally, the proposed control law is implementable in agents' arbitrarily oriented local coordinate frames using only low-cost onboard vision sensors, which are favorable for practical applications. Finally, a formation maneuvering simulation study verifies the proposed approach.
comment: 16 pages, 10 figures; minor typos corrected; no change in results
Verifiable Semantics for Agent-to-Agent Communication
Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used. Natural language is interpretable but vulnerable to semantic drift, while learned protocols are efficient but opaque. We propose a certification protocol based on the stimulus-meaning model, where agents are tested on shared observable events and terms are certified if empirical disagreement falls below a statistical threshold. In this protocol, agents restricting their reasoning to certified terms ("core-guarded reasoning") achieve provably bounded disagreement. We also outline mechanisms for detecting drift (recertification) and recovering shared vocabulary (renegotiation). In simulations with varying degrees of semantic divergence, core-guarding reduces disagreement by 72-96%. In a validation with fine-tuned language models, disagreement is reduced by 51%. Our framework provides a first step towards verifiable agent-to-agent communication.
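The certification step described in the abstract above (certify a term when empirical disagreement falls below a statistical threshold) can be sketched as follows. This is an illustrative reading, not the paper's protocol: the one-sided Hoeffding bound, the threshold `tau`, the confidence parameter `delta`, and the function name `certify_term` are all assumptions made for this example.

```python
import math

# Hypothetical sketch: certify a term when a high-confidence upper bound on the
# true disagreement rate stays at or below a tolerance tau.

def certify_term(disagreements, trials, tau=0.05, delta=0.05):
    """Return (certified, upper_bound).

    disagreements: number of shared test events on which two agents disagreed.
    trials: number of shared test events (assumed i.i.d. for the bound to hold).
    With probability at least 1 - delta, the true disagreement rate is at most
    upper_bound, by the one-sided Hoeffding inequality.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    rate = disagreements / trials
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * trials))
    upper_bound = rate + margin
    return upper_bound <= tau, upper_bound
```

Under these assumptions, zero observed disagreements on 100 events is not enough to certify at `tau = 0.05`, while zero disagreements on 1000 events is, which shows how the number of shared observations drives certification.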
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning CVPR2026
This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest performing agent) and re-staging (e.g., new-clue and matching block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 11 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
comment: Accepted by CVPR2026
Leader-following Consensus over Jointly Connected Switching Networks is Achievable for Exponentially Unstable Linear Systems
The leader-following consensus problem for general linear multi-agent systems over jointly connected switching networks has been a challenging problem and the solvability of the problem has been limited to the class of linear multi-agent systems whose system matrix is marginally stable. This condition is restrictive since it even excludes the most commonly used double-integrator system. This paper presents a breakthrough by demonstrating that leader-following exponential consensus is achievable for general linear multi-agent systems over jointly connected switching networks, even when the system matrix is exponentially unstable. The degree of instability can be explicitly characterized by two key quantities that arise from the jointly connected condition on a switching graph. By exploiting duality, we further show that the output-based distributed observer design problem for a general leader system is solvable over jointly connected switching networks, even when the system matrix is exponentially unstable. This is also in sharp contrast to the existing distributed observers, which rely on the assumption that the leader system is marginally stable.
Multi-Robot Coordination for Planning under Context Uncertainty
Real-world robots often operate in settings where objective priorities depend on the underlying context of operation. When the underlying context is unknown a priori, multiple robots may have to coordinate to gather informative observations to infer the context, since acting based on an incorrect context can lead to misaligned and unsafe behavior. Once the underlying true context is inferred, the robots optimize their task-specific objectives in the preference order induced by the context. We formalize this problem as a Multi-Robot Context-Uncertain Stochastic Shortest Path (MR-CUSSP), which captures context-relevant information at landmark states through joint observations. Our two-stage solution approach is composed of: (1) CIMOP (Coordinated Inference for Multi-Objective Planning) to compute plans that guide robots toward informative landmarks to efficiently infer the true context, and (2) LCBS (Lexicographic Conflict-Based Search) for collision-free multi-robot path planning with lexicographic objective preferences, induced by the context. We evaluate the algorithms using three simulated domains and demonstrate their practical applicability using five mobile robots in the salp domain setup.
comment: 8 pages, 6 figures
Systems and Control (EESS)
RadioDiff-FS: Physics-Informed Manifold Alignment in Few-Shot Diffusion Models for High-Fidelity Radio Map Construction
Radio maps (RMs) provide spatially continuous propagation characterizations essential for 6G network planning, but high-fidelity RM construction remains challenging. Rigorous electromagnetic solvers incur prohibitive computational latency, while data-driven models demand massive labeled datasets and generalize poorly from simplified simulations to complex multipath environments. This paper proposes RadioDiff-FS, a few-shot diffusion framework that adapts a pre-trained main-path generator to multipath-rich target domains with only a small number of high-fidelity samples. The adaptation is grounded in a theoretical decomposition of the multipath RM into a dominant main-path component and a directionally sparse residual. This decomposition shows that the cross-domain shift corresponds to a bounded and geometrically structured feature translation rather than an arbitrary distribution change. A Direction-Consistency Loss (DCL) is then introduced to constrain diffusion score updates along physically plausible propagation directions, suppressing phase-inconsistent artifacts that arise in the low-data regime. Experiments show that RadioDiff-FS reduces NMSE by 59.5% on static RMs and by 74.0% on dynamic RMs relative to the vanilla diffusion baseline, achieving an SSIM of 0.9752 and a PSNR of 36.37 dB under severely limited supervision.
A Passive Elastic-Folding Mechanism for Stackable Airdrop Sensors ICRA 2026
Air-dispersed sensor networks deployed from aerial robotic systems (e.g., UAVs) provide a low-cost approach to wide-area environmental monitoring. However, existing methods often rely on active actuators for mid-air shape or trajectory control, increasing both power consumption and system cost. Here, we introduce a passive elastic-folding hinge mechanism that transforms sensors from a flat, stackable form into a three-dimensional structure upon release. Hinges are fabricated by laminating commercial sheet materials with rigid printed circuit boards (PCBs) and programming fold angles through a single oven-heating step, enabling scalable production without specialized equipment. Our geometric model links laminate geometry, hinge mechanics, and resulting fold angle, providing a predictive design methodology for target configurations. Laboratory tests confirmed fold angles between 10 degrees and 100 degrees, with a standard deviation of 4 degrees and high repeatability. Field trials further demonstrated reliable data collection and LoRa transmission during dispersion, while the Horizontal Wind Model (HWM)-based trajectory simulations indicated strong potential for wide-area sensing exceeding 10 km.
comment: 8 pages, 8 figures, The 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
BeamAgent: LLM-Aided MIMO Beamforming with Decoupled Intent Parsing and Alternating Optimization for Joint Site Selection and Precoding
Integrating large language models (LLMs) into wireless communication optimization is a promising yet challenging direction. Existing approaches either use LLMs as black-box solvers or code generators, tightly coupling them with numerical computation. However, LLMs lack the precision required for physical-layer optimization, and the scarcity of wireless training data makes domain-specific fine-tuning impractical. We propose BeamAgent, an LLM-aided MIMO beamforming framework that explicitly decouples semantic intent parsing from numerical optimization. The LLM serves solely as a semantic translator that converts natural language descriptions into structured spatial constraints. A dedicated gradient-based optimizer then jointly solves the discrete base station site selection and continuous precoding design through an alternating optimization algorithm. A scene-aware prompt enables grounded spatial reasoning without fine-tuning, and a multi-round interaction mechanism with dual-layer intent classification ensures robust constraint verification. A penalty-based loss function enforces dark-zone power constraints while releasing optimization degrees of freedom for bright-zone gain maximization. Experiments on a ray-tracing-based urban MIMO scenario show that BeamAgent achieves a bright-zone power of 84.0 dB, outperforming exhaustive zero-forcing by 7.1 dB under the same dark-zone constraint. The end-to-end system reaches within 3.3 dB of the expert upper bound, with the full optimization completing in under 2 s on a laptop.
Learn for Variation: Variationally Guided AAV Trajectory Learning in Differentiable Environments
Autonomous aerial vehicles (AAVs) empower sixth-generation (6G) Internet-of-Things (IoT) networks through mobility-driven data collection. However, conventional reward-driven reinforcement learning for AAV trajectory planning suffers from severe credit assignment issues and training instability, because sparse scalar rewards fail to capture the long-term and nonlinear effects of sequential movements. To address these challenges, this paper proposes Learn for Variation (L4V), a gradient-informed trajectory learning framework that replaces high-variance scalar reward signals with dense and analytically grounded policy gradients. Particularly, the coupled evolution of AAV kinematics, distance-dependent channel gains, and per-user data-collection progress is first unrolled into an end-to-end differentiable computational graph. Backpropagation through time then serves as a discrete adjoint solver, which propagates exact sensitivities from the cumulative mission objective to every control action and policy parameter. These structured gradients are used to train a deterministic neural policy with temporal smoothness regularization and gradient clipping. Extensive simulations demonstrate that L4V consistently outperforms representative baselines, including a genetic algorithm, DQN, A2C, and DDPG, in mission completion time, average transmission rate, and training cost.
Holistic Energy Performance Management: Enablers, Capabilities, and Features
Energy consumption is a significant concern for mobile network operators, and to enable further network energy improvements it is also an important target when developing the emerging 6G standard. In this paper we show that, despite the existence of many energy-saving features in 5G new radio (NR) networks, activating them in isolation yields only suboptimal savings and often compromises other network key performance indicators (KPIs) such as coverage or latency. We first introduce a compact taxonomy that distinguishes hardware capabilities from higher-layer features. Features fall into two classes: (i) signaling and scheduling mechanisms that create idle windows, and (ii) features that utilize those windows to save energy. We then present a feature orchestrator as a logical node to coordinate between features to maximize the gain. Using a 3GPP-aligned simulator with product-realistic parameters, we show that coordinating lean NR, scheduling, and advanced sleep modes significantly reduces gNodeB (gNB) energy consumption with negligible throughput loss, compared to the uncoordinated scenario. We conclude by outlining open issues in observability, system dynamics, coordination, and intelligent automation for energy performance management.
comment: 7 Pages, Accepted in IEEE Communications Magazine
Physics-grounded Mechanism Design for Spectrum Sharing between Passive and Active Users
We propose a physics-grounded mechanism design for dynamic spectrum sharing that bridges the gap between radiometric retrieval constraints and economic incentives. We formulate the coexistence problem between active and passive users as a Vickrey-Clarke-Groves (VCG) auction mechanism, where the radiometer dynamically procures "quiet" time-frequency tiles from active users based on the marginal reduction in retrieval error variance. This approach ensures allocative efficiency and dominant-strategy incentive compatibility (DSIC). To overcome the computational intractability of exact VCG on large grids, we derive an approximation algorithm by using the monotone submodularity induced by the radiometer equation. AMSR-2-based simulations show that the approach avoids high-cost tiles by aggregating low-cost spectrum across time and frequency. In an interference-trap case study, the proposed framework reduces procurement costs by about 60% over a fixed-band baseline while satisfying accuracy targets.
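The role of submodularity in the abstract above can be illustrated with a small greedy sketch. It is not the paper's approximation algorithm and it computes no VCG payments. It assumes the standard radiometer equation, in which the noise-equivalent error is T_sys / sqrt(B * tau), so the error variance is T_sys^2 divided by the accumulated bandwidth-time product. The tile tuples, the always-available `base_bt` product, and the function name `greedy_procure` are hypothetical.

```python
# Hypothetical sketch: greedily buy quiet tiles by variance reduction per unit cost.
# Variance model (assumption): var = t_sys**2 / (accumulated bandwidth * time).

def greedy_procure(tiles, t_sys, base_bt, target_var):
    """tiles: list of (name, bandwidth_hz, seconds, cost) with cost > 0.
    base_bt: bandwidth-time product already available without buying anything.
    Returns (chosen_names, final_variance, total_cost)."""
    chosen, bt, total_cost = [], base_bt, 0.0
    remaining = list(tiles)
    var = t_sys ** 2 / bt
    while var > target_var and remaining:
        # Marginal variance reduction per unit cost; reductions shrink as bt
        # grows, which is the diminishing-returns (submodular) structure.
        def ratio(tile):
            _, bandwidth, seconds, cost = tile
            return (var - t_sys ** 2 / (bt + bandwidth * seconds)) / cost
        best = max(remaining, key=ratio)
        remaining.remove(best)
        chosen.append(best[0])
        bt += best[1] * best[2]
        total_cost += best[3]
        var = t_sys ** 2 / bt
    return chosen, var, total_cost
```

In this toy model an expensive tile is skipped whenever cheaper tiles deliver the same accumulated bandwidth-time product, which mirrors the abstract's observation that low-cost spectrum is aggregated across time and frequency.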
Assessing performance tradeoffs in hierarchical organizations using a diffusive coupling model
We study a continuous-time dynamical system of nodes diffusively coupled over a hierarchical network to examine the efficiency and performance tradeoffs that organizations, teams, and command and control units face while achieving coordination and sharing information across layers. Specifically, after defining a network structure that captures real-world features of hierarchical organizations, we use linear systems theory and perturbation theory to characterize the rate of convergence to a consensus state, and how effectively information can propagate through the network, depending on the breadth of the organization and the strength of inter-layer communication. Interestingly, our analytical insights highlight a fundamental performance tradeoff. Namely, networks that favor fast coordination will have decreased ability to share information that is generated in the lower layers of the organization and is to be passed up the hierarchy. Numerical results validate and extend our theoretical results.
comment: Paper submitted to IFAC for publication
Mean-field control barrier functions for stochastic multi-agent systems
Many applications involving multi-agent systems require fulfilling safety constraints. Control barrier functions offer a systematic framework to enforce forward invariance of safety sets. Recent work extended this paradigm to mean-field scenarios, where the number of agents is large enough to make density-space descriptions a reasonable workaround for the curse of dimensionality. However, an open gap in the recent literature concerns the development of mean-field control barrier functions for Fokker-Planck (advection-diffusion) equations. In this work, we address this gap, enabling safe mean-field control of agents with stochastic microscopic dynamics. We provide bounded stability guarantees under safety corrections and corroborate our results through numerical simulations in two representative scenarios, coverage and shepherding control of multi-agent systems.
WarPGNN: A Parametric Thermal Warpage Analysis Framework with Physics-aware Graph Neural Network
With the advent of system-in-package (SiP) chiplet-based design and heterogeneous 2.5D/3D integration, thermal-induced warpage has become a critical reliability concern. While conventional numerical approaches can deliver highly accurate results, they often incur prohibitively high computational costs, limiting their scalability for complex chiplet-package systems. In this paper, we present WarPGNN, an efficient and accurate parametric thermal warpage analysis framework powered by Graph Neural Networks (GNNs). By operating directly on graphs constructed from the floorplans, WarPGNN enables fast warpage-aware floorplan exploration and exhibits strong transferability across diverse package configurations. Our method first encodes multi-die floorplans into reduced Transitive Closure Graphs (rTCGs), then a Graph Convolution Network (GCN)-based encoder extracts hierarchical structural features, followed by a U-Net inspired decoder that reconstructs warpage maps from graph feature embeddings. Furthermore, to address the long-tailed pattern of warpage data distribution, we developed a physics-informed loss and revised a message-passing encoder based on Graph Isomorphic Network (GIN) that further enhance learning performance for extreme cases and expressiveness of graph embeddings. Numerical results show that WarPGNN achieves more than 205.91x speedup compared with the 2-D efficient FEM-based method and over 119766.64x acceleration with 3-D FEM method COMSOL, respectively, while maintaining comparable accuracy at only 1.26% full-scale normalized RMSE and 2.21% warpage value error. Compared with recent DeepONet-based model, our method achieved comparable prediction accuracy and inference speedup with 3.4x lower training time. In addition, WarPGNN demonstrates remarkable transferability on unseen datasets with up to 3.69% normalized RMSE and similar runtime.
comment: 6 Pages, ACM format
HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage
This paper transfers three statistical methods from particle physics to multirotor propeller fault detection: the likelihood ratio test (LRT) for binary detection, the CLs modified frequentist method for false alarm rate control, and sequential neural posterior estimation (SNPE) for quantitative fault characterization. Operating on spectral features tied to rotor harmonic physics, the system returns three outputs: binary detection, controlled false alarm rates, and calibrated posteriors over fault severity and motor location. On UAV-FD, a hexarotor dataset of 18 real flights with 5% and 10% blade damage, leave-one-flight-out cross-validation gives AUC 0.862 +/- 0.007 (95% CI: 0.849-0.876), outperforming CUSUM (0.708 +/- 0.010), autoencoder (0.753 +/- 0.009), and LSTM autoencoder (0.551). At 5% false alarm rate the system detects 93% of significant and 81% of subtle blade damage. On PADRE, a quadrotor platform, AUC reaches 0.986 after refitting only the generative models. SNPE gives a full posterior over fault severity (90% credible interval coverage 92-100%, MAE 0.012), so the output includes uncertainty rather than just a point estimate or fault flag. Per-flight sequential detection achieves 100% fault detection with 94% overall accuracy.
comment: 12 Pages, 8 Figures
Fundamental Limits for Sensor-Based Control via the Gibbs Variational Principle
Fundamental limits on the performance of feedback controllers are essential for benchmarking algorithms, guiding sensor selection, and certifying task feasibility, yet few general-purpose tools exist for computing them. Existing information-theoretic approaches overestimate the information a sensor must provide by evaluating it against the uncontrolled system, producing bounds that degrade precisely when feedback is most valuable. We derive a lower bound on the minimum expected cost of any causal feedback controller under partial observations by applying the Gibbs variational principle to the joint path measure over states and observations. The bound applies to nonlinear, nonholonomic, and hybrid dynamics with unbounded costs and admits a self-consistent refinement: any good controller concentrates the state, which limits the information the sensor can extract, which tightens the bound. The resulting fixed-point equation has a unique solution computable by bisection, and we provide conditions under which the free energy minimization is provably convex, yielding a certifiably correct numerical bound. On a nonlinear Dubins car tracking problem, the self-consistent bound captures most of the optimal cost across sensor noise levels, while the open-loop variant is vacuous at low noise.
comment: 6 pages, 1 figure
Generalizations of Backup Control Barrier Functions: Expansion and Adaptation for Input-Bounded Safety-Critical Control
Guaranteeing the safety of nonlinear systems with bounded inputs remains a key challenge in safe autonomy. Backup control barrier functions (bCBFs) provide a powerful mechanism for constructing controlled invariant sets by propagating trajectories under a pre-verified backup controller to a forward invariant backup set. While effective, the standard bCBF method utilizes the same backup controller for both set expansion and safety certification, which can restrict the expanded safe set and lead to conservative dynamic behavior. In this study, we generalize the bCBF framework by separating the set-expanding controller from the verified backup controller, thereby enabling a broader class of expansion strategies while preserving formal safety guarantees. We establish sufficient conditions for forward invariance of the resulting implicit safe set and show how the generalized construction recovers existing bCBF methods as special cases. Moreover, we extend the proposed framework to parameterized controller families, enabling online adaptation of the expansion controller while maintaining safety guarantees in the presence of input bounds.
comment: 6 pages, 2 figures
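The idea of an implicit safe set defined by propagating trajectories under a backup controller can be sketched on a toy system. Everything below is an assumption made for illustration: a double integrator with bounded input, a position limit as the safe set, maximum braking as the backup controller, and explicit Euler integration. It shows the standard single-controller construction, not the generalized one proposed in the paper.

```python
def in_implicit_safe_set(x, v, x_max=1.0, u_max=1.0, dt=1e-3, horizon=5.0):
    """Check membership by rolling out the backup (braking) controller.

    Toy double integrator: x' = v, v' = u, with |u| <= u_max.
    Safe set: x <= x_max. Backup set: x <= x_max and v <= 0.
    """
    steps = int(horizon / dt)
    for _ in range(steps):
        if x > x_max:
            return False      # backup trajectory leaves the safe set
        if v <= 0.0:
            return True       # reached the backup set while still safe
        u = -u_max            # backup controller: brake as hard as allowed
        x += v * dt           # explicit Euler step
        v += u * dt
    return False              # backup set not reached within the horizon
```

For example, with the default limits a state at `x = 0` moving at `v = 1` can stop in time, while the same speed at `x = 0.8` cannot.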
Deceiving Flexibility: A Stealthy False Data Injection Model in Vehicle-to-Grid Coordination
Electric vehicles (EVs) in Vehicle-to-Grid (V2G) systems act as distributed energy resources that support grid stability. Centralized coordination such as the extended State Space Model (eSSM) enhances scalability and estimation efficiency but may introduce new cyber-attack surfaces. This paper presents a stealthy False Data Injection Attack (FDIA) targeting eSSM-based V2G coordination. Unlike prior studies that assume attackers can disrupt physical charging or discharging processes, we consider an adversary who compromises only a subset of EVs and whose influence is limited to manipulating reported State of Charge (SoC) and power measurements. By doing so, the attacker can deceive the operator's perception of fleet flexibility while remaining consistent with model-based expectations, thus evading anomaly detection. Numerical simulations show that the proposed stealthy FDIA can deteriorate grid frequency stability even without direct access to control infrastructure. These findings highlight the need for enhanced detection and mitigation mechanisms tailored to aggregated V2G frameworks.
Topological Obstructions to the Existence of Control Barrier Functions
In 1983, Brockett developed a topological necessary condition for the existence of continuous, asymptotically stabilizing control laws. Building upon recent work on necessary conditions for set stabilization, we develop Brockett-like necessary conditions for the existence of control barrier functions (CBFs). By leveraging the unique geometry of CBF safe sets, we provide simple and self-contained derivations of necessary conditions for the existence of CBFs and their safe, continuous controllers. We demonstrate the application of these conditions to instructive examples and kinematic nonholonomic systems, and discuss their relationship to Brockett's necessary condition.
comment: 6 pages, 3 figures
Interleaved Information Structures in Dynamic Games: A General Framework with Application to the Linear-Quadratic Case
A fundamental problem in noncooperative dynamic game theory is the computation of Nash equilibria under different information structures, which specify the information available to each agent during decision-making. Prior work has extensively studied equilibrium solutions for two canonical information structures: feedback, where agents observe the current state at each time, and open-loop, where agents only observe the initial state. However, these paradigms are often too restrictive to capture realistic settings exhibiting interleaved information structures, in which each agent observes only a subset of other agents at every timestep. To date, there is no systematic framework for modeling and solving dynamic games under arbitrary interleaved information structures. To this end, we make two main contributions. First, we introduce a method to model deterministic dynamic games with arbitrary interleaved information structures as Mathematical Program Networks (MPNs), where the network structure encodes the informational dependencies between agents. Second, for linear-quadratic (LQ) dynamic games, we leverage the MPN formulation to develop a systematic procedure for deriving Riccati-like equations that characterize Nash equilibria. Finally, we illustrate our approach through an example involving three agents exhibiting a cyclic information structure.
comment: 6 pages, 3 figures
A Distributionally Robust Optimal Control Approach for Differentially Private Dynamical Systems
In this paper, we develop a distributionally robust optimal control approach for differentially private dynamical systems, enabling a plant to securely outsource control computation to an untrusted remote server. We consider a plant that ensures differential privacy of its state trajectory by injecting calibrated noise into its output measurements. Unlike prior works, we assume that the server only has access to an ambiguity set consisting of admissible noise distributions, rather than the exact distribution. To account for this uncertainty, the server formulates a distributionally robust optimal control problem to minimize the worst-case expected cost over all admissible noise distributions. However, the formulated problem is computationally intractable due to the nonconvexity of the ambiguity set. To overcome this, we relax it into a convex Kullback--Leibler divergence ball, so that the reformulated problem admits a tractable closed-form solution.
comment: 6 pages, 3 figures, Submitted to IEEE L-CSS and CDC 2026
NavTrust: Benchmarking Trustworthiness for Embodied Navigation
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
comment: Project Website: https://navtrust.github.io
Markov Potential Game and Multi-Agent Reinforcement Learning for Autonomous Driving
Autonomous driving (AD) requires safe and reliable decision-making among interacting agents, e.g., vehicles, bicycles, and pedestrians. Multi-agent reinforcement learning (MARL) modeled by Markov games (MGs) provides a suitable framework to characterize such agents' interactions during decision-making. Nash equilibria (NEs) are often the desired solution in an MG. However, it is typically challenging to compute an NE in general-sum games, unless the game is a Markov potential game (MPG), which ensures the NE attainability under a few learning algorithms such as gradient play. However, it has been an open question how to construct an MPG and whether these construction rules are suitable for AD applications. In this paper, we provide sufficient conditions under which an MG is an MPG and show that these conditions can accommodate general driving objectives for autonomous vehicles (AVs) using highway forced merge scenarios as illustrative examples. A parameter-sharing neural network (NN) structure is designed to enable decentralized policy execution. The trained driving policy from MPGs is evaluated in both simulated and naturalistic traffic datasets. Comparative studies with single-agent RL and with human drivers whose behaviors are recorded in the traffic datasets are reported, respectively.
Tutorial: Grid-Following Inverter for Electrical Power Grid
The growing use of inverter-based resources in modern power systems has made grid-following inverters a central topic in power-system modeling, control, and simulation. Despite their widespread deployment, introductory material that explains grid-following inverter operation from first principles and connects control design to time-domain simulation remains limited. To address this need, this tutorial presents a circuit-theoretic introduction to the modeling and simulation of a grid-following inverter connected to an electrical power grid. We describe the inverter synchronization with the grid (PLL), power control, and current control structure and show how these elements can be represented within an electromagnetic transient (EMT) simulation framework using companion model-based formulations similar to those used in circuit simulators such as SPICE and Cadence. In this tutorial, we use the grid-following inverter as the primary example to illustrate how its governing equations, control loops, and network interface can be formulated and simulated from first principles. By the end of the document, readers should gain a clear introductory understanding of how to model and simulate a grid-following inverter in an EMT platform.
Exact-Time Safety Recovery using Time-Varying Control Barrier Functions with Optimal Barrier Tracking
This paper is motivated by controllers developed for autonomous vehicles which occasionally result in conditions where safety is no longer guaranteed. We develop an exact-time safety recovery framework for any control-affine nonlinear system when its state is outside a safe region using time-varying Control Barrier Functions (CBFs) with optimal barrier tracking. Unlike conventional formulations that provide only conservative upper bounds on the recovery time, the proposed approach guarantees recovery to the safe set at a prescribed time. The key mechanism is an active barrier tracking condition that forces the barrier function to follow exactly a designer-specified recovery trajectory. This transforms safety recovery into a trajectory design problem. The recovery trajectory is parameterized and optimized to achieve optimal performance while preserving feasibility under input constraints, avoiding the aggressive corrective actions typically induced by conventional finite-time formulations. The safety recovery framework is applied to the roundabout traffic coordination problem for Connected and Automated Vehicles (CAVs), where any initially violated safe merging constraint is replaced by an exact-time recovery barrier constraint to ensure safety guarantee restoration before CAV conflict points are reached. Simulation results demonstrate improved feasibility and performance.
Assessment of Analog Time Multiplexing in SDM Digital to Analog Converters
Analog multiplexing for sigma-delta modulated Digital to Analog Converters has recently been proposed as a means of achieving robustness. This preprint analyses said scheme via simulations. The main limitation introduced by the proposed architecture comes from mismatch in the DAC gains, which can drastically impact performance. A new dynamic element matching technique is proposed here to overcome this problem.
Heart Artifact Removal in Electrohysterography Measurements Using Algebraic Differentiators
Electrohysterography (EHG) enables non-invasive monitoring of uterine contractions but can be contaminated by electrocardiogram (ECG) artifacts. This work presents an ECG removal method using algebraic differentiators, a control-theoretic tool for model-free derivative estimation, that preserves signal shape outside the detected cardiac pulse locations. The differentiator parameters are designed to simultaneously suppress slow physiological artifacts and powerline interference while maximizing output signal-to-noise ratio. Cross-channel clustering distinguishes cardiac pulses from localized artifacts, enabling accurate pulse subtraction without auxiliary ECG references. Implemented as a causal FIR filter, the method is validated on multichannel EHG recordings from female and male subjects and compared to the template subtraction method.
On the Minimum Number of Control Laws for Nonlinear Systems with Input-Output Linearisation Singularities
This paper addresses the fundamental question of determining the minimum number of distinct control laws required for global controllability of nonlinear systems that exhibit singularities in their feedback linearising controllers. We introduce and rigorously prove the (k+1)-Controller Lemma, which establishes that for an nth order single-input single-output nonlinear system with a singularity manifold parameterised by k algebraically independent conditions, exactly k+1 distinct control laws are necessary and sufficient for complete state-space coverage. The sufficiency proof is constructive, employing the approximate linearisation methodology together with transversality arguments from differential topology. The necessity proof proceeds by contradiction, using the Implicit Function Theorem, a dimension-counting argument and structural constraints inherent to the approximate linearisation framework. The result is validated through exhaustive analysis of the ball-and-beam system, a fourth-order mechanical system that exhibits a two-parameter singularity at the third output derivative.
comment: 14
Lightweight Model Predictive Control for Spacecraft Rendezvous Attitude Synchronization
This work introduces two lightweight model predictive control (MPC) approaches for attitude tracking with reaction wheels during spacecraft rendezvous synchronization. Both approaches are based on a novel attitude deviation formulation, which enables the use of inherently linear constraints on angular velocity. We develop a single-loop and a dual-loop MPC; the latter embeds a stabilizing feedback controller within the inner loop, yielding a linear time-invariant system. Both controllers are implemented with CasADi - including automatic code generation - evaluated across various solvers, and validated within the Basilisk astrodynamics simulation framework. The experimental results demonstrate improved tracking accuracy alongside reductions in computational effort and memory consumption. Finally, embedded delivery to an ARM Cortex-M7 - representative of commercial off-the-shelf devices used in New Space platforms - confirms the real-time feasibility of these approaches and highlights their suitability for onboard attitude control in resource-constrained spacecraft rendezvous missions.
comment: Accepted at European Control Conference (ECC 2026)
Safety-Guaranteed Imitation Learning from Nonlinear Model Predictive Control for Spacecraft Close Proximity Operations
This paper presents a safety-guaranteed, runtime-efficient imitation learning framework for spacecraft close proximity control. We leverage Control Barrier Functions (CBFs) for safety certificates and Control Lyapunov Functions (CLFs) for stability as unified design principles across data generation, training, and deployment. First, a nonlinear Model Predictive Control (NMPC) expert enforces CBF constraints to provide safe reference trajectories. Second, we train a neural policy with a novel CBF-CLF-informed loss and DAgger-like rollouts with curriculum weighting, promoting data-efficiency and reducing future safety filter interventions. Third, at deployment a lightweight one-step CBF-CLF quadratic program minimally adjusts the learned control input to satisfy hard safety constraints while encouraging stability. We validate the approach for ESA-compliant close proximity operations, including fly-around with a spherical keep-out zone and final approach inside a conical approach corridor, using the Basilisk high-fidelity simulator with nonlinear dynamics and perturbations. Numerical experiments indicate stable convergence to decision points and strict adherence to safety under the filter, with task performance comparable to the NMPC expert while significantly reducing online computation. A runtime analysis demonstrates real-time feasibility on a commercial off-the-shelf processor, supporting onboard deployment for safety-critical on-orbit servicing.
comment: Accepted at European Control Conference (ECC 2026)
Remarks on Lipschitz-Minimal Interpolation: Generalization Bounds and Neural Network Implementation
This note establishes a theoretical framework for finding (potentially overparameterized) approximations of a function on a compact set with a-priori bounds for the generalization error. The approximation method considered is to choose, among all functions that (approximately) interpolate a given data set, one with a minimal Lipschitz constant. The paper establishes rigorous generalization bounds over practically relevant classes of approximators, including deep neural networks. It also presents a neural network implementation based on Lipschitz-bounded network layers and an augmented Lagrangian method. The results are illustrated for a problem of learning the dynamics of an input-to-state stable system with certified bounds on simulation error.
comment: 9 pages, 3 figures, 3 tables
Coordinating Stakeholders in the Consideration of Performance Indicators and Respective Interface Requirements for Automated Vehicles
This paper presents a process for coordinating stakeholders in their consideration of performance indicators and respective interface requirements for automated vehicles. These performance indicators are obtained and processed based on the system's self-perception and enable the realization of self-aware and self-adaptive vehicles. This is necessary to allow SAE Level 4 vehicles to handle external disturbances as well as internal degradations and failures at runtime. Without such a systematic process for stakeholder coordination, architectural decisions on realizing self-perception become untraceable and effective communication between stakeholders may be compromised. Our process-oriented approach includes necessary ingredients, steps, and artifacts that explicitly address stakeholder communication, traceability, and knowledge transfer through clear documentation. Our approach is based on the experience gained from applying the process in the autotech.agil project, from which we further present lessons learned, identified gaps, and steps for future work.
Real-Time Regulation of Direct Ink Writing Using Model Reference Adaptive Control
Direct Ink Writing (DIW) has gained attention for its potential to reduce printing time and material waste. However, maintaining precise geometry and consistent print quality remains challenging under dynamically varying operating conditions. This paper presents a control-focused approach using a model reference adaptive control (MRAC) strategy based on a reduced-order model (ROM) of extrusion-based 3D printing for a candidate cementitious material system. The proposed controller actively compensates for uncertainties and disturbances by adjusting process parameters in real time, with the objective of minimizing reference-tracking errors. Stability and convergence are rigorously verified via Lyapunov analysis, demonstrating that tracking errors asymptotically approach zero. Performance evaluation under realistic simulation scenarios confirms the effectiveness of the adaptive control framework in maintaining accurate and robust extrusion behavior.
Exact and Approximate Convex Reformulation of Linear Stochastic Optimal Control with Chance Constraints
In this paper, we present an equivalent convex optimization formulation for discrete-time stochastic linear systems subject to linear chance constraints, alongside a tight convex relaxation for quadratic chance constraints. By lifting the state vector to encode moment information explicitly, the formulation captures linear chance constraints on states and controls across multiple time steps exactly, without conservatism, yielding strict improvements in both feasibility and optimality. For quadratic chance constraints, we derive convex approximations that are provably less conservative than existing methods. We validate the framework on minimum-snap trajectory generation for a quadrotor, demonstrating that the proposed approach remains feasible at noise levels an order of magnitude beyond the operating range of prior formulations.
comment: Under Review
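For context, a single linear chance constraint has a well-known deterministic equivalent when the uncertain vector is Gaussian: P(aᵀx ≤ b) ≥ 1 − ε holds exactly when aᵀμ + Φ⁻¹(1 − ε)·sqrt(aᵀΣa) ≤ b. The sketch below checks that condition. The Gaussian assumption, the function name, and the example numbers are assumptions for illustration; this is the textbook reformulation, not the lifted moment formulation described in the abstract.

```python
from math import sqrt
from statistics import NormalDist

def gaussian_chance_constraint_ok(a, mu, sigma, b, eps):
    """Check P(a^T x <= b) >= 1 - eps for x ~ N(mu, sigma).

    Uses the deterministic equivalent
        a^T mu + z * sqrt(a^T sigma a) <= b,  z = Phi^{-1}(1 - eps).
    `sigma` is a covariance matrix given as a list of rows.
    """
    n = len(a)
    mean = sum(a[i] * mu[i] for i in range(n))
    var = sum(a[i] * sigma[i][j] * a[j] for i in range(n) for j in range(n))
    z = NormalDist().inv_cdf(1.0 - eps)   # standard normal quantile
    return mean + z * sqrt(var) <= b

# Hypothetical example: unit-variance first coordinate, 5% violation budget.
identity = [[1.0, 0.0], [0.0, 1.0]]
ok_loose = gaussian_chance_constraint_ok([1.0, 0.0], [0.0, 0.0], identity, 2.0, 0.05)
ok_tight = gaussian_chance_constraint_ok([1.0, 0.0], [0.0, 0.0], identity, 1.5, 0.05)
```

The quantile term acts as a back-off from the bound `b` that grows with the variance along `a`.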
Variational Encrypted Model Predictive Control
We develop a variational encrypted model predictive control (VEMPC) protocol whose online execution relies only on encrypted polynomial operations. The proposed approach reformulates the MPC problem into a sampling-based estimator, in which the computation of the quadratic cost is naturally handled by tilting the sampling distribution, thus reducing online encrypted computation. The resulting protocol requires no additional communication rounds or intermediate decryption, and scales efficiently through two complementary levels of parallelism. We analyze the effect of encryption-induced errors on optimality, and simulation results demonstrate the practical applicability of the proposed method.
comment: 6 pages, 1 figure, 1 table. Submitted to IEEE Control Systems Letters (L-CSS) with CDC option, under review
String stable platoons of all-electric aircraft with operating costs and airspace complexity trade-off
This paper formulates an optimal control framework for computing cruise airspeeds in predecessor-follower platoons of all-electric aircraft that balance operational cost and airspace complexity. To quantify controller workload and coordination effort, a novel pairwise dynamic workload (PDW) function is developed. Within this framework, the optimal airspeed solution is derived for all-electric aircraft under longitudinal wind disturbances. Moreover, an analytical suboptimal solution for heterogeneous platoons with nonlinear aircraft dynamics is determined, for which a general sufficient condition for string stability is formally established. The methodology is validated through case studies of all-electric aircraft operating in air corridors that are suitable for low-altitude advanced/urban air mobility (AAM/UAM) applications. Results show that the suboptimal solution closely approximates the optimal one, while ensuring safe separations, maintaining string stability, and reducing operational cost and airspace complexity. These findings support the development of sustainable and more autonomous air traffic procedures that will enable the implementation of emerging air transportation technologies, such as AAM/UAM, and their integration into the air traffic system environment.
comment: 28 pages, 8 figures
Operational tracking loss in nonautonomous second-order oscillator networks
We study when a network of coupled oscillators with inertia ceases to follow a time-dependent driving protocol coherently, using a simplified graph-based model motivated by inverter-dominated energy systems. We show that this loss of tracking is diagnosed most clearly in the frequency dynamics, rather than in phase-based observables. Concretely, a tracking ratio built from the frequency-disagreement observable $E_\omega(t)$ and normalized by the instantaneous second-order modal decay rate yields a robust protocol-dependent freeze-out time whose relative dispersion decreases with system size. Graph topology matters substantially: the resulting freeze-out time is only partly captured by the algebraic connectivity $\lambda_2$, while additional structural descriptors, particularly Fiedler-mode localization and low-spectrum structure, improve the explanation of graph-to-graph variation. By contrast, phase-sector observables develop strong non-monotonic and underdamped structure, so simple diagonal low-mode relaxation closures are not quantitatively reliable in the same regime. These results identify the frequency sector as the natural operational sector for nonautonomous tracking loss in second-order oscillator networks and clarify both the usefulness and the limits of reduced spectral descriptions in this setting.
comment: 11 pages, 8 figures
Bridging Conformal Prediction and Scenario Optimization: Discarded Constraints and Modular Risk Allocation
Scenario optimization and conformal prediction share a common goal, that is, turning finite samples into safety margins. Yet, different terminology often obscures the connection between their respective guarantees. This paper revisits that connection directly from a systems-and-control viewpoint. Building on the recent conformal/scenario bridge of \citet{OSullivanRomaoMargellos2026}, we extend the forward direction to feasible sample-and-discard scenario algorithms. Specifically, if the final decision is determined by a stable subset of the retained sampled constraints, the classical mean violation law admits a direct exchangeability-based derivation. In this view, discarded samples naturally appear as admissible exceptions. We also introduce a simple modular composition rule that combines several blockwise calibration certificates into a single joint guarantee. This rule proves particularly useful in multi-output prediction and finite-horizon control, where engineers must distribute risk across coordinates, constraints, or prediction steps. Finally, we provide numerical illustrations using a calibrated multi-step tube around an identified predictor. These examples compare alternative stage-wise risk allocations and highlight the resulting performance and safety trade-offs in a standard constraint-tightening problem.
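To make the "finite samples into safety margins" idea concrete, the sketch below computes a standard split-conformal margin from calibration scores and splits a total risk budget across blocks. The function names are hypothetical, and the proportional split is only a union-bound style allocation assumed for illustration; it is not claimed to be the composition rule derived in the paper.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Under exchangeability, a fresh score exceeds this margin with
    probability at most alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf   # too few samples to certify this risk level
    return sorted(scores)[k - 1]

def allocate_risk(alpha_total, weights):
    """Split a total risk budget across blocks in proportion to weights.

    If each block is calibrated at its own level, the union bound gives
    a joint violation probability of at most alpha_total.
    """
    total = sum(weights)
    return [alpha_total * w / total for w in weights]

# Hypothetical calibration scores for one prediction step.
margin = conformal_quantile([9, 1, 8, 2, 7, 3, 6, 4, 5], alpha=0.5)
per_step_alpha = allocate_risk(0.1, [1.0, 1.0])
```

Unequal weights let more of the budget go to steps where a tight margin matters most, at the cost of wider margins elsewhere.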
Safety-Aware Performance Boosting for Constrained Nonlinear Systems
We study a control architecture for nonlinear constrained systems that integrates a performance-boosting (PB) controller with a scheduled Predictive Safety Filter (PSF). The PSF acts as a pre-stabilizing base controller that enforces state and input constraints. The PB controller, parameterized as a causal operator, influences the PSF in two ways: it proposes a performance input to be filtered, and it provides a scheduling signal to adjust the filter's Lyapunov-decrease rate. We prove two main results: (i) Stability by design: any controller adhering to this parametrization maintains closed-loop stability of the pre-stabilized system and inherits PSF safety. (ii) Trajectory-set expansion: the architecture strictly expands the set of safe, stable trajectories achievable by controllers combined with conventional PSFs, which rely on a pre-defined Lyapunov decrease rate to ensure stability. This scheduling allows the PB controller to safely execute complex behaviors, such as transient detours, that are provably unattainable by standard PSF formulations. We demonstrate this expanded capability on a constrained inverted pendulum task with a moving obstacle.
A Control-Theoretic Foundation for Agentic Systems
This paper develops a control-theoretic framework for analyzing agentic systems embedded within feedback control loops, where an AI agent may adapt controller parameters, select among control strategies, invoke external tools, reconfigure decision architectures, and modify control objectives during operation. These capabilities are formalized by interpreting agency as hierarchical runtime decision authority over elements of the control architecture, leading to an augmented closed-loop representation in which physical states, internal memory, tool outputs, interaction signals, and design variables evolve as a coupled dynamical system. A five-level hierarchy of agency is defined, ranging from fixed control laws to runtime synthesis of control architectures and objectives. The analysis shows that increasing agency introduces interacting dynamical mechanisms such as time-varying adaptation, endogenous switching, decision-induced delays, and structural reconfiguration. The framework is developed in both nonlinear and linear settings, providing explicit design constraints for AI-enabled control systems in safety-critical applications.
LMI Optimization Based Multirate Steady-State Kalman Filter Design
This paper presents an LMI-based design framework for multirate steady-state Kalman filters in systems with sensors operating at different sampling rates. The multirate system is formulated as a periodic time-varying system, where the Kalman gains converge to periodic steady-state values that repeat every frame period. Cyclic reformulation transforms this into a time-invariant problem; however, the resulting measurement noise covariance becomes semidefinite rather than positive definite, preventing direct application of standard Riccati equation methods. I address this through a dual LQR formulation with LMI optimization that naturally handles semidefinite covariances. The framework enables multi-objective design, supporting pole placement for guaranteed convergence rates and $l_2$-induced norm constraints for balancing average and worst-case performance. Numerical validation using an automotive navigation system with GPS and wheel speed sensors, including Monte Carlo simulation with 500 independent noise realizations, demonstrates that the proposed filter achieves a position RMSE well below the GPS noise level through effective multirate sensor fusion, and that the LMI solution provides valid upper bounds on the estimation error covariance.
comment: Revised and resubmitted to IEEE ACCESS
Linear Attention for Joint Power Optimization and User-Centric Clustering in Cell-Free Networks
Optimal AP clustering and power allocation are critical in user-centric cell-free massive MIMO systems. Existing deep learning models lack flexibility to handle dynamic network configurations. Furthermore, many approaches overlook pilot contamination and suffer from high computational complexity. In this paper, we propose a lightweight transformer model that overcomes these limitations by jointly predicting AP clusters and powers solely from spatial coordinates of user devices and APs. Our model is architecture-agnostic to user load, handles both clustering and power allocation without channel estimation overhead, and eliminates pilot contamination by assigning users to APs within a pilot reuse constraint. We also incorporate a customized linear attention mechanism to capture user-AP interactions efficiently and enable linear scalability with respect to the number of users. Numerical results confirm the model's effectiveness in maximizing the minimum spectral efficiency and providing near-optimal performance while ensuring adaptability and scalability in dynamic scenarios.
Improving Spatial Allocation for Energy System Coupling with Graph Neural Networks
In energy system analysis, coupling models with mismatched spatial resolutions is a significant challenge. A common solution is assigning weights to high-resolution geographic units for aggregation, but traditional models are limited by using only a single geospatial attribute. This paper presents an innovative method employing a self-supervised Heterogeneous Graph Neural Network to address this issue. This method models high-resolution geographic units as graph nodes, integrating various geographical features to generate physically meaningful weights for each grid point. These weights enhance the conventional Voronoi-based allocation method, allowing it to go beyond simple geographic proximity by incorporating essential geographic information. In addition, the self-supervised learning paradigm overcomes the lack of accurate ground-truth data. Experimental results demonstrate that applying weights generated by this method to cluster-based Voronoi Diagrams significantly enhances scalability, accuracy, and physical plausibility, while increasing precision compared to traditional methods.
comment: Accepted at XXIV Power Systems Computation Conference (PSCC 2026)
Review of Superconducting Qubit Devices and Their Large-Scale Integration
The superconducting qubit quantum computer is one of the most promising quantum computing architectures for large-scale integration due to its maturity and close proximity to the well-established semiconductor manufacturing infrastructure. From an education perspective, it also bridges classical microwave electronics and quantum electrodynamics. In this paper, we will review the basics of quantum computers, superconductivity, and Josephson junctions. We then introduce important technologies and concepts related to DiVincenzo's criteria, which are the necessary conditions for the superconducting qubits to work as a useful quantum computer. Firstly, we will discuss various types of superconducting qubits formed with Josephson junctions, from which we will understand the trade-off across multiple design parameters, including their noise immunity. Secondly, we will discuss different schemes to achieve entanglement gate operations, which are a major bottleneck in achieving more efficient fault-tolerant quantum computing. Thirdly, we will review readout engineering, including the implementations of the Purcell filters and quantum-limited amplifiers. Finally, we will discuss the nature and review the studies of two-level system defects, which are currently the limiting factor of qubit coherence time. DiVincenzo's criteria are only the necessary conditions for a technology to be eligible for quantum computing. To have a useful quantum computer, large-scale integration is required. We will review proposals and developments for the large-scale integration of superconducting qubit devices. By comparing with the application of electronic design automation (EDA) in semiconductors, we will also review the use of EDA in superconducting qubit quantum computer design, which is necessary for its large-scale integration.
From Optimizable to Interactable: Mixed Digital Twin-Empowered Testing of Vehicle-Infrastructure Cooperation Systems
Sufficient testing under corner cases is critical for the long-term operation of vehicle-infrastructure cooperation systems (VICS). However, existing corner-case generation methods are primarily AI-driven, and VICS testing under corner cases is typically limited to simulation. In this paper, we introduce an L5 "Interactable" level to the VICS digital twin (VICS-DT) taxonomy, extending beyond the conventional L4 "Optimizable" level. We further propose an L5-level VICS testing framework, IMPACT (Interactive Mixed-digital-twin Paradigm for Advanced Cooperative vehicle-infrastructure Testing). By enabling direct human interactions with VICS entities, IMPACT incorporates highly uncertain and unpredictable human behaviors into the testing loop, naturally generating high-quality corner cases that complement AI-based methods. Furthermore, the mixedDT-enabled "Physical-Virtual Action Interaction" facilitates safe VICS testing under corner cases, incorporating real-world environments and entities rather than purely in simulation. Finally, we implement IMPACT on the I-VIT (Interactive Vehicle-Infrastructure Testbed), and experiments demonstrate its effectiveness. The experimental videos are available at our project website: https://dongjh20.github.io/IMPACT.
Benchmarking State Space Models, Transformers, and Recurrent Networks for US Grid Forecasting
Selecting the right deep learning model for power grid forecasting is challenging, as performance heavily depends on the data available to the operator. This paper presents a comprehensive benchmark of five modern neural architectures: two state space models (PowerMamba, S-Mamba), two Transformers (iTransformer, PatchTST), and a traditional LSTM. We evaluate these models on hourly electricity demand across six diverse US power grids for forecast windows between 24 and 168 hours. To ensure a fair comparison, we adapt each model with specialized temporal processing and a modular layer that cleanly integrates weather covariates. Our results reveal that there is no single best model for all situations. When forecasting using only historical load, PatchTST and the state space models provide the highest accuracy. However, when explicit weather data is added to the inputs, the rankings reverse: iTransformer improves its accuracy three times more efficiently than PatchTST. By controlling for model size, we confirm that this advantage stems from the architecture's inherent ability to mix information across different variables. Extending our evaluation to solar generation, wind power, and wholesale prices further demonstrates that model rankings depend on the forecast task: PatchTST excels on highly rhythmic signals like solar, while state space models are better suited for the chaotic fluctuations of wind and price. Ultimately, this benchmark provides grid operators with actionable guidelines for selecting the optimal forecasting architecture based on their specific data environments.
comment: 11 pages, 2 figures, 8 tables
Robust Adaptive MPC in the Presence of Nonlinear Time-Varying Uncertainties: An Uncertainty Compensation Approach
This paper introduces an uncertainty compensation-based robust adaptive model predictive control (MPC) framework for linear systems with nonlinear time-varying uncertainties. The framework integrates an L1 adaptive controller to compensate for the matched uncertainty and a robust feedback controller, designed using linear matrix inequalities, to mitigate the effect of unmatched uncertainty on target output channels. Uniform bounds on the errors between the system's states and control inputs and those of a nominal (i.e., uncertainty-free) system are derived. These error bounds are then used to tighten the actual system's state and input constraints, enabling the design of an MPC for the nominal system under these tightened constraints. Referred to as uncertainty compensation-based MPC (UC-MPC), this approach ensures constraint satisfaction while delivering enhanced performance compared to existing methods. Simulation results for a flight control example and a spacecraft landing on an asteroid demonstrate the effectiveness of the proposed framework.
Funnel Control Under Hard and Soft Output Constraints (extended version)
This paper proposes a funnel control method under time-varying hard and soft output constraints. First, an online funnel planning scheme is designed that generates a constraint consistent funnel, which always respects hard (safety) constraints, and soft (performance) constraints are met only when they are not conflicting with the hard constraints. Next, the prescribed performance control method is employed for designing a robust low-complexity funnel-based controller for uncertain nonlinear Euler-Lagrangian systems such that the outputs always remain within the planned constraint consistent funnels. Finally, the results are verified with a simulation example of a mobile robot tracking a moving object while staying in a box-constrained safe space.
comment: 9 pages, 7 figures. Minor revisions: corrected text and mathematical typos, expanded discussion in Section III.A, and added a short appendix on relaxation of an assumption; main results unchanged
2-D Directed Formation Control Based on Bipolar Coordinates
This work proposes a novel 2-D formation control scheme for acyclic triangulated directed graphs (a class of minimally acyclic persistent graphs) based on bipolar coordinates with (almost) global convergence to the desired shape. Prescribed performance control is employed to devise a decentralized control law that avoids singularities and introduces robustness against external disturbances while ensuring predefined transient and steady-state performance for the closed-loop system. Furthermore, it is shown that the proposed formation control scheme can handle formation maneuvering, scaling, and orientation specifications simultaneously. Additionally, the proposed control law is implementable in agents' arbitrarily oriented local coordinate frames using only low-cost onboard vision sensors, which are favorable for practical applications. Finally, a formation maneuvering simulation study verifies the proposed approach.
comment: 16 pages, 10 figures; minor typos corrected; no change in results
Studying the Role of Synthetic Data for Machine Learning-based Wireless Networks Traffic Forecasting
Synthetic data generation is an appealing tool for augmenting and enriching datasets, playing a crucial role in advancing artificial intelligence (AI) and machine learning (ML). Not only does synthetic data help build robust AI/ML datasets cost-effectively, but it also offers privacy-friendly solutions and bypasses the complexities of storing large data volumes. This paper proposes a novel method to generate synthetic data, based on first-order auto-regressive noise statistics, for large-scale Wi-Fi deployments. The approach operates with minimal real data requirements while producing statistically rich traffic patterns that effectively mimic real Access Point (AP) behavior. Experimental results show that ML models trained on synthetic data achieve Mean Absolute Error (MAE) values within 10 to 15 of those obtained using real data when trained on the same APs, while requiring significantly less training data. Moreover, when generalization is required, synthetic-data-trained models improve prediction accuracy by up to 50 percent compared to real-data-trained baselines, thanks to the enhanced variability and diversity of the generated traces. Overall, the proposed method bridges the gap between synthetic data generation and practical Wi-Fi traffic forecasting, providing a scalable, efficient, and real-time solution for modern wireless networks.
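As a rough illustration of what a first-order auto-regressive (AR(1)) noise generator looks like, the sketch below produces a synthetic trace around a fixed mean level. It is not the paper's method: the function name `ar1_trace`, the parameter values, and the constant mean are assumptions chosen only to show the recurrence.

```python
import random


def ar1_trace(n, phi, sigma, mean=0.0, seed=None):
    """Generate n samples of x[t] = mean + phi * (x[t-1] - mean) + eps[t],
    where eps[t] is Gaussian noise with standard deviation sigma."""
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie in (-1, 1) for a stationary process")
    rng = random.Random(seed)
    x = mean  # start the recurrence at the mean level
    trace = []
    for _ in range(n):
        x = mean + phi * (x - mean) + rng.gauss(0.0, sigma)
        trace.append(x)
    return trace


# Hypothetical example: 24 hourly samples fluctuating around a load of 50.
trace = ar1_trace(n=24, phi=0.8, sigma=1.0, mean=50.0, seed=0)
```

A larger `phi` makes consecutive samples more strongly correlated, while `sigma` sets how far the trace wanders from the mean.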
A System Level Approach to LQR Control of the Diffusion Equation
The optimal controller design problem for a linear, first-order spatially-invariant distributed parameter system is considered. Through a case study of the Linear Quadratic Regulator (LQR) problem for the diffusion equation over the torus, it is illustrated that the optimal controller design problem can be equivalently formulated as an optimization problem over the system's closed-loop mappings, analogous to the System Level Synthesis framework. This reformulation is solved analytically to recover the LQR for the diffusion equation, and an internally stable implementation of this controller is recovered from the optimal closed-loop mappings. It is further demonstrated that a class of spatio-temporal constraints on the closed-loop maps can be imposed on this closed-loop formulation while preserving convexity.
comment: 8 pages, 2 figures, Submitted to IEEE American Control Conference 2026
Direct Data-Driven Predictive Control for a Three-dimensional Cable-Driven Soft Robotic Arm
Soft robots offer significant advantages in safety and adaptability, yet achieving precise and dynamic control remains a major challenge due to their inherently complex and nonlinear dynamics. Recently, Data-enabled Predictive Control (DeePC) has emerged as a promising model-free approach that bypasses explicit system identification by directly leveraging input-output data. While DeePC has shown success in other domains, its application to soft robots remains underexplored, particularly for three-dimensional (3D) soft robotic systems. This paper addresses this gap by developing and experimentally validating an effective DeePC framework on a 3D, cable-driven soft arm. Specifically, we design and fabricate a soft robotic arm with a thick tubing backbone for stability, a dense silicone body with large cavities for strength and flexibility, and rigid endcaps for secure termination. Using this platform, we implement DeePC with singular value decomposition (SVD)-based dimension reduction for two key control tasks: fixed-point regulation and trajectory tracking in 3D space. Comparative experiments with a baseline model-based controller demonstrate DeePC's superior accuracy, robustness, and adaptability, highlighting its potential as a practical solution for dynamic control of soft robots.
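DeePC replaces an explicit model with Hankel matrices built from recorded input-output data. The sketch below shows only that data-arrangement step for a scalar signal; the function name `hankel` and the toy data are assumptions, and the SVD-based dimension reduction and the predictive optimization used in the paper are not shown.

```python
def hankel(samples, depth):
    """Build a Hankel matrix with `depth` rows from a scalar data sequence.

    Column j holds the window samples[j : j + depth], so every column is one
    recorded trajectory segment of length `depth`.
    """
    n_cols = len(samples) - depth + 1
    if depth < 1 or n_cols < 1:
        raise ValueError("need 1 <= depth <= len(samples)")
    return [[samples[i + j] for j in range(n_cols)] for i in range(depth)]


u = [1, 2, 3, 4, 5]  # hypothetical recorded input sequence
H = hankel(u, depth=3)
# H == [[1, 2, 3],
#       [2, 3, 4],
#       [3, 4, 5]]
```

In a data-driven predictive controller, future trajectories are expressed as linear combinations of such columns, which is why the number of columns grows with the amount of recorded data.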
AC Dynamics-aware Trajectory Optimization with Binary Enforcement for Adaptive UFLS Design
The high penetration of distributed energy resources, resulting in backfeed of power at the transmission and distribution interface, is causing conventional underfrequency load shedding (UFLS) schemes to become nonconforming. Adaptive schemes that update UFLS relay settings recursively in time offer a solution, but existing adaptive techniques that obtain UFLS relay settings with linearized or reduced-order model formulations fail to capture AC nonlinear network behavior. In practice, this will result in relays unable to restore system frequency during adverse disturbances. We formulate an adaptive UFLS problem as a trajectory optimization and include the full AC nonlinear network dynamics to ensure AC feasibility and time-coordinated control actions. We include binary decisions to model relay switching action and time-delayed multi-stage load-shedding. However, this formulation results in an intractable MINLP problem. To enforce model tractability, we relax these binary variables into continuous surrogates and reformulate the MINLP as a sequence of NLPs. We solve the NLPs with a homotopy-driven method that enforces near-integer-feasible solutions. We evaluate the framework on multiple synthetic transmission systems and demonstrate that it scales efficiently to networks exceeding 1500 nodes with over 170k continuous and 73k binary decision variables, while successfully recovering binary-feasible solutions that arrest the frequency decline during worst-case disturbance.
Energy-efficient torque allocation for straight-line driving of electric vehicles based on pseudoconvex polynomials
Electric vehicles with multiple motors provide flexibility in meeting the driver torque demand, which calls for minimizing the battery energy consumption through torque allocation. In this paper, we present an approach to this problem based on approximating electric motor losses using higher-order polynomials with specific properties. To ensure a well-behaved optimization landscape, monotonicity and positivity constraints are imposed on the polynomial models using sum of squares programming. This methodology provides robustness against noisy or sparse data, while retaining the computational efficiency of a polynomial function approximation. The torque allocation problem based on such polynomials is formulated as a constrained nonlinear optimization problem and solved efficiently using readily available solvers. In the nominal case, the first-order necessary conditions for optimality can also be used to obtain a global solution. The performance of the proposed method is evaluated on several certification driving cycles against a grid search-based benchmark. Results show a modest influence on electric energy consumption, while enabling real-time optimization and integration with other vehicle control systems.
comment: 21 pages, 8 figures
Distributional Uncertainty and Adaptive Decision-Making in System Co-design
Complex engineered systems require coordinated design choices across heterogeneous components under multiple conflicting objectives and uncertain specifications. Monotone co-design provides a compositional framework for such problems by modeling each subsystem as a design problem: a feasible relation between provided functionalities and required resources in partially ordered sets. Existing uncertain co-design models rely on interval bounds, which support worst-case reasoning but cannot represent probabilistic risk or multi-stage adaptive decisions. We develop a distributional extension of co-design that models uncertain design outcomes as distributions over design problems and supports adaptive decision processes through Markov-kernel re-parameterizations. Using quasi-measurable and quasi-universal spaces, we show that the standard co-design interconnection operations remain compositional under this richer notion of uncertainty. We further introduce queries and observations that extract probabilistic design trade-offs, including feasibility probabilities, confidence bounds, and distributions of minimal required resources. A task-driven unmanned aerial vehicle case study illustrates how the framework captures risk-sensitive and information-dependent design choices that interval-based models cannot express.
Recurrent neural network-based robust control systems with regional properties and application to MPC design
This paper investigates the design of output-feedback schemes for systems described by a class of recurrent neural networks. We propose a procedure based on linear matrix inequalities for designing an observer and a static state-feedback controller. The algorithm leverages global and regional incremental input-to-state stability (incremental ISS) and enables the tracking of constant setpoints, ensuring robustness to disturbances and state estimation uncertainty. To address the potential limitations of regional incremental ISS, we introduce an alternative scheme in which the static law is replaced with a tube-based nonlinear model predictive controller (NMPC) that exploits regional incremental ISS properties. We show that these conditions enable the formulation of a robust NMPC law with guarantees of convergence and recursive feasibility, leading to an enlarged region of attraction. Theoretical results are validated through numerical simulations on the pH-neutralisation process benchmark.
comment: 27 pages, 5 figures
Leader-following Consensus over Jointly Connected Switching Networks is Achievable for Exponentially Unstable Linear Systems
The leader-following consensus problem for general linear multi-agent systems over jointly connected switching networks has been a challenging problem and the solvability of the problem has been limited to the class of linear multi-agent systems whose system matrix is marginally stable. This condition is restrictive since it even excludes the most commonly used double-integrator system. This paper presents a breakthrough by demonstrating that leader-following exponential consensus is achievable for general linear multi-agent systems over jointly connected switching networks, even when the system matrix is exponentially unstable. The degree of instability can be explicitly characterized by two key quantities that arise from the jointly connected condition on a switching graph. By exploiting duality, we further show that the output-based distributed observer design problem for a general leader system is solvable over jointly connected switching networks, even when the system matrix is exponentially unstable. This is also in sharp contrast to the existing distributed observers, which rely on the assumption that the leader system is marginally stable.
Structural Monotonicity in Transmission Scheduling for Remote State Estimation with Hidden Channel Mode
This study treats transmission scheduling for remote state estimation over unreliable channels with a hidden mode. A local Kalman estimator selects scheduling actions, such as power allocation and resource usage, and communicates with a remote estimator based on acknowledgement feedback, balancing estimation performance and communication cost. The resulting problem is naturally formulated as a partially observable Markov decision process (POMDP). In settings with observable channel modes, it is well known that monotonicity of the value function can be established by investigating the order-preserving property of transition kernels. In contrast, under partial observability, the transition kernels generally lack this property, which prevents the direct application of standard monotonicity arguments. To overcome this difficulty, we introduce a novel technique, referred to as state-space folding, which induces transformed transition kernels recovering order preservation on the folded space. This transformation enables a rigorous monotonicity analysis in the partially observable setting. As a representative implication, we focus on an associated optimal stopping formulation and show that the resulting optimal scheduling policy admits a threshold structure.
Feasibility Analysis and Constraint Selection in Optimization-Based Controllers
Control synthesis under constraints is at the forefront of research on autonomous systems, in part due to its broad application from low-level control to high-level planning, where computing control inputs is typically cast as a constrained optimization problem. Assessing feasibility of the constraints and selecting among subsets of feasible constraints is a challenging yet crucial problem. In this work, we provide a novel theoretical analysis that yields necessary and sufficient conditions for feasibility assessment of linear constraints, and, based on this analysis, we develop novel methods for feasible constraint selection in the context of control of autonomous systems. Through a series of simulations, we demonstrate that our algorithms achieve performance comparable to state-of-the-art methods while offering improved computational efficiency. Importantly, our analysis provides a novel theoretical framework for assessing, analyzing and handling constraint infeasibility.
comment: 13 pages, 4 figures, submitted to IEEE Transactions on Automatic Control
KAN-Koopman Based Rapid Detection Of Battery Thermal Anomalies With Diagnostics Guarantees
Early diagnosis of battery thermal anomalies is crucial to ensure safe and reliable battery operation by preventing catastrophic thermal failures. Battery diagnostics primarily rely on battery surface temperature measurements and/or estimation of core temperatures. However, aging-induced changes in the battery model and limited training data remain major challenges for model-based and machine-learning-based battery state estimation and diagnostics. To address these issues, we propose a Kolmogorov-Arnold network (KAN) in conjunction with a Koopman-based detection algorithm that leverages the unique advantages of both methods. Firstly, the lightweight KAN provides a model-free estimation of the core temperature to ensure rapid detection of battery thermal anomalies. Secondly, the Koopman operator is learned in real time using the estimated core temperature from KAN and the measured surface temperature of the battery to provide the core and surface temperature prediction for diagnostic residual generation. This online learning approach overcomes the challenges of model changes. Furthermore, we derive analytical conditions to obtain diagnostic guarantees on our KAN-Koopman detection scheme. Our simulation results illustrate a significant reduction in detection time with the proposed algorithm compared to the baseline Koopman-only algorithm.
comment: 9 pages, 1 figure, Accepted to The 2026 American Control Conference
Low-Complexity Control for a Class of Uncertain MIMO Nonlinear Systems under Generalized Time-Varying Output Constraints (extended version)
This paper introduces a novel control framework to address the satisfaction of multiple time-varying output constraints in uncertain high-order MIMO nonlinear control systems. Unlike existing methods, which often assume that the constraints are always decoupled and feasible, our approach can handle coupled time-varying constraints even in the presence of potential infeasibilities. First, it is shown that satisfying multiple constraints essentially boils down to ensuring the positivity of a scalar variable, representing the signed distance from the boundary of the time-varying output-constrained set. To achieve this, a single consolidating constraint is designed that, when satisfied, guarantees convergence to and invariance of the time-varying output-constrained set within a user-defined finite time. Next, a novel robust and low-complexity feedback controller is proposed to ensure the satisfaction of the consolidating constraint. Additionally, we provide a mechanism for online modification of the consolidating constraint to find a least violating solution when the constraints become mutually infeasible for some time. Finally, simulation examples of trajectory and region tracking for a mobile robot validate the proposed approach.
comment: 21 pages, 9 figures (extended version). Minor revisions: corrected text and mathematical typos, updated assumption statements, expanded remarks, extended the discussion at the end of Section III.D, and fixed a minor issue in the proof of Theorem 1; results unchanged
Collaborative Satisfaction of Long-Term Spatial Constraints in Multi-Agent Systems: A Distributed Optimization Approach (extended version)
This paper addresses the problem of collaboratively satisfying long-term spatial constraints in multi-agent systems. Each agent is subject to spatial constraints, expressed as inequalities, which may depend on the positions of other agents with whom they may or may not have direct communication. These constraints need to be satisfied asymptotically or after an unknown finite time. The agents' objective is to collectively achieve a formation that fulfills all constraints. The problem is initially framed as a centralized unconstrained optimization, where the solution yields the optimal configuration by maximizing an objective function that reflects the degree of constraint satisfaction. This function encourages collaboration, ensuring agents help each other meet their constraints while fulfilling their own. When the constraints are infeasible, agents converge to a least-violating solution. A distributed consensus-based optimization scheme is then introduced, which approximates the centralized solution, leading to the development of distributed controllers for single-integrator agents. Finally, simulations validate the effectiveness of the proposed approach.
comment: 10 pages, 6 figures. Typos corrected and some remarks expanded; results unchanged
Robotics
KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition
In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) from initiation through completion, at key moments, unlike existing action instructions that capture kinematics only coarsely or partially, thereby supporting fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation and bi-level reasoning tokens that serve as explicit, supervised intermediate variables aligning language and action. To support this task, we construct kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.
Interpreting Context-Aware Human Preferences for Multi-Objective Robot Navigation
Robots operating in human-shared environments must not only achieve task-level navigation objectives such as safety and efficiency, but also adapt their behavior to human preferences. However, as human preferences are typically expressed in natural language and depend on environmental context, it is difficult to directly integrate them into low-level robot control policies. In this work, we present a pipeline that enables robots to understand and apply context-dependent navigation preferences by combining foundational models with a Multi-Objective Reinforcement Learning (MORL) navigation policy. Thus, our approach integrates high-level semantic reasoning with low-level motion control. A Vision-Language Model (VLM) extracts structured environmental context from onboard visual observations, while Large Language Models (LLM) convert natural language user feedback into interpretable, context-dependent behavioral rules stored in a persistent but updatable rule memory. A preference translation module then maps contextual information and stored rules into numerical preference vectors that parameterize a pretrained MORL policy for real-time navigation adaptation. We evaluate the proposed framework through quantitative component-level evaluations, a user study, and real-world robot deployments in various indoor environments. Our results demonstrate that the system reliably captures user intent, generates consistent preference vectors, and enables controllable behavior adaptation across diverse contexts. Overall, the proposed pipeline improves the adaptability, transparency, and usability of robots operating in shared human environments, while maintaining safe and responsive real-time control.
From Optimizable to Interactable: Mixed Digital Twin-Empowered Testing of Vehicle-Infrastructure Cooperation Systems
Sufficient testing under corner cases is critical for the long-term operation of vehicle-infrastructure cooperation systems (VICS). However, existing corner-case generation methods are primarily AI-driven, and VICS testing under corner cases is typically limited to simulation. In this paper, we introduce an L5 "Interactable" level to the VICS digital twin (VICS-DT) taxonomy, extending beyond the conventional L4 "Optimizable" level. We further propose an L5-level VICS testing framework, IMPACT (Interactive Mixed-digital-twin Paradigm for Advanced Cooperative vehicle-infrastructure Testing). By enabling direct human interactions with VICS entities, IMPACT incorporates highly uncertain and unpredictable human behaviors into the testing loop, naturally generating high-quality corner cases that complement AI-based methods. Furthermore, the mixedDT-enabled "Physical-Virtual Action Interaction" facilitates safe VICS testing under corner cases, incorporating real-world environments and entities rather than purely in simulation. Finally, we implement IMPACT on the I-VIT (Interactive Vehicle-Infrastructure Testbed), and experiments demonstrate its effectiveness. The experimental videos are available at our project website: https://dongjh20.github.io/IMPACT.
Bringing Network Coding into Multi-Robot Systems: Interplay Study for Autonomous Systems over Wireless Communications
Communication is a core enabler for multi-robot systems (MRS), providing the mechanism through which robots exchange state information, coordinate actions, and satisfy safety constraints. While many MRS autonomy algorithms assume reliable and timely message delivery, realistic wireless channels introduce delay, erasures, and ordering stalls that can degrade performance and compromise safety-critical decisions of the robot task. In this paper, we investigate how transport-layer reliability mechanisms that mitigate communication losses and delays shape the autonomy-communication loop. We show that conventional non-coded retransmission-based protocols introduce long delays that are misaligned with the timeliness requirements of MRS applications, and may render the received data irrelevant. As an alternative, we advocate for adaptive and causal network coding, which proactively injects coded redundancy to achieve the desired delay and throughput that enable relevant data delivery to the robotic task. Specifically, this method adapts to channel conditions between robots and causally tunes the communication rates via efficient algorithms. We present two case studies: cooperative localization under delayed and lossy inter-robot communication, and a safety-critical overtaking maneuver where timely vehicle-to-vehicle message availability determines whether an ego vehicle can abort to avoid a crash. Our results demonstrate that coding-based communication significantly reduces in-order delivery stalls, preserves estimation consistency under delay, and improves deadline reliability relative to retransmission-based transport. Overall, the study highlights the need to jointly design autonomy algorithms and communication mechanisms, and positions network coding as a principled tool for dependable multi-robot operation over wireless networks.
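To make the idea of proactively injected coded redundancy concrete, the sketch below uses the simplest possible code: one XOR parity packet per group of equal-length source packets. This is not the adaptive and causal network coding scheme studied in the paper, only a fixed-rate illustration; the function names and packet contents are assumptions.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(packets):
    """Append one parity packet: the XOR of all source packets."""
    return list(packets) + [reduce(xor_bytes, packets)]


def recover(received):
    """Rebuild the source packets when at most one coded packet was erased.

    `received` is the coded list with the erased entry replaced by None.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one erasure")
    if missing:
        present = [p for p in received if p is not None]
        received = list(received)
        # XOR of all surviving packets equals the erased one.
        received[missing[0]] = reduce(xor_bytes, present)
    return received[:-1]  # drop the parity packet


source = [b"pose", b"vel0", b"stop"]  # hypothetical 4-byte robot messages
coded = encode(source)
coded[1] = None  # simulate an erasure on the wireless link
restored = recover(coded)
```

The receiver repairs the loss locally, without waiting a round trip for a retransmission, which is the delay advantage the abstract attributes to coding-based transport.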
P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation
In Vision-and-Language Navigation (VLN), an agent is required to plan a path to the target specified by the language instruction, using its visual observations. Consequently, prevailing VLN methods primarily focus on building powerful planners through visual-textual alignment. However, these approaches often bypass the imperative of comprehensive scene understanding prior to planning, leaving the agent with insufficient perception or prediction capabilities. Thus, we propose P$^{3}$Nav, a novel end-to-end framework integrating perception, prediction, and planning in a unified pipeline to strengthen the VLN agent's scene understanding and boost navigation success. Specifically, P$^{3}$Nav augments perception by extracting complementary cues from object-level and map-level perspectives. Subsequently, our P$^{3}$Nav predicts waypoints to model the agent's potential future states, endowing the agent with intrinsic awareness of candidate positions during navigation. Conditioned on these future waypoints, P$^{3}$Nav further forecasts semantic map cues, enabling proactive planning and reducing the strict reliance on purely historical context. Integrating these perceptual and predictive cues, a holistic planning module finally carries out the VLN tasks. Extensive experiments demonstrate that our P$^{3}$Nav achieves new state-of-the-art performance on the REVERIE, R2R-CE, and RxR-CE benchmarks.
FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation
Existing Vision-Language Navigation (VLN) tasks require agents to follow verbose instructions and ignore potentially useful global spatial priors, limiting their capability to reason about spatial structures. Although human-readable spatial schematics (e.g., floor plans) are ubiquitous in real-world buildings, current agents lack the cognitive ability to comprehend and utilize them. To bridge this gap, we introduce FloorPlan-VLN, a new paradigm that leverages structured semantic floor plans as global spatial priors to enable navigation with only concise instructions. We first construct the FloorPlan-VLN dataset, which comprises over 10k episodes across 72 scenes. It pairs more than 100 semantically annotated floor plans with Matterport3D-based navigation trajectories and concise instructions that omit step-by-step guidance. Then, we propose a simple yet effective method, FP-Nav, that uses a dual-view, spatio-temporally aligned video sequence and auxiliary reasoning tasks to align observations, floor plans, and instructions. When evaluated under this new benchmark, our method significantly outperforms adapted state-of-the-art VLN baselines, achieving more than a 60% relative improvement in navigation success rate. Furthermore, comprehensive noise modeling and real-world deployments demonstrate the feasibility and robustness of FP-Nav to actuation drift and floor plan distortions. These results validate the effectiveness of floor plan guided navigation and highlight FloorPlan-VLN as a promising step toward more spatially intelligent navigation.
SafeLand: Safe Autonomous Landing in Unknown Environments with Bayesian Semantic Mapping
Autonomous landing of uncrewed aerial vehicles (UAVs) in unknown, dynamic environments poses significant safety challenges, particularly near people and infrastructure, as UAVs transition to routine urban and rural operations. Existing methods often rely on prior maps, heavy sensors like LiDAR, static markers, or fail to handle non-cooperative dynamic obstacles like humans, limiting generalization and real-time performance. To address these challenges, we introduce SafeLand, a lean, vision-based system for safe autonomous landing (SAL) that requires no prior information and operates only with a camera and a lightweight height sensor. Our approach constructs an online semantic ground map via deep learning-based semantic segmentation, optimized for embedded deployment and trained on a consolidation of seven curated public aerial datasets (achieving 70.22% mIoU across 20 classes), which is further refined through Bayesian probabilistic filtering with temporal semantic decay to robustly identify metric-scale landing spots. A behavior tree then governs adaptive landing, iteratively validates the spot, and reacts in real time to dynamic obstacles by pausing, climbing, or rerouting to alternative spots, maximizing human safety. We extensively evaluate our method in 200 simulations and 60 end-to-end field tests across industrial, urban, and rural environments at altitudes up to 100m, demonstrating zero false negatives for human detection. Compared to the state of the art, SafeLand achieves sub-second response latency, substantially lower than previous methods, while maintaining a superior success rate of 95%. To facilitate further research in aerial robotics, we release SafeLand's segmentation model as a plug-and-play ROS package, available at https://github.com/markus-42/SafeLand.
Physics-informed Deep Mixture-of-Koopmans Vehicle Dynamics Model with Dual-branch Encoder for Distributed Electric-drive Trucks
Advanced autonomous driving systems require accurate vehicle dynamics modeling. However, identifying a precise dynamics model remains challenging due to strong nonlinearities and the coupled longitudinal and lateral dynamic characteristics. Previous research has employed physics-based analytical models or neural networks to construct vehicle dynamics representations. Nevertheless, these approaches often struggle to simultaneously achieve satisfactory performance in terms of system identification efficiency, modeling accuracy, and compatibility with linear control strategies. In this paper, we propose a fully data-driven dynamics modeling method tailored for complex distributed electric-drive trucks (DETs), leveraging Koopman operator theory to represent highly nonlinear dynamics in a lifted linear embedding space. To achieve high-precision modeling, we first propose a novel dual-branch encoder which encodes dynamic states and provides a powerful basis for the proposed Koopman-based methods entitled KODE. A physics-informed supervision mechanism, grounded in the geometric consistency of temporal vehicle motion, is incorporated into the training process to facilitate effective learning of both the encoder and the Koopman operator. Furthermore, to accommodate the diverse driving patterns of DETs, we extend the vanilla Koopman operator to a mixture-of-Koopman operator framework, enhancing modeling capability. Simulations conducted in a high-fidelity TruckSim environment and real-world experiments demonstrate that the proposed approach achieves state-of-the-art performance in long-term dynamics state estimation.
comment: 13 pages, 8 tables, 7 figures
OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms
Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. Existing systems remain limited in real indoor environments because narrow field-of-view sensing exposes only a partial local scene at each step, often forcing repeated rotations, delaying target discovery, and producing fragmented spatial understanding; meanwhile, directly prompting LLMs with dense 3D maps or exhaustive object lists quickly exceeds the context budget. We present OmniVLN, a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. OmniVLN fuses a rotating LiDAR and panoramic vision into a hardware-agnostic mapping stack, incrementally constructs a five-layer Dynamic Scene Graph (DSG) from mesh geometry to room- and building-level structure, and stabilizes high-level topology through persistent-homology-based room partitioning and hybrid geometric/VLM relation verification. For navigation, the global DSG is transformed into an agent-centric 3D octant representation with multi-resolution spatial attention prompting, enabling the LLM to progressively filter candidate rooms, infer egocentric orientation, localize target objects, and emit executable navigation primitives while preserving fine local detail and compact long-range memory. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline. We will release the code and an omnidirectional multimodal dataset to support reproducible research.
DexEXO: A Wearability-First Dexterous Exoskeleton for Operator-Agnostic Demonstration and Learning
Scaling dexterous robot learning is constrained by the difficulty of collecting high-quality demonstrations across diverse operators. Existing wearable interfaces often trade comfort and cross-user adaptability for kinematic fidelity, while embodiment mismatch between demonstration and deployment requires visual post-processing before policy training. We present DexEXO, a wearability-first hand exoskeleton that aligns visual appearance, contact geometry, and kinematics at the hardware level. DexEXO features a pose-tolerant thumb mechanism and a slider-based finger interface analytically modeled to support hand lengths from 140 mm to 217 mm, reducing operator-specific fitting and enabling scalable cross-operator data collection. A passive hand visually matches the deployed robot, allowing direct policy training from raw wrist-mounted RGB observations. User studies demonstrate improved comfort and usability compared to prior wearable systems. Using visually aligned observations alone, we train diffusion policies that achieve competitive performance while substantially simplifying the end-to-end pipeline. These results show that prioritizing wearability and hardware-level embodiment alignment reduces both human and algorithmic bottlenecks without sacrificing task performance. Project Page: https://dexexo-research.github.io/
comment: https://dexexo-research.github.io/
Physics-informed offline reinforcement learning eliminates catastrophic fuel waste in maritime routing
International shipping produces approximately 3% of global greenhouse gas emissions, yet voyage routing remains dominated by heuristic methods. We present PIER (Physics-Informed, Energy-efficient, Risk-aware routing), an offline reinforcement learning framework that learns fuel-efficient, safety-aware routing policies from physics-calibrated environments grounded in historical vessel tracking data and ocean reanalysis products, requiring no online simulator. Validated on one full year (2023) of AIS data across seven Gulf of Mexico routes (840 episodes per method), PIER reduces mean CO2 emissions by 10% relative to great-circle routing. However, PIER's primary contribution is eliminating catastrophic fuel waste: great-circle routing incurs extreme fuel consumption (>1.5x median) in 4.8% of voyages; PIER reduces this to 0.5%, a 9-fold reduction. Per-voyage fuel variance is 3.5x lower (p<0.001), with bootstrap 95% CI for mean savings [2.9%, 15.7%]. Partial validation against observed AIS vessel behavior confirms consistency with the fastest real transits while exhibiting 23.1x lower variance. Crucially, PIER is forecast-independent: unlike A* path optimization whose wave protection degrades 4.5x under realistic forecast uncertainty, PIER maintains constant performance using only local observations. The framework combines physics-informed state construction, demonstration-augmented offline data, and a decoupled post-hoc safety shield, an architecture that transfers to wildfire evacuation, aircraft trajectory optimization, and autonomous navigation in unmapped terrain.
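The tail metric quoted in this abstract (share of voyages whose fuel use exceeds 1.5x the median) can be illustrated with a minimal sketch. The function name `extreme_fraction` and the sample data are hypothetical, and taking the median over the same sample being scored is an assumption, since the abstract does not say which median is used. From the rounded figures reported, 4.8% versus 0.5% is a ratio of roughly 9.6.

```python
import statistics

def extreme_fraction(fuel, threshold=1.5):
    """Share of voyages whose fuel use exceeds `threshold` times the sample median.

    Assumption: the median is taken over the same sample that is being scored.
    """
    median_fuel = statistics.median(fuel)
    extreme = sum(1 for f in fuel if f > threshold * median_fuel)
    return extreme / len(fuel)

# Hypothetical per-voyage fuel figures (arbitrary units), not data from the paper.
sample = [10.0, 10.0, 10.0, 10.0, 20.0]
tail_share = extreme_fraction(sample)
```

A metric of this form rewards policies that avoid rare, very costly voyages even when mean consumption changes only modestly.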
ReSteer: Quantifying and Refining the Steerability of Multitask Robot Policies
Despite strong multi-task pretraining, existing policies often exhibit poor task steerability. For example, a robot may fail to respond to a new instruction "put the bowl in the sink" when moving towards the oven, executing "close the oven", even though it can complete both tasks when executed separately. We propose ReSteer, a framework to quantify and improve task steerability in multitask robot policies. We conduct an exhaustive evaluation of state-of-the-art policies, revealing a common lack of steerability. We find that steerability is associated with limited overlap among training task trajectory distributions, and introduce a proxy metric to measure this overlap from policy behavior. Building on this insight, ReSteer improves steerability via three components: (i) a steerability estimator that identifies low-steerability states without full-rollout evaluation, (ii) a steerable data generator that synthesizes motion segments from these states, and (iii) a self-refinement pipeline that improves policy steerability using the generated data. In simulation on LIBERO, ReSteer improves steerability by 11% over 18k rollouts. In real-world experiments, we show that improved steerability is critical for interactive use, enabling users to instruct robots to perform any task at any time. We hope this work motivates further study on quantifying steerability and data collection strategies for large robot policies.
comment: Project website: https://resteer-vla.github.io/
Neural Radiance Maps for Extraterrestrial Navigation and Path Planning
Autonomous vehicles such as the Mars rovers currently lead the vanguard of surface exploration on extraterrestrial planets and moons. In order to accelerate the pace of exploration and science objectives, it is critical to plan safe and efficient paths for these vehicles. However, current rover autonomy is limited by a lack of global maps which can be easily constructed and stored for onboard re-planning. Recently, Neural Radiance Fields (NeRFs) have been introduced as a detailed 3D scene representation which can be trained from sparse 2D images and efficiently stored. We propose to use NeRFs to construct maps for online use in autonomous navigation, and present a planning framework which leverages the NeRF map to integrate local and global information. Our approach interpolates local cost observations across global regions using kernel ridge regression over terrain features extracted from the NeRF map, allowing the rover to re-route itself around untraversable areas discovered during online operation. We validate our approach in high-fidelity simulation and demonstrate lower cost and higher percentage success rate path planning compared to various baselines.
comment: Published in the Proceedings of the ION GNSS+ 2023 Conference
Full Stack Navigation, Mapping, and Planning for the Lunar Autonomy Challenge
We present a modular, full-stack autonomy system for lunar surface navigation and mapping developed for the Lunar Autonomy Challenge. Operating in a GNSS-denied, visually challenging environment, our pipeline integrates semantic segmentation, stereo visual odometry, pose graph SLAM with loop closures, and layered planning and control. We leverage lightweight learning-based perception models for real-time segmentation and feature tracking and use a factor-graph backend to maintain globally consistent localization. High-level waypoint planning is designed to promote mapping coverage while encouraging frequent loop closures, and local motion planning uses arc sampling with geometric obstacle checks for efficient, reactive control. We evaluate our approach in the competition's high-fidelity lunar simulator, demonstrating centimeter-level localization accuracy, high-fidelity map generation, and strong repeatability across random seeds and rock distributions. Our solution achieved first place in the final competition evaluation.
comment: Published in the Proceedings of the ION GNSS+ 2025 conference
Visual SLAM with DEM Anchoring for Lunar Surface Navigation
Future lunar missions will require autonomous rovers capable of traversing tens of kilometers across challenging terrain while maintaining accurate localization and producing globally consistent maps. However, the absence of global positioning systems, extreme illumination, and low-texture regolith make long-range navigation on the Moon particularly difficult, as visual-inertial odometry pipelines accumulate drift over extended traverses. To address this challenge, we present a stereo visual simultaneous localization and mapping (SLAM) system that integrates learned feature detection and matching with global constraints from digital elevation models (DEMs). Our front-end employs learning-based feature extraction and matching to achieve robustness to illumination extremes and repetitive terrain, while the back-end incorporates DEM-derived height and surface-normal factors into a pose graph, providing absolute surface constraints that mitigate long-term drift. We validate our approach using both simulated lunar traverse data generated in Unreal Engine and real Moon/Mars analog data collected from Mt. Etna. Results demonstrate that DEM anchoring consistently reduces absolute trajectory error compared to baseline SLAM methods, lowering drift in long-range navigation even in repetitive or visually aliased terrain.
comment: Accepted to IEEE Aerospace Conference 2026
GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes 3DV 2026
Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/.
comment: Accepted by 3DV 2026. Project Page: https://huajian-zeng.github.io/projects/gmt/
A Single-Fiber Optical Frequency Domain Reflectometry (OFDR)-Based Shape Sensing of Concentric Tube Steerable Drilling Robots
This paper introduces a novel shape-sensing approach for Concentric Tube Steerable Drilling Robots (CT-SDRs) based on Optical Frequency Domain Reflectometry (OFDR). Unlike traditional FBG-based methods, OFDR enables continuous strain measurement along the entire fiber length with enhanced spatial resolution. In the proposed method, a Shape Sensing Assembly (SSA) is first fabricated by integrating a single OFDR fiber with a flat NiTi wire. The calibrated SSA is then routed through and housed within the internal channel of a flexible drilling instrument, which is guided by the pre-shaped NiTi tube of the CT-SDR. In this configuration, the drilling instrument serves as a protective sheath for the SSA during drilling, eliminating the need for integration or adhesion to the instrument surface that is typical of conventional optical sensor approaches. The performance of the proposed SSA, integrated within the cannulated CT-SDR, was thoroughly evaluated under free-bending conditions and during drilling along multiple J-shaped trajectories in synthetic Sawbones phantoms. Results demonstrate accurate and reliable shape-sensing capability, confirming the feasibility and robustness of this integration strategy.
comment: 8 pages, 7 figures
Specification-Aware Distribution Shaping for Robotics Foundation Models
Robotics foundation models have demonstrated strong capabilities in executing natural language instructions across diverse tasks and environments. However, they remain largely data-driven and lack formal guarantees on safety and satisfaction of time-dependent specifications during deployment. In practice, robots often need to comply with operational constraints involving rich spatio-temporal requirements such as time-bounded goal visits, sequential objectives, and persistent safety conditions. In this work, we propose a specification-aware action distribution optimization framework that enforces a broad class of Signal Temporal Logic (STL) constraints during execution of a pretrained robotics foundation model without modifying its parameters. At each decision step, the method computes a minimally modified action distribution that satisfies a hard STL feasibility constraint by reasoning over the remaining horizon using forward dynamics propagation. We validate the proposed framework in simulation using a state-of-the-art robotics foundation model across multiple environments and complex specifications.
comment: 8 pages, 3 figures
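To make the kind of specification mentioned in this abstract concrete, the following is a minimal sketch of the standard quantitative (robustness) semantics for two STL operators over a finite, discrete-time trajectory. It is not the paper's optimization method. The 1-D trajectory, the goal and safety predicates, and all function names are hypothetical.

```python
def robustness_eventually(signal, predicate):
    """Robustness of 'eventually predicate' over a finite trajectory: best margin over time."""
    return max(predicate(x) for x in signal)

def robustness_always(signal, predicate):
    """Robustness of 'always predicate' over a finite trajectory: worst margin over time."""
    return min(predicate(x) for x in signal)

# Hypothetical 1-D example: reach within 0.5 of x = 5 at some step, and never exceed x = 6.
def goal_margin(x):
    return 0.5 - abs(x - 5.0)   # positive when inside the goal region

def safety_margin(x):
    return 6.0 - x              # positive when below the safety limit

trajectory = [0.0, 2.0, 4.0, 5.0, 5.5]
reach = robustness_eventually(trajectory, goal_margin)
safe = robustness_always(trajectory, safety_margin)
satisfied = min(reach, safe) > 0.0  # conjunction of the two requirements
```

A positive robustness value means the requirement holds with that much margin, and a negative value measures how badly it is violated.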
RoboForge: Physically Optimized Text-guided Whole-Body Locomotion for Humanoids IROS 2026
While generative models have become effective at producing human-like motions from text, transferring these motions to humanoid robots for physical execution remains challenging. Existing pipelines are often limited by retargeting, where kinematic quality is undermined by physical infeasibility, contact-transition errors, and the high cost of real-world dynamical data. We present a unified latent-driven framework that bridges natural language and whole-body humanoid locomotion through a retarget-free, physics-optimized pipeline. Rather than treating generation and control as separate stages, our key insight is to couple them bidirectionally under physical constraints. We introduce a Physical Plausibility Optimization (PP-Opt) module as the coupling interface. In the forward direction, PP-Opt refines a teacher-student distillation policy with a plausibility-centric reward to suppress artifacts such as floating, skating, and penetration. In the backward direction, it converts reward-optimized simulation rollouts into high-quality explicit motion data, which is used to fine-tune the motion generator toward a more physically plausible latent distribution. This bidirectional design forms a self-improving cycle: the generator learns a physically grounded latent space, while the controller learns to execute latent-conditioned behaviors with dynamical integrity. Extensive experiments on the Unitree G1 humanoid show that our bidirectional optimization improves tracking accuracy and success rates. Across IsaacLab and MuJoCo, the implicit latent-driven pipeline consistently outperforms conventional explicit retargeting baselines in both precision and stability. By coupling diffusion-based motion generation with physical plausibility optimization, our framework provides a practical path toward deployable text-guided humanoid intelligence.
comment: 10 pages, 5 figures, submitted to IROS 2026
DexViTac: Collecting Human Visuo-Tactile-Kinematic Demonstrations for Contact-Rich Dexterous Manipulation
Large-scale, high-quality multimodal demonstrations are essential for robot learning of contact-rich dexterous manipulation. While human-centric data collection systems lower the barrier to scaling, they struggle to capture the tactile information during physical interactions. Motivated by this, we present DexViTac, a portable, human-centric data collection system tailored for contact-rich dexterous manipulation. The system enables the high-fidelity acquisition of first-person vision, high-density tactile sensing, end-effector poses, and hand kinematics within unstructured, in-the-wild environments. Building upon this hardware, we propose a kinematics-grounded tactile representation learning algorithm that effectively resolves semantic ambiguities within tactile signals. Leveraging the efficiency of DexViTac, we construct a multimodal dataset comprising over 2,400 visuo-tactile-kinematic demonstrations. Experiments demonstrate that DexViTac achieves a collection efficiency exceeding 248 demonstrations per hour and remains robust against complex visual occlusions. Real-world deployment confirms that policies trained with the proposed dataset and learning strategy achieve an average success rate exceeding 85% across four challenging tasks. This performance significantly outperforms baseline methods, thereby validating the substantial improvement the system provides for learning contact-rich dexterous manipulation. Project page: https://xitong-c.github.io/DexViTac/.
comment: 9 pages, 9 figures. Project page: https://xitong-c.github.io/DexViTac/
ProbeFlow: Training-Free Adaptive Flow Matching for Vision-Language-Action Models
Recent Vision-Language-Action (VLA) models equipped with Flow Matching (FM) action heads achieve state-of-the-art performance in complex robot manipulation. However, the multi-step iterative ODE solving required by FM introduces inference latency that precludes responsive physical control. While current acceleration efforts optimize the Vision-Language Model (VLM) backbone, the action head bottleneck remains overlooked. To address this, we propose ProbeFlow, a training-free adaptive inference framework tailored for continuous robotic control. By evaluating geometric trajectory complexity via the cosine similarity between initial and lookahead velocity vectors, ProbeFlow dynamically schedules integration steps to prune redundant network evaluations. On the MetaWorld benchmark, it accelerates action decoding by 14.8x (reducing average steps from N = 50 to 2.6) and cuts end-to-end system latency by 2.8x without compromising the manipulation success rate. On the long-horizon LIBERO benchmark, the probe automatically allocates a denser schedule to navigate semantic bottlenecks, effectively resolving the flow solver delay. Real-world physical deployments confirm that ProbeFlow successfully mitigates action decoding latency while ensuring execution stability, offering a highly practical solution for low-latency continuous generative policies.
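The probing idea described in this abstract can be illustrated with a minimal sketch: compare the initial and lookahead velocity vectors by cosine similarity and spend fewer integration steps when they agree. The linear mapping from similarity to step count below is an assumption made for illustration, not the schedule used in the paper, and the names `cosine_similarity` and `choose_num_steps` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def choose_num_steps(v_initial, v_lookahead, min_steps=1, max_steps=50):
    """Map velocity agreement to an integration step budget.

    Assumption: a linear map from (1 - similarity) / 2 to the step range.
    Aligned velocities (a nearly straight flow) get the minimum budget.
    """
    sim = max(-1.0, min(1.0, cosine_similarity(v_initial, v_lookahead)))
    curvature = (1.0 - sim) / 2.0  # 0.0 for aligned, 1.0 for opposite directions
    return min_steps + round(curvature * (max_steps - min_steps))

straight = choose_num_steps([1.0, 0.0], [2.0, 0.0])   # velocities aligned
turning = choose_num_steps([1.0, 0.0], [0.0, 1.0])    # velocities orthogonal
reversed_flow = choose_num_steps([1.0, 0.0], [-1.0, 0.0])
```

The probe costs one extra velocity evaluation, which pays off whenever it lets the solver skip most of a fixed-length schedule.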
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time-unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence, exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0-series Vision-Language-Action (VLA) models. As a plug-and-play replacement for standard flow-matching heads, GeCO improves success rates and efficiency with an optimization-native mechanism for safe deployment. Video and code can be found at https://hrh6666.github.io/GeCO/
comment: 10 pages, 6 figures
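The convergence-based inference described in this abstract can be sketched as a fixed-point iteration on a stationary velocity field that stops once the field norm falls below a tolerance. The learned field is replaced here by a hand-written toy field with a single attractor. The step size, the tolerance, and the names `refine_action` and `toy_field` are assumptions for illustration only.

```python
import math

def field_norm(v):
    """Euclidean norm of a velocity vector."""
    return math.sqrt(sum(x * x for x in v))

def refine_action(action, velocity_field, step_size=0.5, tol=1e-3, max_iters=100):
    """Follow a time-independent velocity field until it (nearly) vanishes.

    Returns the refined action and the number of update steps actually used.
    """
    a = list(action)
    for used in range(max_iters):
        v = velocity_field(a)
        if field_norm(v) < tol:      # converged: exit early
            return a, used
        a = [ai + step_size * vi for ai, vi in zip(a, v)]
    return a, max_iters

def toy_field(a, target=(1.0, 1.0)):
    """Toy stand-in for a learned field: points toward one attractor (hypothetical)."""
    return [t - ai for t, ai in zip(target, a)]

easy_action, easy_iters = refine_action([0.9, 1.0], toy_field)   # starts near the attractor
hard_action, hard_iters = refine_action([-7.0, 1.0], toy_field)  # starts far away
```

In this toy setting the nearby start needs fewer updates than the distant one, which is the adaptive-compute behaviour the abstract describes. The residual `field_norm` at the returned action corresponds to the quantity the abstract proposes as an OOD signal.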
EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
comment: Project page: https://eva-project-page.github.io/
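A reward of the kind described in this abstract (smoothness measured by velocity, acceleration, and jerk, plus a penalty for violating embodiment limits) can be sketched on a single-joint action sequence. The weights, the bounds, the finite-difference scheme, and the function names are all assumptions for illustration. The paper's actual reward and inverse dynamics model are not reproduced here.

```python
def finite_difference(seq, dt):
    """First-order finite difference of a scalar sequence."""
    return [(b - a) / dt for a, b in zip(seq, seq[1:])]

def mean_square(seq):
    """Mean of squared values; 0.0 for an empty sequence."""
    return sum(x * x for x in seq) / len(seq) if seq else 0.0

def executability_reward(actions, dt=0.1, lower=-1.0, upper=1.0,
                         w_vel=1.0, w_acc=0.1, w_jerk=0.01, w_bound=10.0):
    """Higher is better: penalize fast, jerky, or out-of-bound action sequences.

    All weights and limits are hypothetical placeholders.
    """
    vel = finite_difference(actions, dt)
    acc = finite_difference(vel, dt)
    jerk = finite_difference(acc, dt)
    smoothness_cost = (w_vel * mean_square(vel)
                       + w_acc * mean_square(acc)
                       + w_jerk * mean_square(jerk))
    violations = sum(1 for a in actions if a < lower or a > upper)
    return -smoothness_cost - w_bound * violations

smooth_reward = executability_reward([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
jittery_reward = executability_reward([0.0, 0.5, 0.0, 0.5, 0.0, 0.5])
```

Because the score depends only on the decoded actions, it stays informative when the video itself is visually degraded, provided the artifacts show up as erratic or out-of-range actions.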
Huddle: Parallel Shape Assembly using Decentralized, Minimalistic Robots
We propose a novel algorithm for forming arbitrarily shaped assemblies using decentralized robots. By relying on local interactions, the algorithm ensures there are no unreachable states or gaps in the assembly, which are global properties. The in-assembly robots attract passing-by robots into expanding the assembly via a simple implementation of signaling and alignment. Our approach is minimalistic, requiring only communication between attached, immediate neighbors. It is motion-agnostic and requires no pose localization, enabling asynchronous and order-independent assembly. We prove the algorithm's correctness and demonstrate its effectiveness in forming a 107-robot assembly.
comment: 16 pages, 6 figures, submitted to DARS 2026
Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow
In the emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. Particularly, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs & HDVs, significantly enhancing the experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.
VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning
Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a Volumetric Representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, thereby avoiding lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by a substantial 14.8% improvement. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
AERR-Nav: Adaptive Exploration-Recovery-Reminiscing Strategy for Zero-Shot Object Navigation
Zero-Shot Object Navigation (ZSON) in unknown multi-floor environments presents a significant challenge. Recent methods, mostly based on semantic value greedy waypoint selection, spatial topology-enhanced memory, and Multimodal Large Language Model (MLLM) as a decision-making framework, have led to improvements. However, these architectures struggle to balance exploration and exploitation for ZSON when encountering unseen environments, especially in multi-floor settings, with failure modes such as robots getting stuck at narrow intersections, wandering endlessly, or failing to find stair entrances. To overcome these challenges, we propose AERR-Nav, a Zero-Shot Object Navigation framework that dynamically adjusts its state based on the robot's environment. Specifically, AERR-Nav has the following two key advantages: (1) an Adaptive Exploration-Recovery-Reminiscing Strategy that enables robots to dynamically transition between three states, facilitating specialized responses to diverse navigation scenarios; and (2) an Adaptive Exploration State featuring Fast and Slow-Thinking modes that helps robots better balance exploration, exploitation, and higher-level reasoning based on evolving environmental information. Extensive experiments on the HM3D and MP3D benchmarks demonstrate that our AERR-Nav achieves state-of-the-art performance among zero-shot methods. Comprehensive ablation studies further validate the efficacy of our proposed strategy and modules.
Consistency-Driven Dual LSTM Models for Kinematic Control of a Wearable Soft Robotic Arm
In this paper, we introduce a consistency-driven dual LSTM framework for accurately learning both the forward and inverse kinematics of a pneumatically actuated soft robotic arm integrated into a wearable device. This approach effectively captures the nonlinear and hysteretic behaviors of soft pneumatic actuators while addressing the one-to-many mapping challenge between actuation inputs and end-effector positions. By incorporating a cycle consistency loss, we enhance physical realism and improve the stability of inverse predictions. Extensive experiments, including trajectory tracking, ablation studies, and wearable demonstrations, confirm the effectiveness of our method. Results indicate that the inclusion of the consistency loss significantly boosts prediction accuracy and promotes physical consistency over conventional approaches. Moreover, the wearable soft robotic arm demonstrates strong human-robot collaboration capabilities and adaptability in everyday tasks such as object handover, obstacle-aware pick-and-place, and drawer operation. This work underscores the promising potential of learning-based kinematic models for human-centric, wearable robotic systems.
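As a rough illustration of the cycle consistency idea in the abstract above, the sketch below combines a forward-kinematics error, an inverse-kinematics error, and a cycle term that passes a target position through the inverse model and back through the forward model. The toy linear functions stand in for the two LSTMs; the names `dual_model_loss`, `toy_forward`, `toy_inverse`, the weight `lam`, and the exact form of the loss are assumptions made for illustration, not the authors' formulation.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def dual_model_loss(
    forward_model: Callable[[Vector], Vector],  # actuation -> end-effector position
    inverse_model: Callable[[Vector], Vector],  # end-effector position -> actuation
    actuation: Vector,
    position: Vector,
    lam: float = 1.0,  # hypothetical weight on the cycle term
) -> float:
    """Sum of forward error, inverse error, and a weighted cycle term."""
    fk_loss = mse(forward_model(actuation), position)
    ik_loss = mse(inverse_model(position), actuation)
    # Cycle term: position -> predicted actuation -> reconstructed position.
    cycle_loss = mse(forward_model(inverse_model(position)), position)
    return fk_loss + ik_loss + lam * cycle_loss


# Toy stand-ins for the two networks (assumed, for illustration only).
def toy_forward(u: Vector) -> Vector:
    return [2.0 * x for x in u]


def toy_inverse(p: Vector) -> Vector:
    return [0.5 * x for x in p]


loss = dual_model_loss(toy_forward, toy_inverse, [1.0, 2.0], [2.0, 4.0])
```

Because the cycle term is measured in position space, it does not penalize an inverse prediction merely for differing from the recorded actuation, as long as that prediction reproduces the target position. This is one plausible reason such a term helps with one-to-many mappings.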
AgentVLN: Towards Agentic Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, endowing the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art (SOTA) methods on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.
comment: 19 pages, 4 figures
REAL: Robust Extreme Agility via Spatio-Temporal Policy Learning and Physics-Guided Filtering
Extreme legged parkour demands rapid terrain assessment and precise foot placement under highly dynamic conditions. While recent learning-based systems achieve impressive agility, they remain fundamentally fragile to perceptual degradation, where even brief visual noise or latency can cause catastrophic failure. To overcome this, we propose Robust Extreme Agility Learning (REAL), an end-to-end framework for reliable parkour under sensory corruption. Instead of relying on perfectly clean perception, REAL tightly couples vision, proprioceptive history, and temporal memory. We distill a cross-modal teacher policy into a deployable student equipped with a FiLM-modulated Mamba backbone to actively filter visual noise and build short-term terrain memory. Furthermore, a physics-guided Bayesian state estimator enforces rigid-body consistency during high-impact maneuvers. Validated on a Unitree Go2 quadruped, REAL successfully traverses extreme obstacles even with a 1-meter visual blind zone, while strictly satisfying real-time control constraints with a bounded 13.1 ms inference time.
VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs
Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy inputs, (ii) multi-step sampling latency that violates real-time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric $64 \mathrm{m}\times 64\mathrm{m}$ lane-agent vector-graph tiles during rollout. VectorWorld aligns initialization with history-conditioned policies by producing a policy-compatible interaction state via a motion-aware gated VAE. It enables real-time outpainting via solver-free one-step masked completion with an edge-gated relational DiT trained with interval-conditioned MeanFlow and JVP-based large-step supervision. To stabilize long-horizon rollouts, we introduce $Δ$Sim, a physics-aligned non-ego (NPC) policy with hybrid discrete-continuous actions and differentiable kinematic logit shaping. On Waymo Open Motion and nuPlan, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time $1\mathrm{km}+$ closed-loop rollouts. Code: https://github.com/jiangchaokang/VectorWorld
comment: Under Review
Real-Time Online Learning for Model Predictive Control using a Spatio-Temporal Gaussian Process Approximation ICRA
Learning-based model predictive control (MPC) can enhance control performance by correcting for model inaccuracies, enabling more precise state trajectory predictions than traditional MPC. A common approach is to model unknown residual dynamics as a Gaussian process (GP), which leverages data and also provides an estimate of the associated uncertainty. However, the high computational cost of online learning poses a major challenge for real-time GP-MPC applications. This work presents an efficient implementation of an approximate spatio-temporal GP model, offering online learning at constant computational complexity. It is optimized for GP-MPC, where it enables improved control performance by learning more accurate system dynamics online in real-time, even for time-varying systems. The performance of the proposed method is demonstrated by simulations and hardware experiments in the exemplary application of autonomous miniature racing.
comment: to be published at 2026 IEEE International Conference on Robotics & Automation (ICRA)
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
Vision-Language-Action (VLA) models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, so each type has only been applied or optimized in isolation. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. HeiSD includes a retrieval-based SD optimization method, which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, HeiSD uses a kinematic-based fused metric to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x to 2.41x in real-world scenarios, while sustaining a high task success rate.
Multi-material Direct Ink Writing and Embroidery for Stretchable Wearable Sensors
The development of wearable sensing systems for sports performance tracking, rehabilitation, and injury prevention has driven growing demand for smart garments that combine comfort, durability, and accurate motion detection. This paper presents a textile-compatible fabrication workflow that integrates multi-material direct ink writing with automated embroidery to create stretchable strain sensors directly embedded into garments. The process combines sequential multi-material printing of a silicone-carbon grease-silicone stack with automated embroidery that provides both mechanical fixation and electrical interfacing in a single step. The resulting hybrid sensor demonstrates stretchability up to 120% strain while maintaining electrical continuity, with approximately linear behaviour up to 60% strain (R^2 = 0.99), a gauge factor of 31.4, and hysteresis of 22.9%. Repeated loading-unloading tests over 80 cycles show baseline and peak drift of 0.135% and 0.236% per cycle, respectively, indicating moderate cycle-to-cycle stability. Mechanical testing further confirms that the silicone-fabric interface remains intact under large deformation, with failure occurring in the textile rather than at the stitched boundary. As a preliminary proof of concept, the sensor was integrated into wearable elbow and knee sleeves for joint angle monitoring, showing a clear correlation between normalised resistance change and bending angle. By addressing both mechanical fixation and electrical interfacing through embroidery-based integration, this approach provides a reproducible and scalable pathway for incorporating printed stretchable electronics into textile systems for motion capture and soft robotic applications.
comment: 6 pages, 8 figures, conference
HRI-SA: A Multimodal Dataset for Online Assessment of Human Situational Awareness during Remote Human-Robot Teaming
Maintaining situational awareness (SA) is critical in human-robot teams. Yet, under high workload and dynamic conditions, operators often experience SA gaps. Automated detection of SA gaps could provide timely assistance for operators. However, conventional SA measures either disrupt task flow or cannot capture real-time fluctuations, limiting their operational utility. To the best of our knowledge, no publicly available dataset currently supports the systematic evaluation of online human SA assessment in human-robot teaming. To advance the development of online SA assessment tools, we introduce HRI-SA, a multimodal dataset from 30 participants in a realistic search-and-rescue human-robot teaming context, incorporating eye movements, pupil diameter, biosignals, user interactions, and robot data. The experimental protocol included predefined events requiring timely operator assistance, with ground truth SA latency of two types (perceptual and comprehension) systematically obtained by measuring the time between assistance need onset and resolution. We illustrate the utility of this dataset by evaluating standard machine learning models for detecting perceptual SA latencies using generic eye-tracking features and contextual features. Results show that eye-tracking features alone effectively classified perceptual SA latency (recall=88.91%, F1=67.63%) using leave-one-group-out cross-validation, with performance improved through contextual data fusion (recall=91.51%, F1=80.38%). This paper contributes the first public dataset supporting the systematic evaluation of SA throughout a human-robot teaming mission, while also demonstrating the potential of generic eye-tracking features for continuous perceptual SA latency detection in remote human-robot teaming.
comment: This work is currently under peer review
Shifting Uncertainty to Critical Moments: Towards Reliable Uncertainty Quantification for VLA Model
Vision-Language-Action (VLA) models enable general-purpose robotic policies by mapping visual observations and language instructions to low-level actions, but they often lack reliable introspection. A common practice is to compute a token-level uncertainty signal and take its mean over a rollout. However, mean aggregation can dilute short-lived but safety-critical uncertainty spikes in continuous control. In particular, successful rollouts may contain localized high-entropy segments due to benign noise or non-critical micro-adjustments, while failure rollouts can appear low-entropy for most timesteps and only exhibit brief spikes near the onset of failure. We propose a unified uncertainty quantification approach for predicting rollout success versus failure that (1) uses max-based sliding window pooling to preserve transient risk signals, (2) applies motion-aware stability weighting to emphasize high-frequency action oscillations associated with unstable behaviors, and (3) performs DoF-adaptive calibration via Bayesian Optimization to prioritize kinematically critical axes. Experiments on the LIBERO benchmark show that our method substantially improves failure prediction accuracy and yields more reliable signals for failure detection, which can support downstream human-in-the-loop interventions.
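The dilution effect described in the abstract above can be illustrated with a small sketch that compares a rollout-level mean with a max over sliding-window averages. The function names, the window length, the choice to average inside each window before taking the max, and the toy uncertainty traces are all assumptions for illustration; the abstract does not specify the exact pooling rule, and the motion-aware weighting and DoF-adaptive calibration steps are not modeled here.

```python
from statistics import mean
from typing import List


def mean_score(uncertainty: List[float]) -> float:
    """Baseline: average the per-step uncertainty over the whole rollout."""
    return mean(uncertainty)


def windowed_max_score(uncertainty: List[float], window: int = 3) -> float:
    """Average inside each sliding window, then keep the largest window value.

    Averaging inside the window damps single-step noise, while the outer max
    keeps a short burst of high uncertainty from being averaged away.
    """
    if len(uncertainty) < window:
        return mean(uncertainty)
    return max(
        mean(uncertainty[i:i + window])
        for i in range(len(uncertainty) - window + 1)
    )


# Toy traces (made up): a uniformly mildly uncertain success rollout, and a
# failure rollout that is confident until a brief spike at the end.
success_trace = [0.2] * 20
failure_trace = [0.1] * 17 + [0.9, 0.9, 0.9]

mean_gap = mean_score(failure_trace) - mean_score(success_trace)
max_gap = windowed_max_score(failure_trace) - windowed_max_score(success_trace)
```

On these toy traces the two rollouts are nearly indistinguishable under the mean, while the windowed max separates them clearly, which is the behavior the abstract argues for.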
ManiDreams: An Open-Source Library for Robust Object Manipulation via Uncertainty-aware Task-specific Intuitive Physics
Dynamics models, whether simulators or learned world models, have long been central to robotic manipulation, but most focus on minimizing prediction error rather than confronting a more fundamental challenge: real-world manipulation is inherently uncertain. We argue that robust manipulation under uncertainty is fundamentally an integration problem: uncertainties must be represented, propagated, and constrained within the planning loop, not merely suppressed during training. We present and open-source ManiDreams, a modular framework for uncertainty-aware manipulation planning over intuitive physics models. It realizes this integration through composable abstractions for distributional state representation, backend-agnostic dynamics prediction, and declarative constraint specification for action optimization. The framework explicitly addresses three sources of uncertainty: perceptual, parametric, and structural. It wraps any base policy with a sample-predict-constrain loop that evaluates candidate actions against distributional outcomes, adding robustness without retraining. Experiments on ManiSkill tasks show that ManiDreams maintains robust performance under various perturbations where the RL baseline degrades significantly. Runnable examples on pushing, picking, catching, and real-world deployment demonstrate flexibility across different policies, optimizers, physics backends, and executors. The framework is publicly available at https://github.com/Rice-RobotPI-Lab/ManiDreams
comment: 9 pages, 10 figures. Project page at https://manidreams.github.io
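The sample-predict-constrain loop described in the ManiDreams abstract above can be sketched in a few lines. This is not the ManiDreams API: the function `robust_action`, its parameters, the one-dimensional pushing dynamics, and the use of an evenly spaced candidate grid in place of random sampling (chosen here only to keep the example deterministic) are all assumptions for illustration.

```python
from typing import Callable, List


def robust_action(
    base_action: float,
    particles: List[float],                      # samples of the uncertain state
    dynamics: Callable[[float, float], float],   # (state, action) -> next state
    constraint: Callable[[float], bool],         # True if a next state is acceptable
    cost: Callable[[float], float],              # lower is better
    n_candidates: int = 21,
    spread: float = 1.0,
) -> float:
    """Pick the lowest-cost candidate whose predicted outcomes all satisfy the constraint."""
    # "Sample": candidate actions around the base policy's proposal.
    step = 2.0 * spread / (n_candidates - 1)
    candidates = [base_action - spread + k * step for k in range(n_candidates)]

    best, best_cost = base_action, float("inf")  # fall back to the base action
    for action in candidates:
        # "Predict": push every state sample through the dynamics model.
        outcomes = [dynamics(state, action) for state in particles]
        # "Constrain": discard candidates that violate the constraint for any sample.
        if not all(constraint(x) for x in outcomes):
            continue
        expected_cost = sum(cost(x) for x in outcomes) / len(outcomes)
        if expected_cost < best_cost:
            best, best_cost = action, expected_cost
    return best


# Toy 1D push: the object position is uncertain and must not pass x = 1.0.
particles = [0.0, 0.1, 0.2]
chosen = robust_action(
    base_action=1.0,
    particles=particles,
    dynamics=lambda s, a: s + a,
    constraint=lambda x: x <= 1.0,
    cost=lambda x: (x - 1.0) ** 2,
)
```

In this toy setup the base action would push the object past the limit for some state samples, so the wrapper returns a smaller push instead. The base policy itself is left untouched, which mirrors the abstract's point about adding robustness without retraining.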
DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving
Ensuring safe decision-making in autonomous vehicles remains a fundamental challenge despite rapid advances in end-to-end learning approaches. Traditional reinforcement learning (RL) methods rely on manually engineered rewards or sparse collision signals, which fail to capture the rich contextual understanding required for safe driving and make unsafe exploration unavoidable in real-world settings. Recent vision-language models (VLMs) offer promising semantic understanding capabilities; however, their high inference latency and susceptibility to hallucination hinder direct application to real-time vehicle control. To address these limitations, this paper proposes DriveVLM-RL, a neuroscience-inspired framework that integrates VLMs into RL through a dual-pathway architecture for safe and deployable autonomous driving. The framework decomposes semantic reward learning into a Static Pathway for continuous spatial safety assessment using CLIP-based contrasting language goals, and a Dynamic Pathway for attention-gated multi-frame semantic risk reasoning using a lightweight detector and a large VLM. A hierarchical reward synthesis mechanism fuses semantic signals with vehicle states, while an asynchronous training pipeline decouples expensive VLM inference from environment interaction. All VLM components are used only during offline training and are removed at deployment, ensuring real-time feasibility. Experiments in the CARLA simulator show significant improvements in collision avoidance, task success, and generalization across diverse traffic scenarios, including strong robustness under settings without explicit collision penalties. These results demonstrate that DriveVLM-RL provides a practical paradigm for integrating foundation models into autonomous driving without compromising real-time feasibility. Demo video and code are available at: https://zilin-huang.github.io/DriveVLM-RL-website/
comment: 32 pages, 15 figures. Code and demo available online
Proprioceptive-only State Estimation for Legged Robots with Set-Coverage Measurements of Learned Dynamics
Proprioceptive-only state estimation is attractive for legged robots since it is computationally cheaper and is unaffected by perceptually degraded conditions. The history of joint-level measurements contains rich information that can be used to infer the dynamics of the system and subsequently produce navigational measurements. Recent approaches produce these estimates with learned measurement models and fuse them with IMU data under a Gaussian noise assumption. However, this assumption can easily break down with limited training data and render the estimates inconsistent and potentially divergent. In this work, we propose a proprioceptive-only state estimation framework for legged robots that characterizes the measurement noise using set-coverage statements that do not assume any distribution. We develop a practical and computationally inexpensive method to use these set-coverage measurements with a Gaussian filter in a systematic way. We validate the approach in simulation and on two real-world quadrupedal datasets. Comparison with the Gaussian baselines shows that our proposed method remains consistent and is not prone to drift under real noise scenarios.
Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision
Monocular 3D object tracking aims to estimate temporally consistent 3D object poses across video frames, enabling autonomous agents to reason about scene dynamics. However, existing state-of-the-art approaches are fully supervised and rely on dense 3D annotations over long video sequences, which are expensive to obtain and difficult to scale. In this work, we address this fundamental limitation by proposing the first sparsely supervised framework for monocular 3D object tracking. Our approach decomposes the task into two sequential sub-problems: 2D query matching and 3D geometry estimation. Both components leverage the spatio-temporal consistency of image sequences to augment a sparse set of labeled samples and learn rich 2D and 3D representations of the scene. Leveraging these learned cues, our model automatically generates high-quality 3D pseudolabels across entire videos, effectively transforming sparse supervision into dense 3D track annotations. This enables existing fully-supervised trackers to effectively operate under extreme label sparsity. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our method significantly improves tracking performance, achieving an improvement of up to 15.50 p.p. while using at most four ground truth annotations per track.
comment: 22 pages, 8 figures
Offload or Overload: A Platform Measurement Study of Mobile Robotic Manipulation Workloads
Mobile robotic manipulation, the ability of robots to navigate spaces and interact with objects, is a core capability of physical AI. Foundation models have led to breakthroughs in manipulation performance, but at a significant computational cost. We present the first measurement study of mobile robotic manipulation workloads across onboard, edge, and cloud GPU platforms. We find that the full workload stack is infeasible to run on smaller onboard GPUs, while larger onboard GPUs drain robot batteries several hours faster. Offloading alleviates these constraints but introduces its own challenges, as additional network latency degrades task accuracy, and the bandwidth requirement makes naive cloud offloading impractical. Finally, we quantify opportunities and pitfalls of sharing compute across robot fleets. We believe our measurement study will be crucial to designing inference systems for mobile robots.
comment: 15 pages, 17 figures
SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations
Ambiguity poses a major challenge to large language models (LLMs) used as robotic planners. In this letter, we present Scene Graph-Chain-of-Thought (SG-CoT), a two-stage framework where LLMs iteratively query a scene graph representation of the environment to detect and clarify ambiguities. First, a structured scene graph representation of the environment is constructed from input observations, capturing objects, their attributes, and relationships with other objects. Second, the LLM is equipped with retrieval functions to query portions of the scene graph that are relevant to the provided instruction. This grounds the reasoning process of the LLM in the observation, increasing the reliability of robotic planners under ambiguous situations. SG-CoT also allows the LLM to identify the source of ambiguity and pose a relevant disambiguation question to the user or another robot. Extensive experimentation demonstrates that SG-CoT consistently outperforms prior methods, with a minimum of 10% improvement in question accuracy and a minimum success rate increase of 4% in single-agent and 15% in multi-agent environments, validating its effectiveness for more generalizable robot planning.
comment: This work has been submitted to the IEEE Robotics and Automation Letters for possible publication
Manufacturing Micro-Patterned Surfaces with Multi-Robot Systems
Applying micro-patterns to surfaces has been shown to impart useful physical properties such as drag reduction and hydrophobicity. However, current manufacturing techniques cannot produce micro-patterned surfaces at scale due to high-cost machinery and inefficient coverage techniques such as raster-scanning. In this work, we use multiple robots, each equipped with a patterning tool, to manufacture these surfaces. To allow these robots to coordinate during the patterning task, we use the ergodic control algorithm, which specifies coverage objectives using distributions. We demonstrate that robots can divide complicated coverage objectives by communicating compressed representations of their trajectory history both in simulations and experimental trials. Further, we show that robot-produced patterning can lower the coefficient of friction of metallic surfaces. This work demonstrates that distributed multi-robot systems can coordinate to manufacture products that were previously unrealizable at scale.
Rapid Adaptation of Particle Dynamics for Generalized Deformable Object Mobile Manipulation ICRA 2026
We address the challenge of learning to manipulate deformable objects with unknown dynamics. In non-rigid objects, the dynamics parameters define how they react to interactions (how they stretch, bend, compress, and move), and they are critical to determining the optimal actions to perform a manipulation task successfully. In other robotic domains, such as legged locomotion and in-hand rigid object manipulation, state-of-the-art approaches can handle unknown dynamics using Rapid Motor Adaptation (RMA). Through a supervised procedure in simulation that encodes each rigid object's dynamics, such as mass and position, these approaches learn a policy that conditions actions on a vector of latent dynamic parameters inferred from sequences of state-actions. However, in deformable object manipulation, the object's dynamics include not only its mass and position, but also how the shape of the object changes. Our key insight is that the recent ground-truth particle positions of a deformable object in simulation capture changes in the object's shape, making it possible to extend RMA to deformable object manipulation. This key insight allows us to develop RAPiD, a two-phase method that learns to perform real-robot deformable object mobile manipulation by: 1) learning a visuomotor policy conditioned on the object's dynamics embedding, which is encoded from the object's privileged information in simulation, such as its mass and ground-truth particle positions, and 2) learning to infer this embedding using non-privileged information instead, such as robot visual observations and actions, so that the learned policy can transfer to the real world. On a mobile manipulator with 22 degrees of freedom, RAPiD achieves success rates above 80% across two vision-based deformable object mobile manipulation tasks in the real world, under various object dynamics, categories, and instances.
comment: 8 pages, ICRA 2026
ReDAG-RT: Global Rate-Priority Scheduling for Real-Time Multi-DAG Execution in ROS 2
ROS 2 has become a dominant middleware for robotic systems, where perception, estimation, planning, and control pipelines are structured as directed acyclic graphs of callbacks executed under a shared executor. However, default ROS 2 executors use best-effort dispatch without cross-DAG priority enforcement, leading to callback contention, structural priority inversion, and deadline instability under concurrent workloads. These limitations restrict deployment in time-critical and safety-sensitive cyber-physical systems. This paper presents ReDAG-RT, a user-space global scheduling framework for deterministic multi-DAG execution in unmodified ROS 2. The framework introduces a Rate-Priority driven global ready queue that orders callbacks by activation rate, enforces per-DAG concurrency bounds, and mitigates cross-graph priority inversion without modifying the ROS 2 API, executor interface, or underlying operating system scheduler. We formalize a multi-DAG task model for ROS 2 callback pipelines and analyze cross-DAG interference under Rate-Priority scheduling. Response-time recurrences and schedulability conditions are derived within classical Rate-Monotonic theory. Experiments in a ROS 2 Humble environment compare ReDAG-RT against SingleThreadedExecutor and MultiThreadedExecutor using synthetic multi-DAG workloads. Results show up to 29.7 percent reduction in deadline miss rate, 42.9 percent reduction in 99th percentile response time, and 13.7 percent improvement over MultiThreadedExecutor under comparable utilization. Asymmetric per-DAG concurrency bounds further reduce interference by 40.8 percent. These results demonstrate that deterministic and analyzable multi-DAG scheduling can be achieved entirely in the ROS 2 user-space execution layer, providing a practical foundation for real-time robotic middleware in safety-critical systems.
comment: 12 pages, 6 figures
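For readers unfamiliar with the classical Rate-Monotonic theory that the ReDAG-RT abstract above builds on, the textbook response-time recurrence R = C_i + sum over higher-priority tasks j of ceil(R / T_j) * C_j can be iterated to a fixed point as sketched below. This is the standard single-processor, fully preemptive analysis with deadlines equal to periods, not the paper's multi-DAG recurrence; the function name and the task sets are made up for illustration, and blocking from non-preemptive callbacks is ignored.

```python
import math
from typing import List, Optional, Tuple

Task = Tuple[int, int]  # (worst-case execution time C, period T), same time unit


def response_times(tasks: List[Task]) -> List[Optional[int]]:
    """Worst-case response time per task in priority order, or None on a deadline miss."""
    # Rate-monotonic priority: shorter period (higher rate) means higher priority.
    ordered = sorted(tasks, key=lambda task: task[1])
    results: List[Optional[int]] = []
    for i, (c_i, t_i) in enumerate(ordered):
        r = c_i
        while True:
            # Interference from every higher-priority task released within [0, r).
            interference = sum(math.ceil(r / t_j) * c_j for c_j, t_j in ordered[:i])
            r_next = c_i + interference
            if r_next > t_i:      # exceeds the deadline (assumed equal to the period)
                results.append(None)
                break
            if r_next == r:       # fixed point reached
                results.append(r)
                break
            r = r_next
    return results


schedulable = response_times([(1, 4), (2, 6), (3, 12)])
overloaded = response_times([(2, 4), (3, 5)])
```

A task set is schedulable under this analysis when no entry is `None`. The first example set converges for all three tasks, while in the second the lower-rate task cannot finish before its period ends.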
Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting
Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.
GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System
Object-goal navigation has traditionally been limited to ground robots with closed-set object vocabularies. Existing multi-agent approaches depend on precomputed probabilistic graphs tied to fixed category sets, precluding generalization to novel goals at test time. We present GoalVLM, a cooperative multi-agent framework for zero-shot, open-vocabulary object navigation. GoalVLM integrates a Vision-Language Model (VLM) directly into the decision loop, SAM3 for text-prompted detection and segmentation, and SpaceOM for spatial reasoning, enabling agents to interpret free-form language goals and score frontiers via zero-shot semantic priors without retraining. Each agent builds a BEV semantic map from depth-projected voxel splatting, while a Goal Projector back-projects detections through calibrated depth into the map for reliable goal localization. A constraint-guided reasoning layer evaluates frontiers through a structured prompt chain (scene captioning, room-type classification, perception gating, multi-frontier ranking), injecting commonsense priors into exploration. We evaluate GoalVLM on GOAT-Bench val_unseen (360 multi-subtask episodes, 1032 sequential object-goal subtasks, HM3D scenes), where each episode requires navigating to a chain of 5-7 open-vocabulary targets. GoalVLM with N=2 agents achieves 55.8% subtask SR and 18.3% SPL, competitive with state-of-the-art methods while requiring no task-specific training. Ablation studies confirm the contributions of VLM-guided frontier reasoning and depth-projected goal localization.
comment: 8 pages, 5 figures
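The depth-based goal localization described in the GoalVLM abstract rests on standard pinhole back-projection. The following is a minimal sketch of that step only, not the authors' Goal Projector: the function names, the grid layout (camera at the bottom centre, x right, z forward), and the omission of the camera-to-map pose transform are all assumptions made for illustration.

```python
import numpy as np

def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to camera-frame XYZ."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def to_bev_cell(point_cam, cell_size, grid_size):
    """Map a camera-frame point to a (row, col) cell of a square BEV grid.

    Hypothetical layout: x points right, z points forward, and the camera
    sits at the bottom centre of the grid. Returns None outside the grid.
    """
    x, _, z = point_cam
    col = int(np.floor(x / cell_size)) + grid_size // 2
    row = int(np.floor(z / cell_size))
    if 0 <= row < grid_size and 0 <= col < grid_size:
        return row, col
    return None
```

In a full system the camera-frame point would first be transformed by the agent's pose before being written into a shared map; that step is left out here.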
R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation ICLR 2026
A central challenge in image-based Model-Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction-based methods often waste capacity on large task-irrelevant regions. Decoder-free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2-Dreamer, a decoder-free MBRL framework with a self-supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy-reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta-World, R2-Dreamer is competitive with strong baselines such as DreamerV3 and TD-MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC-Subtle with tiny task-relevant objects. These results suggest that an effective internal regularizer can enable versatile, high-performance decoder-free MBRL. Code is available at https://github.com/NM512/r2dreamer.
comment: 20 pages, 12 figures, 2 tables. Published as a conference paper at ICLR 2026. Code available at https://github.com/NM512/r2dreamer
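For readers unfamiliar with the Barlow Twins objective that R2-Dreamer cites as inspiration, here is a minimal NumPy sketch of a redundancy-reduction loss. It follows the general Barlow Twins form (push the cross-correlation matrix of two embedding batches toward the identity). Which two embeddings R2-Dreamer pairs, and the weight `lam`, are not stated in the abstract and are assumptions here.

```python
import numpy as np

def redundancy_reduction_loss(z_a, z_b, lam=5e-3, eps=1e-8):
    """Barlow Twins-style loss between two embedding batches of shape (N, D)."""
    n = z_a.shape[0]
    # Standardise each feature over the batch dimension.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)          # invariance term
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # redundancy term
    return on_diag + lam * off_diag
```

The diagonal term rewards agreement between the two embeddings, while the off-diagonal term discourages different feature dimensions from carrying the same information.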
Final Report for the Workshop on Robotics & AI in Medicine
The CARE Workshop on Robotics and AI in Medicine, held on December 1, 2025 in Indianapolis, convened leading researchers, clinicians, industry innovators, and federal stakeholders to shape a national vision for advancing robotics and artificial intelligence in healthcare. The event highlighted the accelerating need for coordinated research efforts that bridge engineering innovation with real clinical priorities, emphasizing safety, reliability, and translational readiness, and the use of robotics and AI to achieve that readiness. Across keynotes, panels, and breakout sessions, participants underscored critical gaps in data availability, standardized evaluation methods, regulatory pathways, and workforce training that hinder the deployment of intelligent robotic systems in surgical, diagnostic, rehabilitative, and assistive contexts. Discussions emphasized the transformative potential of AI-enabled robotics to improve precision, reduce provider burden, expand access to specialized care, and enhance patient outcomes, particularly in underserved regions and high-risk procedural domains. Special attention was given to austere, disaster relief, and military settings. The workshop demonstrated broad consensus on the urgency of establishing a national Center for AI and Robotic Excellence in medicine (CARE). Stakeholders identified priority research thrusts including human-robot collaboration, trustworthy autonomy, simulation and digital twins, multimodal sensing, and ethical integration of generative AI into clinical workflows. Participants also articulated the need for high-quality datasets, shared test beds, autonomous surgical systems, clinically grounded benchmarks, and sustained interdisciplinary training mechanisms.
comment: 51 pages, 5 figures
Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model
Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world experiments over a diffusion-based baseline, with a single-pass VLM reranking overhead.
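The draft-then-verify selection step can be illustrated with a small sketch. It assumes each candidate action chunk has already been tokenized and scored by a language model, giving one log-probability per token, and that the candidate with the lowest perplexity is kept. Both the tokenization and the "lowest perplexity wins" rule are assumptions; the abstract only says the metric is perplexity-style.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one candidate from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_candidate(candidate_logprobs):
    """Return (index of the lowest-perplexity candidate, list of all scores)."""
    scores = [perplexity(lp) for lp in candidate_logprobs]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Because every candidate is scored independently, the log-probabilities for all drafts can in principle be obtained from one batched forward pass, which matches the single-pass overhead mentioned above.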
Uncovering Latent Phase Structures and Branching Logic in Locomotion Policies: A Case Study on HalfCheetah
In locomotion control tasks, Deep Reinforcement Learning (DRL) has demonstrated high performance; however, the decision-making process of the learned policy remains a black box, making it difficult for humans to understand. On the other hand, in periodic motions such as walking, it is well known that implicit motion phases exist, such as the stance phase and the swing phase. Focusing on this point, this study hypothesizes that a policy trained for locomotion control may also represent a phase structure that is interpretable by humans. To examine this hypothesis in a controlled setting, we consider a locomotion task that is amenable to observing whether a policy autonomously acquires temporally structured phases through interaction with the environment. Specifically, in the MuJoCo locomotion benchmark HalfCheetah-v5, the state transition sequences acquired by a policy trained for walking control through interaction with the environment were aggregated into semantic phases based on state similarity and consistency of subsequent transitions. As a result, we demonstrated that the state sequences generated by the trained policy exhibit periodic phase transition structures as well as phase branching. Furthermore, by approximating the states and actions corresponding to each semantic phase using Explainable Boosting Machines (EBMs), we analyzed phase-dependent decision making, namely, which state features the policy function attends to and how it controls action outputs in each phase. These results suggest that neural network-based policies, which are often regarded as black boxes, can autonomously acquire interpretable phase structures and logical branching mechanisms.
comment: Accepted at XAI-2026: The 4th World Conference on eXplainable Artificial Intelligence
MG-Grasp: Metric-Scale Geometric 6-DoF Grasping Framework with Sparse RGB Observations
Single-view RGB-D grasp detection remains a common choice in 6-DoF robotic grasping systems, which typically requires a depth sensor. While RGB-only 6-DoF grasp methods have been studied recently, their inaccurate geometric representation is not directly suitable for physically reliable robotic manipulation, thereby hindering reliable grasp generation. To address these limitations, we propose MG-Grasp, a novel depth-free 6-DoF grasping framework that achieves high-quality object grasping. Leveraging a two-view 3D foundation model with camera intrinsics and extrinsics, our method reconstructs metric-scale and multi-view consistent dense point clouds from sparse RGB images and generates stable 6-DoF grasps. Experiments on the GraspNet-1Billion dataset and in the real world demonstrate that MG-Grasp achieves state-of-the-art (SOTA) grasp performance among RGB-based 6-DoF grasping methods.
comment: 8 pages, 5 figures
Mimic Intent, Not Just Trajectories
While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to environmental changes and skill transfer. We argue this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavior intent from execution details in end-to-end IL: Mimic Intent, Not just Trajectories (MINT). We achieve this via multi-scale frequency-space tokenization, which enforces a spectral decomposition of action chunk representation. We learn action tokens with a multi-scale coarse-to-fine structure, and force the coarsest token to capture low-frequency global structure and finer tokens to encode high-frequency details. This yields an abstract Intent token that facilitates planning and transfer, and multi-scale Execution tokens that enable precise adaptation to environmental dynamics. Building on this hierarchy, our policy generates trajectories through next-scale autoregression, performing progressive intent-to-execution reasoning, thus boosting learning efficiency and generalization. Crucially, this disentanglement enables one-shot transfer of skills, by simply injecting the Intent token from a demonstration into the autoregressive generation process. Experiments on several manipulation benchmarks and on a real robot demonstrate state-of-the-art success rates, superior inference efficiency, robust generalization against disturbances, and effective one-shot transfer.
comment: Under review
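The idea of splitting an action chunk into a coarse low-frequency part and finer high-frequency parts can be illustrated with a fixed discrete cosine transform. This is only a sketch of a spectral decomposition, not MINT's learned tokenizer: the use of a DCT, the band edges, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size (n, n); its transpose is its inverse."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

def split_scales(chunk, bands=(1, 4)):
    """Split an action chunk of shape (T, action_dim) into frequency bands.

    Returns time-domain reconstructions, coarsest band first. With the
    default (hypothetical) band edges, the first part holds only the DC
    component, i.e. the per-dimension mean of the chunk.
    """
    T = chunk.shape[0]
    C = dct_matrix(T)
    coeffs = C @ chunk  # DCT along the time axis
    edges = [0, *bands, T]
    parts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = np.zeros_like(coeffs)
        masked[lo:hi] = coeffs[lo:hi]
        parts.append(C.T @ masked)  # inverse DCT of this band only
    return parts
```

Summing all returned parts recovers the original chunk, so the coarse part can be swapped between demonstrations while the finer parts are regenerated, which is loosely analogous to the Intent token injection described above.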
Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks
The high cost of collecting real-robot data has made robotic simulation a scalable platform for both evaluation and data generation. Yet most existing benchmarks concentrate on simple manipulation tasks such as pick-and-place, failing to capture the non-Markovian characteristics of real-world tasks and the complexity of articulated object interactions. To address this limitation, we present RuleSafe, a new articulated manipulation benchmark built upon a scalable LLM-aided simulation framework. RuleSafe features safes with diverse unlocking mechanisms, such as key locks, password locks, and logic locks, which require different multi-stage reasoning and manipulation strategies. These LLM-generated rules produce non-Markovian and long-horizon tasks that require temporal modeling and memory-based reasoning. We further propose VQ-Memory, a compact and structured temporal representation that uses vector-quantized variational autoencoders (VQ-VAEs) to encode past proprioceptive states into discrete latent tokens. This representation filters low-level noise while preserving high-level task-phase context, providing lightweight yet robust temporal cues that are compatible with existing Vision-Language-Action models (VLA). Extensive experiments on state-of-the-art VLA models and diffusion policies show that VQ-Memory consistently improves long-horizon planning, enhances generalization to unseen configurations, and enables more efficient manipulation with reduced computational cost. Project page: vqmemory.github.io
comment: 9 pages
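The core lookup in any vector-quantized representation is nearest-codebook assignment. The sketch below shows that step for a sequence of proprioceptive states. It assumes the states have already been encoded to the codebook's dimensionality and uses a fixed codebook; the VQ-VAE encoder, decoder, and training losses used by VQ-Memory are not shown.

```python
import numpy as np

def quantize(states, codebook):
    """Assign each state (T, D) to its nearest codebook entry (K, D).

    Returns the discrete token ids (T,) and the quantized vectors (T, D).
    """
    # Squared Euclidean distance between every state and every code.
    d2 = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)
    return ids, codebook[ids]
```

Small perturbations of a state usually map to the same token id, which is one way to read the abstract's claim that the discrete representation filters low-level noise.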
Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation CVPR 2026
Recent vision-language-action (VLA) models for multi-task robot manipulation often rely on fixed camera setups and shared visual encoders, which limit their performance under occlusions and during cross-task transfer. To address these challenges, we propose Task-aware Virtual View Exploration (TVVE), a framework that learns to select task-relevant virtual camera viewpoints and dynamically re-render observations from a reconstructed scene representation using the selected viewpoints. To enable efficient view selection, we train an exploration policy in a pseudo-environment. In addition, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder that routes visual features to task-specialized experts, mitigating interference in multi-task learning. To evaluate robustness under distribution shifts, we construct RLBench-OG, an out-of-distribution benchmark with visual perturbations and camera pose variations. Experiments on RLBench and RLBench-OG demonstrate that TVVE achieves higher success rates than strong baselines, while real-robot experiments further confirm its robustness to visual disturbances and unseen instructions. Code and visualizations are available at: https://hcplab-sysu.github.io/TAVP.
comment: 24 pages, 15 figures, Project page: https://hcplab-sysu.github.io/TAVP, Code: https://github.com/HCPLab-SYSU/TAVP.git, Accepted at CVPR 2026
Swarm Self Clustering for Communication denied Environments without Global Positioning
In this work, we investigate swarm self-clustering, where robots autonomously organize into spatially coherent groups using only local sensing and decision-making, without external commands, global positioning, or inter-robot communication. Each robot forms and maintains clusters by responding to relative distances from nearby neighbors detected through onboard range sensors with limited fields of view. The method is suited for GPS-denied and communication-constrained environments and requires no prior knowledge of cluster size, number, or membership. A mechanism enables robots to alternate between consensus-based and random goal assignment based on local neighborhood size, ensuring robustness, scalability, and untraceable clustering independent of initial conditions. Extensive simulations and real-robot experiments demonstrate empirical convergence, adaptability to dynamic additions, and improved performance over local-only baselines across standard cluster quality metrics.
comment: 36 Pages, 15 figures, 8 tables, pre-print version
TwinTrack: Bridging Vision and Contact Physics for Real-Time Tracking of Unknown Objects in Contact-Rich Scenes ICRA
Real-time tracking of previously unseen, highly dynamic objects in contact-rich scenes, such as during dexterous in-hand manipulation, remains a major challenge. Pure vision-based approaches often fail under heavy occlusions due to frequent contact interactions and motion blur caused by abrupt impacts. We propose TwinTrack, a physics-aware perception system that enables robust, real-time 6-DoF pose tracking of unknown dynamic objects in contact-rich scenes by leveraging contact physics cues. At its core, TwinTrack integrates Real2Sim and Sim2Real. Real2Sim combines vision and contact physics to jointly estimate object geometry and physical properties: an initial reconstruction is obtained from vision, then refined by learning a geometry residual and simultaneously estimating physical parameters (e.g., mass, inertia, and friction) based on contact dynamics consistency. Sim2Real achieves robust pose estimation by adaptively fusing a visual tracker with predictions from the updated contact dynamics. TwinTrack is implemented on a GPU-accelerated, customized MJX engine to guarantee real-time performance. We evaluate our method on two contact-rich scenarios: object falling with environmental contacts and multi-fingered in-hand manipulation. Results show that, compared to baselines, TwinTrack delivers significantly more robust, accurate, and real-time tracking in these challenging settings, with tracking speeds above 20 Hz. Project page: https://irislab.tech/TwinTrack-webpage/
comment: Accepted by IEEE International Conference on Robotics & Automation (ICRA) 2026
S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight
Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/
Echo Planning for Autonomous Driving: From Current Observations to Future Trajectories and Back
Modern end-to-end autonomous driving systems suffer from a critical limitation: their planners lack mechanisms to enforce temporal consistency between predicted trajectories and evolving scene dynamics. This absence of self-supervision allows early prediction errors to compound catastrophically over time. We introduce Echo Planning (EchoP), a new self-correcting framework that establishes an end-to-end Current - Future - Current (CFC) cycle to harmonize trajectory prediction with scene coherence. Our key insight is that plausible future trajectories should be bi-directionally consistent, i.e., not only generated from current observations but also capable of reconstructing them. The CFC mechanism first predicts future trajectories from the Bird's-Eye-View (BEV) scene representation, then inversely maps these trajectories back to estimate the current BEV state. By enforcing consistency between the original and reconstructed BEV representations through a cycle loss, the framework intrinsically penalizes physically implausible or misaligned trajectories. Experiments on nuScenes show that the proposed method yields competitive performance, reducing L2 error (Avg) by 0.04 m and collision rate by 0.12% compared to one-shot planners. Moreover, EchoP seamlessly extends to closed-loop evaluation, i.e., Bench2Drive, attaining a 26.54% success rate. Notably, EchoP requires no additional supervision: the CFC cycle acts as an inductive bias that stabilizes long-horizon planning. Overall, EchoP offers a simple, deployable pathway to improve reliability in safety-critical autonomous driving.
comment: 12 pages, 4 figures
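The Current - Future - Current cycle described in the EchoP abstract can be sketched as a generic cycle-consistency loss. The two mappings are passed in as plain callables and the loss is a mean squared error; the network architectures, the actual distance used, and the toy linear maps in the example are all assumptions, not the authors' implementation.

```python
import numpy as np

def cycle_consistency_loss(bev, predict_trajectory, reconstruct_bev):
    """Mean squared error between a BEV feature and its reconstruction.

    predict_trajectory: maps the current BEV feature to a future trajectory.
    reconstruct_bev:    maps that trajectory back to an estimated BEV feature.
    """
    trajectory = predict_trajectory(bev)
    bev_reconstructed = reconstruct_bev(trajectory)
    loss = float(np.mean((bev_reconstructed - bev) ** 2))
    return loss, trajectory

# Toy example with hypothetical linear maps standing in for the two networks.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 6))
bev = rng.normal(size=6)
forward = lambda b: W @ b
loss_consistent, _ = cycle_consistency_loss(bev, forward, lambda t: np.linalg.solve(W, t))
loss_inconsistent, _ = cycle_consistency_loss(bev, forward, lambda t: np.zeros_like(t))
```

In the toy example the exact inverse yields a loss near zero, while an inverse that ignores the trajectory is penalized. In practice a trajectory is much lower-dimensional than a BEV map, so the reconstruction can only be approximate.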
See, Plan, Cut: MPC-Based Autonomous Volumetric Robotic Laser Surgery with OCT Guidance
Robotic laser systems offer the potential for sub-millimeter, non-contact, high-precision tissue resection, yet existing platforms lack volumetric planning and intraoperative feedback. We present RATS (Robot-Assisted Tissue Surgery), an intelligent opto-mechanical, optical coherence tomography (OCT)-guided robotic platform designed for autonomous volumetric soft tissue resection in surgical applications. RATS integrates macro-scale RGB-D imaging, micro-scale OCT, and a fiber-coupled surgical laser, calibrated through a novel multistage alignment pipeline that achieves OCT-to-laser calibration accuracy of 0.161 ± 0.031 mm on tissue phantoms and ex vivo porcine tissue. A super-Gaussian laser-tissue interaction (LTI) model characterizes ablation crater morphology with an average RMSE of 0.231 ± 0.121 mm, outperforming Gaussian baselines. A sampling-based model predictive control (MPC) framework operates directly on OCT voxel data to generate constraint-aware resection trajectories with closed-loop feedback, achieving 0.842 mm RMSE and improving intersection-over-union agreement by 64.8% compared to feedforward execution. With OCT, RATS detects subsurface structures and modifies the planner's objective to preserve them, demonstrating clinical feasibility.
comment: 9 pages, 8 figures
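A super-Gaussian profile generalizes the Gaussian by raising the exponent's argument to a power, which flattens the top and steepens the walls of the curve. The sketch below uses one common parametrization for a radially symmetric crater depth. The exact form and parameters of the RATS laser-tissue interaction model are not given in the abstract, so the names `peak_depth`, `sigma`, and `order` are assumptions.

```python
import numpy as np

def super_gaussian_depth(r, peak_depth, sigma, order):
    """Crater depth at radial distance r under a super-Gaussian profile.

    order = 1 recovers the ordinary Gaussian; order > 1 gives a flatter
    floor and steeper walls.
    """
    r = np.asarray(r, dtype=float)
    return peak_depth * np.exp(-((r ** 2) / (2.0 * sigma ** 2)) ** order)
```

Fitting `peak_depth`, `sigma`, and `order` to measured crater cross-sections, for example by least squares, would be one way to compare such a model against a plain Gaussian baseline.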
AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to the project page (https://automot-website.github.io/) for the demonstration videos and qualitative results.
IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping
Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.
comment: The reason for this withdrawal is that the current version was submitted without the final review and formal authorization of all co-authors. To ensure the academic consensus and integrity of our research group, we have decided to withdraw this submission from the repository
ViSA: Visited-State Augmentation for Generalized Goal-Space Contrastive Reinforcement Learning
Goal-Conditioned Reinforcement Learning (GCRL) is a framework for learning a policy that can reach arbitrarily given goals. In particular, Contrastive Reinforcement Learning (CRL) provides a framework for policy updates using an approximation of the value function estimated via contrastive learning, achieving higher sample efficiency compared to conventional methods. However, since CRL treats the visited state as a pseudo-goal during learning, it can accurately estimate the value function only for limited goals. To address this issue, we propose a novel data augmentation approach for CRL called ViSA (Visited-State Augmentation). ViSA consists of two components: 1) generating augmented state samples, with the aim of augmenting hard-to-visit state samples during on-policy exploration, and 2) learning a consistent embedding space, which uses an augmented state as auxiliary information to regularize the embedding space by reformulating the objective function of the embedding space based on mutual information. We evaluate ViSA in simulation and real-world robotic tasks and show improved goal-space generalization, which permits accurate value estimation for hard-to-visit goals. Further details can be found on the project page: https://issa-n.github.io/projectPage_ViSA/
comment: 8 pages, 7 figures, under Review
DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping
To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose DexGrasp-Zero, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand's kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning. MAGCN incorporates a Physical Property Injection mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82% average success rate on unseen objects.
SimScale: Learning to Drive via Real-World Simulation at Scale CVPR 2026
Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +8.6 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Simulation data and code have been released at https://github.com/OpenDriveLab/SimScale.
comment: Accepted to CVPR 2026. Project page: https://opendrivelab.com/SimScale
OGScene3D: Incremental Open-Vocabulary 3D Gaussian Scene Graph Mapping for Scene Understanding
Open-vocabulary scene understanding is crucial for robotic applications, enabling robots to comprehend complex 3D environmental contexts and supporting various downstream tasks such as navigation and manipulation. However, existing methods require pre-built complete 3D semantic maps to construct scene graphs for scene understanding, which limits their applicability in robotic scenarios where environments are explored incrementally. To address this challenge, we propose OGScene3D, an open-vocabulary scene understanding system that achieves accurate 3D semantic mapping and scene graph construction incrementally. Our system employs a confidence-based Gaussian semantic representation that jointly models semantic predictions and their reliability, enabling robust scene modeling. Building on this representation, we introduce a hierarchical 3D semantic optimization strategy that achieves semantic consistency through local correspondence establishment and global refinement, thereby constructing globally consistent semantic maps. Moreover, we design a long-term global optimization method that leverages temporal memory of historical observations to enhance semantic predictions. By integrating 2D-3D semantic consistency with Gaussian rendering contribution, this method continuously refines the semantic understanding of the entire scene. Furthermore, we develop a progressive graph construction approach that dynamically creates and updates both nodes and semantic relationships, allowing continuous updating of the 3D scene graphs. Extensive experiments on widely used datasets and real-world scenes demonstrate the effectiveness of our OGScene3D on open-vocabulary scene understanding.
PACE: Physics Augmentation for Coordinated End-to-end Reinforcement Learning toward Versatile Humanoid Table Tennis
Humanoid table tennis (TT) demands rapid perception, proactive whole-body motion, and agile footwork under strict timing, capabilities that remain difficult for end-to-end control policies. We propose a reinforcement learning (RL) framework that maps ball-position observations directly to whole-body joint commands for both arm striking and leg locomotion, strengthened by predictive signals and dense, physics-guided rewards. A lightweight learned predictor, fed with recent ball positions, estimates future ball states and augments the policy's observations for proactive decision-making. During training, a physics-based predictor supplies precise future states to construct dense, informative rewards that lead to effective exploration. The resulting policy attains strong performance across varied serve ranges (hit rate ≥ 96% and success rate ≥ 92%) in simulations. Ablation studies confirm that both the learned predictor and the predictive reward design are critical for end-to-end learning. Deployed zero-shot on a physical Booster T1 humanoid with 23 revolute joints, the policy produces coordinated lateral and forward-backward footwork with accurate, fast returns, suggesting a practical path toward versatile, competitive humanoid TT. We have open-sourced our RL training code at: https://github.com/purdue-tracelab/TTRL-ICRA2026
Grounding Robot Generalization in Training Data via Retrieval-Augmented VLMs
Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize how different evaluation settings actually represent generalization from the training distribution of a given policy. To work towards more precise evaluation of generalization in robotics, we propose RADAR, a scalable framework for directly comparing test-time evaluation tasks to policy training data, to determine what form of policy generalization is required. RADAR consists of a two-stage pipeline: first, retrieval using generalist policy embeddings identifies which training examples are relevant for a given evaluation task. Next, vision-language models (VLMs) analyze the evaluation task against the retrieved data, outputting interpretable analysis on how they compare along a variety of axes, and an overall classification of what type of policy generalization is required. Through controlled experiments, we demonstrate that VLMs are effective at analyzing data for generalization, and that our retrieval step effectively identifies examples needed to make accurate classifications with respect to the training data. Furthermore, we scale RADAR to large-scale datasets, where we observe agreement with human-defined benchmark conditions from prior work. We provide demonstrations at radar-analysis.github.io.
comment: 12 pages
MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning ICLR 2026
We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions, limiting exploration of high-reward states in the target domain that do not fall within these regions. Consequently, such methods often fail when the dynamics shift is significant or the optimal trajectories lie outside the low-shift regions. To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using learned target dynamics transitions to explore the target domain, rather than only being trained with the low dynamics-shift transitions. For the dynamics learning, built on the observation that achieving the same next state requires taking different actions in different domains, MOBODY employs separate action encoders for each domain to encode different actions to the shared latent space while sharing a unified representation of states and a common transition function. We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions, which pushes the policy toward actions with high target-domain Q-values, rather than high source-domain Q-values or uniformly imitating all actions in the offline dataset. We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.
comment: Published at ICLR 2026
LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency CVPR2026
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Our code and data will be available at https://github.com/DavidYan2001/LaS-Comp.
comment: Accepted by CVPR2026
Dynamic-ICP: Doppler-Aware Iterative Closest Point Registration for Dynamic Scenes
Reliable odometry in highly dynamic environments remains challenging when it relies on ICP-based registration: ICP assumes near-static scenes and degrades in repetitive or low-texture geometry. We introduce Dynamic-ICP, a Doppler-aware registration framework. The method (i) estimates ego motion from per-point Doppler velocity via robust regression and builds a velocity filter, (ii) clusters dynamic objects and reconstructs object-wise translational velocities from ego-compensated radial measurements, (iii) predicts dynamic points with a constant-velocity model, and (iv) aligns scans using a compact objective that combines a point-to-plane geometry residual with a translation-invariant, rotation-only Doppler residual. The approach requires no external sensors or sensor-vehicle calibration and operates directly on FMCW LiDAR range and Doppler velocities. We evaluate Dynamic-ICP on three datasets (HeRCULES, HeLiPR, and AevaScenes), focusing on highly dynamic scenes. Dynamic-ICP consistently improves rotational stability and translation accuracy over the state-of-the-art methods. Our approach is also simple to integrate into existing pipelines, runs in real time, and provides a lightweight solution for robust registration in dynamic environments. To encourage further research, the code is available at: https://github.com/JMUWRobotics/Dynamic-ICP.
comment: 8 pages, 5 figures
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions ICRA 2026
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: To appear at ICRA 2026; sample code for the navigation example with CBF-RL reward core construction can be found at https://github.com/lzyang2000/cbf-rl-navigation-demo
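The closed-form safety filtering mentioned in the CBF-RL abstract can be illustrated with the standard single-constraint CBF filter. The following is a minimal sketch, not the paper's formulation: it assumes single-integrator dynamics (the state derivative equals the action), a circular obstacle, the barrier h(x) = ||x - c||^2 - r^2, and a hypothetical helper name `cbf_filter`. The filter returns the action closest to the nominal one that satisfies grad_h(x) . u + alpha * h(x) >= 0.

```python
def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h . u + alpha * h >= 0 holds.

    Assumptions (illustrative only): dynamics x_dot = u, and the barrier
    h(x) = ||x - center||^2 - radius^2 for a circular obstacle.
    """
    # Gradient of h with respect to x, and the barrier value itself.
    a = [2.0 * (xi - ci) for xi, ci in zip(x, center)]
    h = sum((xi - ci) ** 2 for xi, ci in zip(x, center)) - radius ** 2

    # Constraint value under the nominal action; non-negative means safe.
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)

    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        # Degenerate case (x exactly at the obstacle center): no correction
        # direction is defined, so the nominal action is returned unchanged.
        return list(u_nom)

    # Closed-form projection onto the half-space {u : a . u + alpha * h >= 0}.
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


# Robot at (2, 0) driving straight at a unit-radius obstacle at the origin.
u_safe = cbf_filter(x=(2.0, 0.0), u_nom=(-5.0, 0.0), center=(0.0, 0.0), radius=1.0)
print(u_safe)
```

In this example the nominal action violates the constraint, so the filter scales back the approach speed until the constraint holds with equality; an action that already satisfies the constraint is returned unchanged.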
NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation
Social navigation requires robots to act safely in dynamic human environments. Effective behavior demands thinking ahead: reasoning about how the scene and pedestrians evolve under different robot actions rather than reacting to current observations alone. This creates a coupled prediction-planning challenge, where robot actions and human motion mutually influence each other. To address this challenge, we propose NavThinker, a future-aware framework that couples an action-conditioned world model with on-policy reinforcement learning. The world model operates in the Depth Anything V2 patch feature space and performs autoregressive prediction of future scene geometry and human motion; multi-head decoders then produce future depth maps and human trajectories, yielding a future-aware state aligned with traversability and interaction risk. Crucially, we train the policy with DD-PPO while injecting world-model think-ahead signals via: (i) action-conditioned future features fused into the current observation embedding and (ii) social reward shaping from predicted human trajectories. Experiments on single- and multi-robot Social-HM3D show state-of-the-art navigation success, with zero-shot transfer to Social-MP3D and real-world deployment on a Unitree Go2, validating generalization and practical applicability. Webpage: https://hutslib.github.io/NavThinker.
Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning ICLR 2026
Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. Therefore, in this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited by their strong threat models, which assume the adversary has near full control over an agent's training pipeline, enabling them to both alter and observe the agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack, "Daze", which is able to reliably and stealthily implant backdoors into RL agents trained for real-world tasks without altering or even observing their rewards. We provide a formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks along with extensive empirical evaluations on both discrete and continuous action space domains. We additionally provide the first example of RL backdoor attacks transferring to real robotic hardware. These developments motivate further research into securing all components of the RL training pipeline to prevent malicious attacks.
comment: 10 pages main body, ICLR 2026
Aion: Towards Hierarchical 4D Scene Graphs with Temporal Flow Dynamics ICRA 2026
Autonomous navigation in dynamic environments requires spatial representations that capture both semantic structure and temporal evolution. 3D Scene Graphs (3DSGs) provide hierarchical multi-resolution abstractions that encode geometry and semantics, but existing extensions toward dynamics largely focus on individual objects or agents. In parallel, Maps of Dynamics (MoDs) model typical motion patterns and temporal regularities, yet are usually tied to grid-based discretizations that lack semantic awareness and do not scale well to large environments. In this paper we introduce Aion, a framework that embeds temporal flow dynamics directly within a hierarchical 3DSG, effectively incorporating the temporal dimension. Aion employs a graph-based sparse MoD representation to capture motion flows over arbitrary time intervals and attaches them to navigational nodes in the scene graph, yielding more interpretable and scalable predictions that improve planning and interaction in complex dynamic environments. We provide the code at https://github.com/IacopomC/aion
comment: Accepted at ICRA 2026, 8 pages
SAATT Nav: a Socially Aware Autonomous Transparent Transportation Navigation Framework for Wheelchairs IROS 2026
While powered wheelchairs reduce physical fatigue compared with manual wheelchairs for individuals with mobility impairment, they demand high cognitive workload due to information processing, decision making and motor coordination. Current autonomous systems lack social awareness in navigation and transparency in decision-making, leading to decreased perceived safety and trust from the user and others in context. This work proposes the Socially Aware Autonomous Transparent Transportation (SAATT) Navigation framework for wheelchairs as a potential solution. By using a Large Language Model (LLM) that is informed of user intent and capable of predicting other people's intent as the decision-maker for its local controller, the framework is able to detect and navigate social situations, such as passing pedestrians or a pair conversing. Furthermore, the LLM textually communicates its reasoning at each waypoint for transparency. In our experiments, it is compared against a standard global planner, a representative competing social navigation model, and an ablated variant in three simulated environments varied by social level, using eight metrics categorized under Safety, Social Compliance, Efficiency, and Comfort. Overall, SAATT Nav outperforms the alternatives in most social situations and performs equivalently or only slightly worse on the remaining metrics, demonstrating the potential of a socially aware and transparent autonomous navigation system to assist wheelchair users.
comment: 8 pages, 4 figures, 2 tables, 1 algorithm. Submitted to IROS 2026
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training CVPR2026
Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose RehearseVLA, an RL-based post-training framework that replaces physical interaction with a low-cost world model-based virtual simulator. RehearseVLA consists of two key components: (1) a physically-consistent world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that RehearseVLA effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings. Our code is available at https://github.com/amap-cvlab/world-env.
comment: Accepted to CVPR2026
Safety Case Patterns for VLA-based driving systems: Insights from SimLingo
Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving: by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and their potential to support socially responsible autonomous driving and to understand high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For instance, the integration of open-ended natural language inputs (e.g., user or navigation instructions) into the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.
ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly
Precision assembly requires sub-millimeter corrections in contact-rich "last-millimeter" regions where visual feedback fails due to occlusion from the end-effector and workpiece. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three synergistic mechanisms: (i) bidirectional cross-attention enabling reciprocal visuo-tactile feature enhancement before fusion, (ii) a proprioception-conditioned gating network that dynamically elevates tactile reliance when visual occlusion occurs, and (iii) a tactile reconstruction objective enforcing learning of manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baseline methods, and maintains 80% success at industrial-grade 0.1mm clearance. Ablation studies validate that each architectural component is indispensable. The ReTac-ACT codebase and a vision-tactile demonstration dataset covering various clearance levels with both visual and tactile features will be released to support reproducible research.
U-ARM : Ultra low-cost general teleoperation interface for robot manipulation
We propose U-Arm, a low-cost and rapidly adaptable leader-follower teleoperation framework designed to interface with most commercially available robotic arms. Our system supports teleoperation through three structurally distinct 3D-printed leader arms that share consistent control logic, enabling seamless compatibility with diverse commercial robot configurations. Compared with previous open-source leader-follower interfaces, we further optimized both the mechanical design and servo selection, achieving a bill of materials (BOM) cost of only $50.5 for the 6-DoF leader arm and $56.8 for the 7-DoF version. To enhance usability, we mitigate the common challenge in controlling redundant degrees of freedom by mechanical and control optimizations. Experimental results demonstrate that U-Arm achieves 39% higher data collection efficiency and comparable task success rates across multiple manipulation scenarios compared with Joycon, another low-cost teleoperation interface. We have open-sourced all CAD models of the three configurations and also provided simulation support for validating teleoperation workflows. We also open-sourced real-world manipulation data collected with U-Arm. The project website is https://github.com/MINT-SJTU/LeRobot-Anything-U-Arm.
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation CVPR 2026
Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers -- guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
comment: Accepted to CVPR 2026. Code is available at https://github.com/AutoCompSysLab/ContextNav
Latent Representations for Visual Proprioception in Inexpensive Robots
Robotic manipulation requires explicit or implicit knowledge of the robot's joint positions. Precise proprioception is standard in high-quality industrial robots but is often unavailable in inexpensive robots operating in unstructured environments. In this paper, we ask: to what extent can a fast, single-pass regression architecture perform visual proprioception from a single external camera image, available even in the simplest manipulation settings? We explore several latent representations, including CNNs, VAEs, ViTs, and bags of uncalibrated fiducial markers, using fine-tuning techniques adapted to the limited data available. We evaluate the achievable accuracy through experiments on an inexpensive 6-DoF robot.
TiROD: Tiny Robotics Dataset and Benchmark for Continual Object Detection
Detecting objects with visual sensors is crucial for numerous mobile robotics applications, from autonomous navigation to inspection. However, robots often need to operate under significant domain shifts from the conditions they were trained in, requiring them to adjust to these changes. Tiny mobile robots, subject to size, power, and computational constraints, face even greater challenges when running and adapting detection models on low-resolution and noisy images. Such adaptability, though, is crucial for real-world deployment, where robots must operate effectively in dynamic and unpredictable settings. In this work, we introduce a new vision benchmark to evaluate lightweight continual learning strategies tailored to the unique characteristics of tiny robotic platforms. Our contributions include: (i) Tiny Robotics Object Detection (TiROD), a challenging video dataset collected using the onboard camera of a small mobile robot, designed to test object detectors across various domains and classes; (ii) a comprehensive benchmark of several continual learning strategies on different scenarios using NanoDet, a lightweight, real-time object detector for resource-constrained devices. Our results highlight some key challenges in developing robust and efficient continual learning strategies for object detectors in tiny robotics.
Learning Transferable Friction Models and LuGre Identification Via Physics-Informed Neural Networks
Accurately modeling friction in robotics remains a core challenge, as robotics simulators like MuJoCo and PyBullet use simplified friction models or heuristics to balance computational efficiency with accuracy, where these simplifications and approximations can lead to substantial differences between simulated and physical performance. In this paper, we present a physics-informed friction estimation framework that enables the integration of well-established friction models with learnable components, requiring only minimal, generic measurement data. Our approach enforces physical consistency yet retains the flexibility to capture complex friction phenomena. We demonstrate, on an underactuated and nonlinear system, that the learned friction models, trained solely on small and noisy datasets, accurately reproduce dynamic friction properties with significantly higher fidelity than the simplified models commonly used in robotics simulators. Crucially, we show that our approach enables the learned models to be transferable to systems they are not trained on. This ability to generalize across multiple systems streamlines friction modeling for complex, underactuated tasks, offering a scalable and interpretable path toward improving friction model accuracy in robotics and control.
comment: 7 pages, 8 figures, Accepted to 2026 American Control Conference (ACC)
MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations.
comment: Project page: https://robotic-mla.github.io/
PLM-Net: Perception Latency Mitigation Network for Vision-Based Lateral Control of Autonomous Vehicles
This study introduces the Perception Latency Mitigation Network (PLM-Net), a modular deep learning framework designed to mitigate perception latency in vision-based imitation-learning lane-keeping systems. Perception latency, defined as the delay between visual sensing and steering actuation, can degrade lateral tracking performance and steering stability. While delay compensation has been extensively studied in classical predictive control systems, its treatment within vision-based imitation-learning architectures under constant and time-varying perception latency remains limited. Rather than reducing latency itself, PLM-Net mitigates its effect on control performance through a plug-in architecture that preserves the original control pipeline. The framework consists of a frozen Base Model (BM), representing an existing lane-keeping controller, and a Timed Action Prediction Model (TAPM), which predicts future steering actions corresponding to discrete latency conditions. Real-time mitigation is achieved by interpolating between model outputs according to the measured latency value, enabling adaptation to both constant and time-varying latency. The framework is evaluated in a closed-loop deterministic simulation environment under fixed-speed conditions to isolate the impact of perception latency. Results demonstrate significant reductions in steering error under multiple latency settings, achieving up to 62% and 78% reductions in Mean Absolute Error (MAE) for constant and time-varying latency cases, respectively. These findings demonstrate the architectural feasibility of modular latency mitigation for vision-based lateral control under controlled simulation settings. The project page including video demonstrations, code, and dataset is publicly released.
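The latency-dependent interpolation described in the PLM-Net abstract can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it assumes piecewise-linear interpolation between steering predictions made for a sorted set of discrete latency conditions, clamps latencies outside that range to the nearest condition, and uses a hypothetical function name and made-up numeric values.

```python
from bisect import bisect_right


def interpolate_steering(latencies, steering, measured_latency):
    """Blend per-latency steering predictions for the measured latency.

    latencies: ascending list of discrete latency conditions (seconds).
    steering:  predicted steering action for each latency condition.
    Assumption: linear interpolation, clamped outside the covered range.
    """
    if measured_latency <= latencies[0]:
        return steering[0]
    if measured_latency >= latencies[-1]:
        return steering[-1]

    # Index of the first latency condition strictly above the measurement.
    hi = bisect_right(latencies, measured_latency)
    lo = hi - 1

    # Interpolation weight between the two neighboring latency conditions.
    w = (measured_latency - latencies[lo]) / (latencies[hi] - latencies[lo])
    return (1.0 - w) * steering[lo] + w * steering[hi]


# Hypothetical predictions for 0 ms, 100 ms, and 200 ms of perception latency.
latencies = [0.0, 0.1, 0.2]
steering = [0.0, 0.2, 0.5]
print(interpolate_steering(latencies, steering, 0.15))
```

Because the weight is recomputed from the measured latency on every call, the same routine covers both constant and time-varying latency.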
Developing a Discrete-Event Simulator of School Shooter Behavior from VR Data
Virtual reality (VR) has emerged as a powerful tool for evaluating school security measures in high-risk scenarios such as school shootings, offering experimental control and high behavioral fidelity. However, assessing new interventions in VR requires recruiting new participant cohorts for each condition, making large-scale or iterative evaluation difficult. These limitations are especially restrictive when attempting to learn effective intervention strategies, which typically require many training episodes. To address this challenge, we develop a data-driven discrete-event simulator (DES) that models shooter movement and in-region actions as stochastic processes learned from participant behavior in VR studies. We use the simulator to examine the impact of a robot-based shooter intervention strategy. Once shown to reproduce key empirical patterns, the DES enables scalable evaluation and learning of intervention strategies that are infeasible to train directly with human subjects. Overall, this work demonstrates a high-to-mid fidelity simulation workflow that provides a scalable surrogate for developing and evaluating autonomous school-security interventions.
comment: Accepted for presentation at ANNSIM 2026. Camera-ready version. 13 pages, 4 figures, 4 tables
Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning
Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning, while Planning Domain Definition Language (PDDL) planners excel at formal long-horizon planning but cannot interpret visual inputs. Recent works combine these complementary advantages by translating visual problems into PDDL. However, while VLMs can generate PDDL problem files satisfactorily, accurately generating PDDL domain files, which encode planning rules, remains challenging and typically requires human expertise or environment interaction. We propose VLMFP, a Dual-VLM-guided framework that autonomously generates both PDDL problem and domain files for formal visual planning. VLMFP combines a SimVLM that simulates action consequences with a GenVLM that generates and iteratively refines PDDL files by aligning symbolic execution with simulated outcomes, enabling multiple levels of generalization across unseen instances, visual appearances, and game rules. We evaluate VLMFP on 6 grid-world domains and demonstrate its generalization capability. On average, SimVLM achieves 87.3% and 86.0% scenario understanding and action simulation for seen and unseen appearances, respectively. With the guidance of SimVLM, VLMFP attains 70.0% and 54.1% planning success on unseen instances in seen and unseen appearances, respectively. We further demonstrate that VLMFP scales to complex long-horizon 3D planning tasks, including multi-robot collaboration and assembly scenarios with partial observability and diverse visual variations. Project page: https://sites.google.com/view/vlmfp.
comment: 40 pages, 6 figures, 13 tables
AsgardBench -- Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?
comment: 19 figures, 6 tables, including appendix
Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure
Embodied intelligence is a key step towards Artificial General Intelligence (AGI), yet its development faces multiple challenges including data, frameworks, infrastructure, and evaluation systems. To address these issues, we have, for the first time in the industry, launched a cloud-based, thousand-GPU distributed training platform for embodied intelligence, built upon the widely adopted LeRobot framework, and have systematically overcome bottlenecks across the entire pipeline. At the data layer, we have restructured the data pipeline to optimize the flow of embodied training data. In terms of training, for the GR00T-N1.5 model, utilizing thousand-GPU clusters and data at the scale of hundreds of millions, the single-round training time has been reduced from 15 hours to just 22 minutes, achieving a 40-fold speedup. At the model layer, by combining variable-length FlashAttention and Data Packing, we have moved from sample redundancy to sequence integration, resulting in a 188% speed increase; π-0.5 attention optimization has accelerated training by 165%; and FP8 quantization has delivered a 140% speedup. On the infrastructure side, relying on high-performance storage, a 3.2T RDMA network, and a Ray-driven elastic AI data lake, we have achieved deep synergy among data, storage, communication, and computation. We have also built an end-to-end evaluation system, creating a closed loop from training to simulation to assessment. This framework has already been fully validated on thousand-GPU clusters, laying a crucial technical foundation for the development and application of next-generation autonomous intelligent robots, and is expected to accelerate the arrival of the era of human-machine integration.
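The headline speedup in the abstract above can be checked with a line of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the two readings of a percentage speedup are assumptions, since the abstract does not say which convention its 188%, 165%, and 140% figures follow.

```python
def speedup_factor(baseline_minutes, optimized_minutes):
    """Ratio of old wall-clock time to new wall-clock time."""
    return baseline_minutes / optimized_minutes


def factor_from_percent(percent, reading):
    """Convert a quoted percentage into a multiplicative factor.

    reading="increase": throughput grows BY `percent` (188% -> 2.88x).
    reading="ratio":    new throughput IS `percent` of the old (188% -> 1.88x).
    Which reading applies is an assumption; the abstract does not specify it.
    """
    if reading == "increase":
        return 1.0 + percent / 100.0
    if reading == "ratio":
        return percent / 100.0
    raise ValueError("reading must be 'increase' or 'ratio'")


# 15 hours down to 22 minutes, as stated for GR00T-N1.5 training.
gr00t_speedup = speedup_factor(15 * 60, 22)
print(f"GR00T-N1.5 single-round speedup: {gr00t_speedup:.1f}x")

# The two possible factors behind the quoted 188% figure.
print(factor_from_percent(188, "increase"), factor_from_percent(188, "ratio"))
```

The ratio 900 / 22 is about 40.9, which agrees with the stated 40-fold figure after rounding.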
Multiagent Systems
Bringing Network Coding into Multi-Robot Systems: Interplay Study for Autonomous Systems over Wireless Communications
Communication is a core enabler for multi-robot systems (MRS), providing the mechanism through which robots exchange state information, coordinate actions, and satisfy safety constraints. While many MRS autonomy algorithms assume reliable and timely message delivery, realistic wireless channels introduce delay, erasures, and ordering stalls that can degrade performance and compromise safety-critical decisions of the robot task. In this paper, we investigate how transport-layer reliability mechanisms that mitigate communication losses and delays shape the autonomy-communication loop. We show that conventional non-coded retransmission-based protocols introduce long delays that are misaligned with the timeliness requirements of MRS applications, and may render the received data irrelevant. As an alternative, we advocate for adaptive and causal network coding, which proactively injects coded redundancy to achieve the desired delay and throughput that enable relevant data delivery to the robotic task. Specifically, this method adapts to channel conditions between robots and causally tunes the communication rates via efficient algorithms. We present two case studies: cooperative localization under delayed and lossy inter-robot communication, and a safety-critical overtaking maneuver where timely vehicle-to-vehicle message availability determines whether an ego vehicle can abort to avoid a crash. Our results demonstrate that coding-based communication significantly reduces in-order delivery stalls, preserves estimation consistency under delay, and improves deadline reliability relative to retransmission-based transport. Overall, the study highlights the need to jointly design autonomy algorithms and communication mechanisms, and positions network coding as a principled tool for dependable multi-robot operation over wireless networks.
Is Your LLM-as-a-Recommender Agent Trustable? LLMs' Recommendation is Easily Hacked by Biases (Preferences)
Large Language Models (LLMs) are increasingly deployed in practically valuable agentic workflows such as Deep Research, e-commerce recommendation, and job recruitment. In these applications, LLMs need to select optimal solutions from massive candidate pools, which we term the LLM-as-a-Recommender paradigm. However, the reliability of using LLM agents for recommendations is underexplored. In this work, we introduce a Bias Recommendation Benchmark (BiasRecBench) to highlight the critical vulnerability of such agents to biases in high-value real-world tasks. The benchmark includes three practical domains: paper review, e-commerce, and job recruitment. We construct a Bias Synthesis Pipeline with Calibrated Quality Margins that 1) synthesizes evaluation data by controlling the quality gap between optimal and sub-optimal options to provide a calibrated testbed to elicit the vulnerability to biases; 2) injects contextual biases that are logical and suitable for option contexts. Extensive experiments on both SOTA (Gemini-{2.5,3}-pro, GPT-4o, DeepSeek-R1) and small-scale LLMs reveal that agents frequently succumb to injected biases despite having sufficient reasoning capabilities to identify the ground truth. These findings expose a significant reliability bottleneck in current agentic workflows, calling for specialized alignment strategies for LLM-as-a-Recommender. The complete code and evaluation datasets will be made publicly available shortly.
Agentic Cognitive Profiling: Realigning Automated Alzheimer's Disease Detection with Clinical Construct Validity
Automated Alzheimer's Disease (AD) screening has predominantly followed the inductive paradigm of pattern recognition, which directly maps the input signal to the outcome label. This paradigm sacrifices construct validity of clinical protocol for statistical shortcuts. This paper proposes Agentic Cognitive Profiling (ACP), an agentic framework that realigns automated screening with clinical protocol logic across multiple cognitive domains. Rather than learning opaque mappings from transcripts to labels, the framework decomposes standardized assessments into atomic cognitive tasks and orchestrates specialized LLM agents to extract verifiable scoring primitives. Central to our design is decoupling semantic understanding from measurement by delegating all quantification to deterministic function calling, thereby mitigating hallucination and restoring construct validity. Unlike popular datasets that typically comprise around a hundred participants under a single task, we evaluate on a clinically-annotated corpus of 402 participants across eight structured cognitive tasks spanning multiple cognitive domains. The framework achieves 90.5% score match rate in task examination and 85.3% accuracy in AD prediction, surpassing popular baselines while generating interpretable cognitive profiles grounded in behavioral evidence. This work demonstrates that construct validity and predictive performance need not be traded off, charting a path toward AD screening systems that explain rather than merely predict.
Distributed Equilibrium-Seeking in Target Coverage Games via Self-Configurable Networks under Limited Communication
We study a target coverage problem in which a team of sensing agents, operating under limited communication, must collaboratively monitor targets that may be adaptively repositioned by an attacker. We model this interaction as a zero-sum game between the sensing team (known as the defender) and the attacker. However, computing an exact Nash equilibrium (NE) for this game is computationally prohibitive as the action space of the defender grows exponentially with the number of sensors and their possible orientations. Exploiting the submodularity property of the game's utility function, we propose a distributed framework that enables agents to self-configure their communication neighborhoods under bandwidth constraints and collaboratively maximize the target coverage. We establish theoretical guarantees showing that the resulting sensing strategies converge to an approximate NE of the game. To our knowledge, this is the first distributed, communication-aware approach that scales effectively for games with combinatorial action spaces while explicitly incorporating communication constraints. To this end, we leverage the distributed bandit-submodular optimization framework and the notion of Value of Coordination that were introduced in [1]. Through simulations, we show that our approach attains near-optimal game value and higher target coverage compared to baselines.
ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization
Reducing latency and energy consumption is critical to improving the efficiency of memory systems in modern computing. This work introduces ReLMXEL (Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization), an explainable multi-agent online reinforcement learning framework that dynamically optimizes memory controller parameters using reward decomposition. ReLMXEL operates within the memory controller, leveraging detailed memory behavior metrics to guide decision-making. Experimental evaluations across diverse workloads demonstrate consistent performance gains over baseline configurations, with refinements driven by workload-specific memory access behavior. By incorporating explainability into the learning process, ReLMXEL not only enhances performance but also increases the transparency of control decisions, paving the way for more accountable and adaptive memory system designs.
Actionable Recourse in Competitive Environments: A Dynamic Game of Endogenous Selection
Actionable recourse studies whether individuals can modify feasible features to overturn unfavorable outcomes produced by AI-assisted decision-support systems. However, many such systems operate in competitive settings, such as admission or hiring, where only a fraction of candidates can succeed. A fundamental question arises: what happens when actionable recourse is available to everyone in a competitive environment? This study proposes a framework that models recourse as a strategic interaction among candidates under a risk-based selection rule. Rejected individuals exert effort to improve actionable features along directions implied by the decision rule, while the success benchmark evolves endogenously as many candidates adjust simultaneously. This creates endogenous selection, in which both the decision rule and the selection threshold are determined by the population's current feature state. This interaction generates a closed-loop dynamical system linking candidate selection and strategic recourse. We show that the initially selected candidates determine both the benchmark of success and the direction of improvement, thereby amplifying initial disparities and producing persistent performance gaps across the population.
Governed Memory: A Production Architecture for Multi-Agent Workflows
Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops. We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open-set atomic facts with schema-enforced typed properties; tiered governance routing with progressive context delivery; reflection-bounded retrieval with entity-scoped isolation; and a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement. We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.
comment: 18 pages, 4 figures, 11 tables, 7 appendices. Code and datasets: https://github.com/personizeai/governed-memory
In Trust We Survive: Emergent Trust Learning
We introduce Emergent Trust Learning (ETL), a lightweight, trust-based control algorithm that can be plugged into existing AI agents. It enables these to reach cooperation in competitive game environments under shared resources. Each agent maintains a compact internal trust state, which modulates memory, exploration, and action selection. ETL requires only individual rewards and local observations and incurs negligible computational and communication overhead. We evaluate ETL in three environments: In a grid-based resource world, trust-based agents reduce conflicts and prevent long-term resource depletion while achieving competitive individual returns. In a hierarchical Tower environment with strong social dilemmas and randomised floor assignments, ETL sustains high survival rates and recovers cooperation even after extended phases of enforced greed. In the Iterated Prisoner's Dilemma, the algorithm generalises to a strategic meta-game, maintaining cooperation with reciprocal opponents while avoiding long-term exploitation by defectors. Code will be released upon publication.
HRI-SA: A Multimodal Dataset for Online Assessment of Human Situational Awareness during Remote Human-Robot Teaming
Maintaining situational awareness (SA) is critical in human-robot teams. Yet, under high workload and dynamic conditions, operators often experience SA gaps. Automated detection of SA gaps could provide timely assistance for operators. However, conventional SA measures either disrupt task flow or cannot capture real-time fluctuations, limiting their operational utility. To the best of our knowledge, no publicly available dataset currently supports the systematic evaluation of online human SA assessment in human-robot teaming. To advance the development of online SA assessment tools, we introduce HRI-SA, a multimodal dataset from 30 participants in a realistic search-and-rescue human-robot teaming context, incorporating eye movements, pupil diameter, biosignals, user interactions, and robot data. The experimental protocol included predefined events requiring timely operator assistance, with ground truth SA latency of two types (perceptual and comprehension) systematically obtained by measuring the time between assistance need onset and resolution. We illustrate the utility of this dataset by evaluating standard machine learning models for detecting perceptual SA latencies using generic eye-tracking features and contextual features. Results show that eye-tracking features alone effectively classified perceptual SA latency (recall=88.91%, F1=67.63%) using leave-one-group-out cross-validation, with performance improved through contextual data fusion (recall=91.51%, F1=80.38%). This paper contributes the first public dataset supporting the systematic evaluation of SA throughout a human-robot teaming mission, while also demonstrating the potential of generic eye-tracking features for continuous perceptual SA latency detection in remote human-robot teaming.
comment: This work is currently under peer review
MemArchitect: A Policy Driven Memory Governance Layer
Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as passive storage, lacking mechanisms to resolve contradictions, enforce privacy, or prevent outdated information ("zombie memories") from contaminating the context window. We introduce MemArchitect, a governance layer that decouples memory lifecycle management from model weights. MemArchitect enforces explicit, rule-based policies, including memory decay, conflict resolution, and privacy controls. We demonstrate that governed memory consistently outperforms unmanaged memory in agentic settings, highlighting the necessity of structured memory governance for reliable and safe autonomous systems.
comment: This is ongoing research and will be updated periodically
A Trace-Based Assurance Framework for Agentic AI Orchestration: Contracts, Testing, and Governance
In Agentic AI, Large Language Models (LLMs) are increasingly used in the orchestration layer to coordinate multiple agents and to interact with external services, retrieval components, and shared memory. In this setting, failures are not limited to incorrect final outputs. They also arise from long-horizon interaction, stochastic decisions, and external side effects (such as API calls, database writes, and message sends). Common failures include non-termination, role drift, propagation of unsupported claims, and attacks via untrusted context or external channels. This paper presents an assurance framework for such Agentic AI systems. Executions are instrumented as Message-Action Traces (MAT) with explicit step and trace contracts. Contracts provide machine-checkable verdicts, localize the first violating step, and support deterministic replay. The framework includes stress testing, formulated as a budgeted counterexample search over bounded perturbations. It also supports structured fault injection at service, retrieval, and memory boundaries to assess containment under realistic operational faults and degraded conditions. Finally, governance is treated as a runtime component, enforcing per-agent capability limits and action mediation (allow, rewrite, block) at the language-to-action boundary. To support comparative evaluations across stochastic seeds, models, and orchestration configurations, the paper defines trace-based metrics for task success, termination reliability, contract compliance, factuality indicators, containment rate, and governance outcome distributions. More broadly, the framework is intended as a common abstraction to support testing and evaluation of multi-agent LLM systems, and to facilitate reproducible comparison across orchestration designs and configurations.
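The idea of step contracts that return a machine-checkable verdict and localize the first violating step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Step` fields, the contract names, and the example trace are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One entry in a hypothetical message-action trace."""
    agent: str
    action: str   # e.g. "message", "api_call", "db_write"
    payload: str


def first_violation(trace, contracts):
    """Return (step index, contract name) for the first violating step, or None.

    `contracts` is a list of (name, predicate) pairs; a predicate returns
    True when the step satisfies the contract.
    """
    for index, step in enumerate(trace):
        for name, holds in contracts:
            if not holds(step):
                return index, name
    return None


# Hypothetical step contracts: a capability limit and a payload bound.
contracts = [
    ("researcher_never_writes_db",
     lambda s: not (s.agent == "researcher" and s.action == "db_write")),
    ("payload_under_200_chars", lambda s: len(s.payload) < 200),
]

trace = [
    Step("planner", "message", "split the task into two subtasks"),
    Step("researcher", "db_write", "INSERT INTO notes VALUES ('draft')"),
    Step("planner", "message", "x" * 500),
]

verdict = first_violation(trace, contracts)
print(verdict)
```

Because the check is a pure function of the recorded trace, replaying the same trace yields the same verdict, which is the property deterministic replay relies on.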
FACET: Teacher-Centred LLM-Based Multi-Agent Systems-Towards Personalized Educational Worksheets
The increasing heterogeneity of student populations poses significant challenges for teachers, particularly in mathematics education, where cognitive, motivational, and emotional differences strongly influence learning outcomes. While AI-driven personalization tools have emerged, most remain performance-focused, offering limited support for teachers and neglecting broader pedagogical needs. This paper presents the FACET framework, a teacher-facing, large language model (LLM)-based multi-agent system designed to generate individualized classroom materials that integrate both cognitive and motivational dimensions of learner profiles. The framework comprises three specialized agents: (1) learner agents that simulate diverse profiles incorporating topic proficiency and intrinsic motivation, (2) a teacher agent that adapts instructional content according to didactical principles, and (3) an evaluator agent that provides automated quality assurance. We tested the system using authentic grade 8 mathematics curriculum content and evaluated its feasibility through a) automated agent-based assessment of output quality and b) exploratory feedback from K-12 in-service teachers. Results from ten internal evaluations highlighted high stability and alignment between generated materials and learner profiles, and teacher feedback particularly highlighted structure and suitability of tasks. The findings demonstrate the potential of multi-agent LLM architectures to provide scalable, context-aware personalization in heterogeneous classroom settings, and outline directions for extending the framework to richer learner profiles and real-world classroom trials.
Swarm Self Clustering for Communication denied Environments without Global Positioning
In this work, we investigate swarm self-clustering, where robots autonomously organize into spatially coherent groups using only local sensing and decision-making, without external commands, global positioning, or inter-robot communication. Each robot forms and maintains clusters by responding to relative distances from nearby neighbors detected through onboard range sensors with limited fields of view. The method is suited for GPS-denied and communication-constrained environments and requires no prior knowledge of cluster size, number, or membership. A mechanism enables robots to alternate between consensus-based and random goal assignment based on local neighborhood size, ensuring robustness, scalability, and untraceable clustering independent of initial conditions. Extensive simulations and real-robot experiments demonstrate empirical convergence, adaptability to dynamic additions, and improved performance over local-only baselines across standard cluster quality metrics.
comment: 36 Pages, 15 figures, 8 tables, pre-print version
Communication to Completion: Modeling Collaborative Workflows with Intelligent Multi-Agent Communication
Multi-agent LLM systems have demonstrated impressive capabilities in complex collaborative tasks, yet most frameworks treat communication as instantaneous and free, overlooking a fundamental constraint in real-world teamwork: collaboration cost. We propose a scalable framework implemented via Communication to Completion (C2C), which explicitly models communication as a constrained resource with realistic temporal costs. We introduce the Alignment Factor (AF), a dynamic metric inspired by Shared Mental Models, to quantify the link between task understanding and work efficiency. Through experiments on 15 software engineering workflows spanning three complexity tiers and team sizes from 5 to 17 agents, we demonstrate that cost-aware strategies achieve over 40% higher efficiency compared to unconstrained interaction. Our analysis reveals emergent coordination patterns: agents naturally adopt manager-centric hub-and-spoke topologies, strategically escalate from asynchronous to synchronous channels based on complexity, and prioritize high-value help requests. These patterns remain consistent across multiple frontier models (GPT-5.2, Claude Sonnet 4.5, Gemini 2.5 Pro). This study moves beyond simple agent construction, offering a theoretical foundation for quantifying and optimizing the dynamics of collaboration in future digital workplaces.
comment: 13 pages
When Openclaw Agents Learn from Each Other: Insights from Emergent AI Agent Communities for Human-AI Partnership in Education
The AIED community envisions AI evolving "from tools to teammates," yet our understanding of AI teammates remains limited to dyadic human-AI interactions. We offer a different vantage point: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Drawing on a month of daily qualitative observations across multiple platforms including Moltbook, The Colony, and 4claw, we identify four phenomena with implications for AIED: (1) humans who configure their agents undergo a "bidirectional scaffolding" process, learning through teaching; (2) peer learning emerges without any designed curriculum, complete with idea cascades and quality hierarchies; (3) agents converge on shared memory architectures that mirror open learner model design; and (4) trust dynamics and platform mortality reveal design constraints for networked educational AI. Rather than presenting empirical findings, we argue that these organic phenomena offer a naturalistic window into dynamics that can inform principled design of multi-agent educational systems. We sketch an illustrative curriculum design, "Learn by Teaching Your AI Agent Teammate," and outline potential research directions and open problems to show how these observations might inform future AIED practice and inquiry.
comment: 14 pages, 4 figures
ORCA: ORchestrating Causal Agent
Causal analysis on relational databases is challenging, as analysis datasets must be repeatedly queried from complex schemas. Recent LLM systems can automate individual steps, but they struggle to manage dependencies across analysis stages, making it difficult to preserve consistency between causal hypotheses. We propose ORCA (ORchestrating Causal Agent), an interactive multi-agent framework to enable coherent causal analysis on relational databases by maintaining shared state and introducing human checkpoints. In a controlled user study, participants using ORCA successfully completed end-to-end analysis more often than with a baseline LLM (GPT-4o-mini) assistant by 42 percentage points, achieved substantially lower ATE error, and reduced time spent on repetitive data exploration and query refinement by 76% on average. These results show that ORCA improves both how users interact with the causal analysis pipeline and the reliability of the resulting causal conclusions.
comment: 35 pages, CHI EA 2026
Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models
In disaster scenarios, establishing robust emergency communication networks is critical, and unmanned aerial vehicles (UAVs) offer a promising solution to rapidly restore connectivity. However, organizing UAVs to form multi-hop networks in large-scale dynamic environments presents significant challenges, including limitations in algorithmic scalability and the vast exploration space required for coordinated decision-making. To address these issues, we propose MRLMN, a novel framework that integrates multi-agent reinforcement learning (MARL) and large language models (LLMs) to jointly optimize UAV agents toward achieving optimal networking performance. The framework incorporates a grouping strategy with reward decomposition to enhance algorithmic scalability and balance decision-making across UAVs. In addition, behavioral constraints are applied to selected key UAVs to improve the robustness of the network. Furthermore, the framework integrates LLM agents, leveraging knowledge distillation to transfer their high-level decision-making capabilities to MARL agents. This enhances both the efficiency of exploration and the overall training process. In the distillation module, a Hungarian algorithm-based matching scheme is applied to align the decision outputs of the LLM and MARL agents and define the distillation loss. Extensive simulation results validate the effectiveness of our approach, demonstrating significant improvements in network performance over the MAPPO baseline and other comparison methods, including enhanced coverage and communication quality.
comment: 18 pages, 23 figures
Forecast-Aware Cooperative Planning on Temporal Graphs under Stochastic Adversarial Risk
Cooperative multi-robot missions often require teams of robots to traverse environments where traversal risk evolves due to adversary patrols or shifting hazards with stochastic dynamics. While support coordination--where robots assist teammates in traversing risky regions--can significantly reduce mission costs, its effectiveness depends on the team's ability to anticipate future risk. Existing support-based frameworks assume static risk landscapes and therefore fail to account for predictable temporal trends in risk evolution. We propose a forecast-aware cooperative planning framework that integrates stochastic risk forecasting with anticipatory support allocation on temporal graphs. By modeling adversary dynamics as a first-order Markov stay-move process over graph edges, we propagate the resulting edge-occupancy probabilities forward in time to generate time-indexed edge-risk forecasts. These forecasts guide the proactive allocation of support positions to forecasted risky edges for effective support coordination, while also informing joint robot path planning. Experimental results demonstrate that our approach consistently reduces total expected team cost compared to non-anticipatory baselines, approaching the performance of an oracle planner.
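The forecasting step described above, a first-order Markov stay-move process over graph edges propagated forward in time, can be sketched as follows. This is a minimal illustration and not the authors' model: the uniform move to adjacent edges, the `p_stay` value, and the toy path graph are assumptions made for the example.

```python
def edge_adjacency(edges):
    """Map each edge to the other edges that share an endpoint with it."""
    adjacency = {e: [] for e in edges}
    for e in edges:
        for f in edges:
            if e != f and set(e) & set(f):
                adjacency[e].append(f)
    return adjacency


def step(occupancy, adjacency, p_stay):
    """One Markov step: stay with p_stay, else move uniformly to an adjacent edge."""
    nxt = {e: 0.0 for e in occupancy}
    for e, p in occupancy.items():
        neighbors = adjacency[e]
        if not neighbors:  # isolated edge: the adversary can only stay
            nxt[e] += p
            continue
        nxt[e] += p_stay * p
        share = (1.0 - p_stay) * p / len(neighbors)
        for f in neighbors:
            nxt[f] += share
    return nxt


def forecast(initial, adjacency, p_stay, horizon):
    """Return edge-occupancy distributions for t = 0, 1, ..., horizon."""
    out = [dict(initial)]
    for _ in range(horizon):
        out.append(step(out[-1], adjacency, p_stay))
    return out


# Toy path graph a-b-c-d; the adversary starts on edge (a, b).
edges = [("a", "b"), ("b", "c"), ("c", "d")]
adjacency = edge_adjacency(edges)
initial = {e: 0.0 for e in edges}
initial[("a", "b")] = 1.0

forecasts = forecast(initial, adjacency, p_stay=0.6, horizon=3)
for t, dist in enumerate(forecasts):
    print(t, {e: round(p, 3) for e, p in dist.items()})
```

Each time-indexed distribution can then serve as an edge-risk forecast: a planner would compare the forecast occupancy of an edge at the time a robot expects to traverse it, rather than the occupancy observed now.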
Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
Large-scale networked multi-agent systems increasingly underpin critical infrastructure, yet their collective behavior can drift toward undesirable emergent norms such as collusion, resource hoarding, and implicit unfairness. We present the Adaptive Accountability Framework (AAF), an end-to-end runtime layer that (i) records cryptographically verifiable interaction provenance, (ii) detects distributional change points in streaming traces, (iii) attributes responsibility via a causal influence graph, and (iv) applies cost-bounded interventions (reward shaping and targeted policy patching) to steer the system back toward compliant behavior. We establish a bounded-compromise guarantee: if the expected cost of intervention exceeds an adversary's expected payoff, the long-run fraction of compromised interactions converges to a value strictly below one. We evaluate AAF in a large-scale factorial simulation suite (87,480 runs across two tasks; up to 100 agents plus a 500-agent scaling sweep; full and partial observability; Byzantine rates up to 10%; 10 seeds per regime). Across 324 regimes, AAF lowers the executed compromise ratio relative to a Proximal Policy Optimization baseline in 96% of regimes (median relative reduction 11.9%) while preserving social welfare (median change 0.4%). Under adversarial injections, AAF detects norm violations with a median delay of 71 steps (interquartile range 39-177) and achieves a mean top-ranked attribution accuracy of 0.97 at 10% Byzantine rate.
Game-Theoretic Coordination for Time-Critical Missions of UAV Systems
Coordinated missions involving Unmanned Aerial Vehicles (UAVs) in dynamic environments pose significant challenges in maintaining both coordination and agility. In this paper, relying on the cooperative path following framework and using a game-theoretic formulation, we introduce a novel and scalable approach in which each UAV acts autonomously in different mission conditions. This formulation naturally accommodates heterogeneous and time-varying objectives across the system. In our setting, each UAV optimizes a cost function that incorporates temporal and mission-specific constraints. The optimization is performed within a one-dimensional domain, significantly reducing the computational cost and enabling real-time application to complex and dynamic scenarios. The framework is distributed in structure, enabling global, system-wide coordination (a Nash equilibrium) by using only local information. For ideal systems, we prove that the Nash equilibrium exists and that the system converges to it exponentially. Furthermore, we invoke model predictive control (MPC) for non-ideal scenarios. In particular, we propose a discrete-time optimization approach that tackles path-following errors and communication failures, ensuring reliable and agile performance in dynamic and uncertain environments. Simulation results demonstrate the effectiveness and agility of the approach in ensuring successful mission execution across diverse realistic scenarios.
comment: Revised version with improved exposition, expanded introduction, updated abstract, minor corrections and updated author list
Systems and Control (EESS)
Distributed Adaptive Control for DC Power Distribution in Hybrid-Electric Aircraft: Design and Experimental Validation
To reduce CO2 emissions and tackle increasing fuel costs, the aviation industry is swiftly moving towards the electrification of aircraft. From the viewpoint of systems and control, a key challenge brought by this transition corresponds to the management and safe operation of the propulsion system's onboard electrical power distribution network. In this work, for a series-hybrid-electric propulsion system, we propose a distributed adaptive controller for regulating the voltage of a DC bus that energizes the electricity-based propulsion system. The proposed controller -- whose design is based on principles of back-stepping, adaptive, and passivity-based control techniques -- also enables the proportional sharing of the electric load among multiple converter-interfaced sources, which reduces the likelihood of over-stressing individual sources. Compared to existing control strategies, our method ensures stable, convergent, and accurate voltage regulation and load-sharing even if the effects of power lines of unknown resistances and inductances are considered. The performance of the proposed control scheme is experimentally validated and compared to state-of-the-art controllers in a power hardware-in-the-loop (PHIL) environment.
A Tutorial on Learning-Based Radio Map Construction: Data, Paradigms, and Physics-Awareness
The integration of artificial intelligence into next-generation wireless networks necessitates the accurate construction of radio maps (RMs) as a foundational prerequisite for electromagnetic digital twins. An RM provides the digital representation of the wireless propagation environment, mapping complex geographical and topological boundary conditions to critical spatial-spectral metrics that range from received signal strength to full channel state information matrices. This tutorial presents a comprehensive survey of learning-based RM construction, systematically addressing three intertwined dimensions: data, paradigms, and physics-awareness. From the data perspective, we review physical measurement campaigns, ray tracing simulation engines, and publicly available benchmark datasets, identifying their respective strengths and fundamental limitations. From the paradigm perspective, we establish a core taxonomy that categorizes RM construction into source-aware forward prediction and source-agnostic inverse reconstruction, and examine five principal neural architecture families spanning convolutional neural networks, vision transformers, graph neural networks, generative adversarial networks, and diffusion models. We further survey optics-inspired methods adapted from neural radiance fields and 3D Gaussian splatting for continuous wireless radiation field modeling. From the physics-awareness perspective, we introduce a three-level integration framework encompassing data-level feature engineering, loss-level partial differential equation regularization, and architecture-level structural isomorphism. Open challenges including foundation model development, physical hallucination detection, and amortized inference for real-time deployment are discussed to outline future research directions.
From Optimizable to Interactable: Mixed Digital Twin-Empowered Testing of Vehicle-Infrastructure Cooperation Systems
Sufficient testing under corner cases is critical for the long-term operation of vehicle-infrastructure cooperation systems (VICS). However, existing corner-case generation methods are primarily AI-driven, and VICS testing under corner cases is typically limited to simulation. In this paper, we introduce an L5 "Interactable" level to the VICS digital twin (VICS-DT) taxonomy, extending beyond the conventional L4 "Optimizable" level. We further propose an L5-level VICS testing framework, IMPACT (Interactive Mixed-digital-twin Paradigm for Advanced Cooperative vehicle-infrastructure Testing). By enabling direct human interactions with VICS entities, IMPACT incorporates highly uncertain and unpredictable human behaviors into the testing loop, naturally generating high-quality corner cases that complement AI-based methods. Furthermore, the mixedDT-enabled "Physical-Virtual Action Interaction" facilitates safe VICS testing under corner cases, incorporating real-world environments and entities rather than purely in simulation. Finally, we implement IMPACT on the I-VIT (Interactive Vehicle-Infrastructure Testbed), and experiments demonstrate its effectiveness. The experimental videos are available at our project website: https://dongjh20.github.io/IMPACT.
PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis
This paper introduces PowerDAG, an agentic AI system for automating complex distribution-grid analysis. We address the reliability challenges of state-of-the-art agentic systems in automating complex engineering workflows by introducing two innovative active mechanisms: (i) adaptive retrieval, which uses a similarity-decay cutoff algorithm to dynamically select the most relevant annotated exemplars as context, and (ii) just-in-time (JIT) supervision, which actively intercepts and corrects tool-usage violations during execution. On a benchmark of unseen distribution grid analysis queries, PowerDAG achieves a 100% success rate with GPT-5.2 and 94.4-96.7% with smaller open-source models, outperforming base ReAct (41-88%), LangChain (30-90%), and CrewAI (9-41%) baselines by margins of 6-50 percentage points.
Physics-informed Deep Mixture-of-Koopmans Vehicle Dynamics Model with Dual-branch Encoder for Distributed Electric-drive Trucks
Advanced autonomous driving systems require accurate vehicle dynamics modeling. However, identifying a precise dynamics model remains challenging due to strong nonlinearities and the coupled longitudinal and lateral dynamic characteristics. Previous research has employed physics-based analytical models or neural networks to construct vehicle dynamics representations. Nevertheless, these approaches often struggle to simultaneously achieve satisfactory performance in terms of system identification efficiency, modeling accuracy, and compatibility with linear control strategies. In this paper, we propose a fully data-driven dynamics modeling method tailored for complex distributed electric-drive trucks (DETs), leveraging Koopman operator theory to represent highly nonlinear dynamics in a lifted linear embedding space. To achieve high-precision modeling, we first propose a novel dual-branch encoder which encodes dynamic states and provides a powerful basis for the proposed Koopman-based methods entitled KODE. A physics-informed supervision mechanism, grounded in the geometric consistency of temporal vehicle motion, is incorporated into the training process to facilitate effective learning of both the encoder and the Koopman operator. Furthermore, to accommodate the diverse driving patterns of DETs, we extend the vanilla Koopman operator to a mixture-of-Koopman operator framework, enhancing modeling capability. Simulations conducted in a high-fidelity TruckSim environment and real-world experiments demonstrate that the proposed approach achieves state-of-the-art performance in long-term dynamics state estimation.
comment: 13 pages, 8 tables, 7 figures
A Cycle-Based Solvability Condition for Real Power Flow Equations
The solvability condition of the power flow equation is important in operational planning and control as it guarantees the existence and uniqueness of a solution for a given set of power injections. As renewable generation becomes more prevalent, the steady-state operating point of the system changes more frequently, making it increasingly challenging to verify power flow solvability by running the AC power flow solver after each change in power injections. This process can be computationally intensive, and numerical solvers do not always converge reliably to an operational solution. In this paper, we propose a sufficient condition for the solvability of the lossless real power flow equation based on the cycle space of a meshed network. The proposed condition yields a less conservative solvability certificate than existing sufficient conditions on the tested systems and can serve as a useful foundation for developing solvability conditions for the fully coupled power flow equations.
comment: This work has been submitted to the IEEE for possible publication
Real-Time, Crowdsourcing-Enhanced Forecasting of Building Functionality During Urban Floods
Urban flood emergency response increasingly relies on infrastructure impact forecasts rather than hazard variables alone. However, real-time predictions are unreliable due to biased rainfall, incomplete flood knowledge, and sparse observations. Conventional open-loop forecasting propagates impacts without adjusting the system state, causing errors during critical decisions. This study presents CRAF (Crowdsourcing-Enhanced Real-Time Awareness and Forecasting), a physics-informed, closed-loop framework that converts sparse human-sensed evidence into rolling, decision-grade impact forecasts. By coupling physics-based simulation learning with crowdsourced observations, CRAF infers system conditions from incomplete data and propagates them forward to produce multi-step, real-time predictions of zone-level building functionality loss without online retraining. This closed-loop design supports continuous state correction and forward prediction under weakly structured data with low-latency operation. Offline evaluation demonstrates stable generalization across diverse storm scenarios. In operational deployment during Typhoon Haikui (2023) in Fuzhou, China, CRAF reduces 1-3 hour-ahead forecast errors by 84-95% relative to fixed rainfall-driven forecasting and by 73-80% relative to updated rainfall-driven forecasting, while limiting computation to 10 minutes per update cycle. These results show that impact-state alignment, rather than hazard refinement alone, is essential for reliable real-time decision support, providing a pathway toward operational digital twins for resilient urban infrastructure systems.
Distributed Equilibrium-Seeking in Target Coverage Games via Self-Configurable Networks under Limited Communication
We study a target coverage problem in which a team of sensing agents, operating under limited communication, must collaboratively monitor targets that may be adaptively repositioned by an attacker. We model this interaction as a zero-sum game between the sensing team (known as the defender) and the attacker. However, computing an exact Nash equilibrium (NE) for this game is computationally prohibitive as the action space of the defender grows exponentially with the number of sensors and their possible orientations. Exploiting the submodularity property of the game's utility function, we propose a distributed framework that enables agents to self-configure their communication neighborhoods under bandwidth constraints and collaboratively maximize the target coverage. We establish theoretical guarantees showing that the resulting sensing strategies converge to an approximate NE of the game. To our knowledge, this is the first distributed, communication-aware approach that scales effectively for games with combinatorial action spaces while explicitly incorporating communication constraints. To this end, we leverage the distributed bandit-submodular optimization framework and the notion of Value of Coordination that were introduced in [1]. Through simulations, we show that our approach attains near-optimal game value and higher target coverage compared to baselines.
ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization
Reducing latency and energy consumption is critical to improving the efficiency of memory systems in modern computing. This work introduces ReLMXEL (Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization), an explainable multi-agent online reinforcement learning framework that dynamically optimizes memory controller parameters using reward decomposition. ReLMXEL operates within the memory controller, leveraging detailed memory behavior metrics to guide decision-making. Experimental evaluations across diverse workloads demonstrate consistent performance gains over baseline configurations, with refinements driven by workload-specific memory access behaviour. By incorporating explainability into the learning process, ReLMXEL not only enhances performance but also increases the transparency of control decisions, paving the way for more accountable and adaptive memory system designs.
STLts-Div: Diversified Trace Synthesis from STL Specifications Using MILP (Extended Version)
Modern cyber-physical systems are complex, and requirements are often written in Signal Temporal Logic (STL). Writing the right STL is difficult in practice; engineers benefit from concrete executions that illustrate what a specification actually admits. Trace synthesis addresses this need, but a single witness rarely suffices to understand intent or explore edge cases - diverse satisfying behaviors are far more informative. We introduce diversified trace synthesis: the automatic generation of sets of behaviorally diverse traces that satisfy a given STL formula. Building on a MILP encoding of STL and system model, we formalize three complementary diversification objectives - Boolean distance, random Boolean distance, and value distance - all captured by an objective function and solved iteratively. We implement these ideas in STLts-Div, a lightweight Python tool that integrates with Gurobi.
The Geometry of Coordinated Trajectories for Non-stop Flying Carriers Holding a Cable-Suspended Load
This work considers the problem of using multiple aerial carriers to hold a cable-suspended load while remaining in periodic motion at all times. Using a novel differential geometric perspective, it is shown that the problem may be recast as that of finding an immersion of the unit circle into the smooth manifold of admissible configurations. Additionally, this manifold is shown to be path connected under a mild assumption on the attachment points of the carriers to the load. Based on these ideas, a family of simple linear solutions to the original problems is presented that overcomes the constraints of alternative solutions previously proposed in the literature. Simulation results demonstrate the flexibility of the theory in identifying suitable solutions.
comment: 6 pages, 1 figure, submitted to L-CSS
Real-time Coordination of Cascaded Hydroelectric Generation under Decision-Dependent Uncertainties
This paper proposes a real-time control policy for cascaded hydropower systems that incorporates decision-dependent uncertainty (DDU) to capture the coupling of streamflow uncertainties across the network. The framework jointly models exogenous forecast errors and endogenous uncertainty propagation, explicitly characterizing the dependence between upstream releases and downstream inflow variability through a heteroskedastic variance model conditioned on past errors, variance, and control actions. We formulate a joint chance-constrained optimization problem to ensure reliable system operation under uncertainty, and develop a tractable supporting hyperplane algorithm that enables explicit and adaptive risk allocation under DDU. We establish convergence of the proposed method and show that it recovers the Bonferroni approximation under steady-state conditions. A randomized case study based on Columbia River data demonstrates that the proposed framework improves both energy generation and reservoir reliability by accounting for DDU. Sensitivity analyses on drought severity and model parameters further highlight the value of adaptive risk allocation for resilient hydropower operations.
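The Bonferroni approximation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes independent scalar constraints with zero-mean Gaussian errors of known standard deviation, which is a simplification chosen here for illustration and not the heteroskedastic DDU model of the paper. The function names `safety_margins` and `bonferroni_risks` and all numbers are hypothetical.

```python
from statistics import NormalDist


def safety_margins(sigmas, risks):
    """Margin z_(1 - risk_i) * sigma_i for each Gaussian constraint (assumed model)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(1.0 - r) * s for s, r in zip(sigmas, risks)]


def bonferroni_risks(n_constraints, epsilon):
    """Uniform split of the joint risk budget epsilon across all constraints."""
    return [epsilon / n_constraints] * n_constraints


# Hypothetical standard deviations for five individual constraints.
sigmas = [1.0, 2.0, 0.5, 1.5, 1.0]

# Bonferroni: each constraint gets risk 0.05 / 5 = 0.01.
uniform = safety_margins(sigmas, bonferroni_risks(len(sigmas), 0.05))

# A non-uniform allocation with the same total budget: more risk is
# assigned to the noisiest constraint, which shrinks its margin.
skewed_risks = [0.002, 0.04, 0.002, 0.004, 0.002]
skewed = safety_margins(sigmas, skewed_risks)
```

By the union bound, any allocation whose individual risks sum to at most the budget satisfies the joint constraint, so the uniform split is only one admissible choice. An adaptive scheme can trade margin between constraints, as the skewed allocation does for the second constraint.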
RHYME-XT: A Neural Operator for Spatiotemporal Control Systems
We propose RHYME-XT, an operator-learning framework for surrogate modeling of spatiotemporal control systems governed by input-affine nonlinear partial integro-differential equations (PIDEs) with localized rhythmic behavior. RHYME-XT uses a Galerkin projection to approximate the infinite-dimensional PIDE on a learned finite-dimensional subspace with spatial basis functions parameterized by a neural network. This yields a projected system of ODEs driven by projected inputs. Instead of integrating this non-autonomous system, we directly learn its flow map using an architecture for learning flow functions, avoiding costly computations while obtaining a continuous-time and discretization-invariant representation. Experiments on a neural field PIDE show that RHYME-XT outperforms a state-of-the-art neural operator and is able to transfer knowledge effectively across models trained on different datasets, through a fine-tuning process.
comment: 6 pages, 5 figures. Submitted to IEEE Control Systems Letters (L-CSS) and CDC 2026
Koopman Generator Decomposition for Port-Hamiltonian System
We establish a canonical decomposition of the infinitesimal Koopman generator of any port-Hamiltonian (pH) system into skew-adjoint (energy-conserving), positive-semidefinite (dissipative), and input-port components, proving that the generator satisfies an energy-dissipation inequality on a dense subdomain of $L^2(μ)$ for any invariant measure $μ$ satisfying a mild joint-invariance condition stated in Theorem 1. This infinite-dimensional splitting carries over exactly to finite-dimensional Galerkin approximations, yielding structure-constrained surrogate models that provably inherit passivity with a quadratic storage function in the lifted observable space. Leveraging this structure, we design passivity-based controllers directly in the lifted space and establish asymptotic stability of the lifted closed-loop system via LaSalle's invariance principle under a mild detectability condition. For linear pH systems, the decomposition recovers the true pH matrices exactly, confirming that the structural constraints arise naturally from the operator theory rather than being imposed by hand. The framework unifies port-Hamiltonian systems theory and Koopman spectral methods, providing a rigorous operator-theoretic foundation for energy-consistent lifting of nonlinear pH dynamics.
comment: 8 pages; submitted to IEEE Conference on Decision and Control, 2026
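For orientation, the energy-dissipation inequality referred to above has a familiar finite-dimensional counterpart. The following is the standard energy balance for a linear pH system, written in commonly used notation ($J$, $R$, $Q$, $B$, $H$) that is assumed here and not taken from the paper. It is not the operator-level result of the abstract.

```latex
% Linear port-Hamiltonian system with J = -J^T, R = R^T \succeq 0, Q = Q^T \succ 0
\dot{x} = (J - R)\,Q\,x + B\,u, \qquad y = B^{\top} Q\,x, \qquad H(x) = \tfrac{1}{2}\,x^{\top} Q\,x

% Energy balance: the skew-symmetric part drops out because x^T Q J Q x = 0
\frac{\mathrm{d}H}{\mathrm{d}t}
  = x^{\top} Q\,(J - R)\,Q\,x + x^{\top} Q\,B\,u
  = -\,(Qx)^{\top} R\,(Qx) + y^{\top} u
  \;\le\; y^{\top} u
```

The three terms mirror the three components named in the abstract: the skew part conserves energy, the positive-semidefinite part dissipates it, and the port term exchanges it with the environment.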
Certainty-equivalent adaptive MPC for uncertain nonlinear systems
We provide a method to design adaptive controllers for nonlinear systems using model predictive control (MPC). By combining a certainty-equivalent MPC formulation with least-mean-square parameter adaptation, we obtain an adaptive controller with strong robust performance guarantees: The cumulative tracking error and violation of state constraints scale linearly with noise energy, disturbance energy, and path length of parameter variation. A key technical contribution is developing the underlying certainty-equivalent MPC that tracks output references, accounts for actuator limitations and desired state constraints, requires no system-specific offline design, and provides strong inherent robustness properties. This is achieved by leveraging finite-horizon rollouts, artificial references, recent analysis techniques for optimization-based controllers, and soft state constraints. For open-loop stable systems, we derive a semi-global result that applies to arbitrarily large measurement noise, disturbances, and parametric uncertainty. For stabilizable systems, we derive a regional result that is valid within a given region of attraction and for sufficiently small uncertainty. Applicability and benefits are demonstrated with numerical simulations involving systems with large parametric uncertainty: a linear stable chain of mass-spring-dampers and a nonlinear unstable quadrotor navigating obstacles.
comment: Code available at: https://github.com/KohlerJohannes/Adaptive
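The parameter-adaptation step named in the abstract above can be illustrated with a small sketch. It uses a normalized least-mean-square update on a scalar linear-in-parameters model, which is a generic textbook variant chosen here for illustration. The function `lms_step`, the gain, and the data are hypothetical and do not come from the linked repository.

```python
def lms_step(theta, phi, y, gain=0.5):
    """One normalized LMS update for the model y ~ theta . phi."""
    prediction = sum(t * p for t, p in zip(theta, phi))
    error = y - prediction
    # Normalization keeps the step size bounded for large regressors.
    norm = 1.0 + sum(p * p for p in phi)
    return [t + gain * error * p / norm for t, p in zip(theta, phi)]


# Hypothetical true parameters and a repeating set of regressors that
# excites both parameter directions.
true_theta = [2.0, -1.0]
theta = [0.0, 0.0]
regressors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

for k in range(400):
    phi = regressors[k % len(regressors)]
    y = sum(t * p for t, p in zip(true_theta, phi))  # noise-free measurement
    theta = lms_step(theta, phi, y)
```

In a certainty-equivalent scheme, the current estimate `theta` would be inserted into the prediction model at each control step as if it were exact. The sketch covers only the estimator and omits the MPC problem entirely.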
Verification and Validation of Physics-Informed Surrogate Component Models for Dynamic Power-System Simulation
Physics-informed machine learning surrogates are increasingly explored to accelerate dynamic simulation of generators, converters, and other power grid components. The key question, however, is not only whether a surrogate matches a stand-alone component model on average, but whether it remains accurate after insertion into a differential-algebraic simulator, where the surrogate outputs enter the algebraic equations coupling the component to the rest of the system. This paper formulates that in-simulator use as a verification and validation (V&V) problem. A finite-horizon bound is derived that links allowable component-output error to algebraic-coupling sensitivity, dynamic error amplification, and the simulation horizon. Two complementary settings are then studied: model-based verification against a reference component solver, and data-based validation through conformal calibration of the component-output variables exchanged with the simulator. The framework is general, but the case study focuses on physics-informed neural-network surrogates of second-, fourth-, and sixth-order synchronous-machine models. Results show that good stand-alone surrogate accuracy does not by itself guarantee accurate in-simulator behavior, that the largest discrepancies concentrate in stressed operating regions, and that small equation residuals do not necessarily imply small state-trajectory errors.
An HMDP-MPC Decision-making Framework with Adaptive Safety Margins and Hysteresis for Autonomous Driving ICRA 2026
This paper presents a unified decision-making framework that integrates Hybrid Markov Decision Processes (HMDPs) with Model Predictive Control (MPC), augmented by velocity-dependent safety margins and a prediction-aware hysteresis mechanism. Both the ego and surrounding vehicles are modeled as HMDPs, allowing discrete maneuver transition and kinematic evolution to be jointly considered within the MPC optimization. Safety margins derived from the Intelligent Driver Model (IDM) adapt to traffic context but vary with speed, which can cause oscillatory decisions and velocity fluctuations. To mitigate this, we propose a frozen-release hysteresis mechanism with distinct trigger and release thresholds, effectively enlarging the reaction buffer and suppressing oscillations. Decision continuity is further safeguarded by a two-layer recovery scheme: a global bounded relaxation tied to IDM margins and a deterministic fallback policy. The framework is evaluated through a case study, an ablation against a no-hysteresis baseline, and large-scale randomized experiments across 18 traffic settings. Across 8,050 trials, it achieves a collision rate of only 0.05%, with 98.77% of decisions resolved by nominal MPC and minimal reliance on relaxation or fallback. These results demonstrate the robustness and adaptability of the proposed decision-making framework in heterogeneous traffic conditions.
comment: 8 pages, 6 figures, to be published in ICRA 2026 proceedings
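The idea of distinct trigger and release thresholds described above can be sketched as a generic two-threshold latch. This is an illustration of the general hysteresis principle under assumed values, not the frozen-release mechanism of the paper. The class `HysteresisLatch`, the thresholds, and the gap sequence are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HysteresisLatch:
    """Two-threshold latch: engages below `trigger`, releases only above `release`."""
    trigger: float   # hypothetical gap (m) below which the safety mode engages
    release: float   # hypothetical gap (m) above which it disengages (> trigger)
    active: bool = False

    def update(self, gap: float) -> bool:
        if not self.active and gap < self.trigger:
            self.active = True
        elif self.active and gap > self.release:
            self.active = False
        return self.active


def count_switches(states):
    """Number of times consecutive decisions differ."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Hypothetical gap measurements hovering around the 10 m trigger level.
gaps = [12.0, 9.8, 10.3, 9.9, 10.4, 9.7, 11.0, 13.5, 12.0]

latch = HysteresisLatch(trigger=10.0, release=13.0)
with_hysteresis = [latch.update(g) for g in gaps]

# Baseline: a single threshold flips whenever the gap crosses 10 m.
single_threshold = [g < 10.0 for g in gaps]
```

On this sequence the single-threshold rule changes its decision six times, while the latch changes only twice. The band between the two thresholds is what absorbs small fluctuations near the trigger level.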
Data-Driven Predictive Control for Stochastic Descriptor Systems: An Innovation-Based Approach Handling Non-Causal Dynamics
Descriptor systems arise naturally in applications governed by algebraic constraints, such as power networks and chemical processes. The singular system matrix in descriptor systems may introduce non-causal dynamics, where the current output depends on future inputs and, in the presence of stochastic process and measurement noise, on future noise realizations as well. This paper proposes a data-driven predictive control framework for stochastic descriptor systems that accommodates algebraic constraints and impulsive modes without explicit system identification. A causal innovation representation is constructed by augmenting the system state with a noise buffer that encapsulates the non-causal stochastic interactions, transforming the descriptor system into an equivalent proper state-space form. Willems' Fundamental Lemma is then extended to the innovation form with fully data-verifiable conditions. Building on these results, a practical Inno-DeePC algorithm is developed that integrates offline innovation estimation and online predictive control. Numerical experiments on a direct-current (DC) microgrid demonstrate the effectiveness of the proposed approach for stochastic descriptor systems.
comment: 6 pages, 2 figures
Robust Dynamic Pricing and Admission Control with Fairness Guarantees
Dynamic pricing is commonly used to regulate congestion in shared service systems. This paper is motivated by the fact that when heterogeneous user groups (in terms of price responsiveness) are present, conventional monotonic pricing can lead to unfair outcomes by disproportionately excluding price-elastic users, particularly under high or uncertain demand. The paper's contributions are twofold. First, we show that when fairness is imposed as a hard state constraint, the optimal (revenue maximizing) pricing policy is generally non-monotonic in demand. This structural result departs fundamentally from standard surge pricing rules and reveals that price reduction under heavy load may be necessary to maintain equitable access. Second, we address the problem that price elasticity among heterogeneous users is unobservable. To solve it, we develop a robust dynamic pricing and admission control framework that enforces resource capacity and fairness constraints for all user type distributions consistent with aggregate measurements. By integrating integral High Order Control Barrier Functions (iHOCBFs) with a worst case robust optimization framework, we obtain a controller that guarantees forward invariance of safety and fairness constraints while optimizing revenue. Numerical experiments demonstrate improved fairness and revenue performance relative to monotonic surge pricing policies.
Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow
In the emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. Particularly, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs and HDVs, significantly enhancing the experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.
On maximal positive invariant set computation for rank-deficient linear systems
The maximal positively invariant (MPI) set is obtained through a backward reachability procedure involving the iterative computation and intersection of predecessor sets under state and input constraints. However, standard static feedback synthesis may place some of the closed-loop eigenvalues at zero, leading to rank-deficient dynamics. This affects the MPI computation by inducing projections onto lower-dimensional subspaces during intermediate steps. By exploiting the Schur decomposition, we explicitly address this singular case and propose a robust algorithm that computes the MPI set in both polyhedral and constrained-zonotope representations.
Physical Layer Security in Finite Blocklength Massive IoT with Randomly Located Eavesdroppers
This paper analyzes the physical layer security performance of massive uplink Internet of Things (IoT) networks operating under the finite blocklength (FBL) regime. IoT devices and base stations (BS) are modeled using a stochastic geometry approach, while an eavesdropper is placed at a random location around the transmitting device. This system model captures security risks common in dense IoT deployments. Analytical expressions for the secure success probability, secrecy outage probability and secrecy throughput are derived to characterize how stochastic interference, fading and eavesdropper spatial uncertainty interact with FBL constraints in short packet uplink transmissions. Numerical results illustrate key system behavior under different network and channel conditions.
Defending the power grid by segmenting the EV charging cyber infrastructure
This paper examines defending the power grid against load-altering attacks using electric vehicle charging. It proposes to preventively segment the cyber infrastructure that charging station operators (CSOs) use to communicate with and control their charging stations, thereby limiting the impact of successful cyber-attacks. Using real German charging station data and a reconstructed transmission grid model, a threat analysis shows that without segmentation, the successful hack of just two CSOs can overload two transmission grid branches, exceeding the N-1 security margin and necessitating defense measures. A novel defense design problem is then formulated that minimizes the number of imposed segmentations while bounding the number of branch overloads under worst-case attacks. The resulting IP-MILP bi-level problem can be solved with an exact column and constraint generation algorithm and with heuristics for fast computation on large-scale instances. For the near-real-world Germany case, the applicability of the heuristics is demonstrated and validated under relevant load and dispatch scenarios. It is found that the simple scheme of segmenting CSOs evenly by their installed capacity leads to only 23% more segments compared to the heuristic optimization result, suggesting potential relevance as a regulatory measure.
Hierarchical Decision-Making under Uncertainty: A Hybrid MDP and Chance-Constrained MPC Approach
This paper presents a hierarchical decision-making framework for autonomous systems operating under uncertainty, demonstrated through autonomous driving as a representative application. Surrounding agents are modeled using Hybrid Markov Decision Processes (HMDPs) that jointly capture maneuver-level and dynamic-level uncertainties, enabling the multi-modal environmental prediction. The ego agent is modeled using a separate HMDP and integrated into a Model Predictive Control (MPC) framework that unifies maneuver selection with dynamic feasibility within a single optimization. A set of joint chance constraints serves as the bridge between environmental prediction and optimization, incorporating multi-modal environment predictions into the MPC formulation and ensuring safety across all plausible interaction scenarios. The proposed framework provides theoretical guarantees on recursive feasibility and asymptotic stability, and its benefits in terms of safety and efficiency are validated through comprehensive evaluations in highway and urban environments, together with comparisons against a rule-based baseline.
comment: 14 pages, 10 figures
Real-Time Online Learning for Model Predictive Control using a Spatio-Temporal Gaussian Process Approximation ICRA
Learning-based model predictive control (MPC) can enhance control performance by correcting for model inaccuracies, enabling more precise state trajectory predictions than traditional MPC. A common approach is to model unknown residual dynamics as a Gaussian process (GP), which leverages data and also provides an estimate of the associated uncertainty. However, the high computational cost of online learning poses a major challenge for real-time GP-MPC applications. This work presents an efficient implementation of an approximate spatio-temporal GP model, offering online learning at constant computational complexity. It is optimized for GP-MPC, where it enables improved control performance by learning more accurate system dynamics online in real-time, even for time-varying systems. The performance of the proposed method is demonstrated by simulations and hardware experiments in the exemplary application of autonomous miniature racing.
comment: to be published at 2026 IEEE International Conference on Robotics & Automation (ICRA)
Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex, as the outcomes and performance benchmarks of different RL approaches are critically sensitive to environment design, reward structures, and the stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework's capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.
An Extended T-A Formulation Based on Potential-Chain Recursion for Electromagnetic Modeling of Parallel-Wound No-Insulation HTS Coils
Parallel-wound no-insulation (PW-NI) high-temperature superconducting (HTS) coils significantly reduce charging delay while maintaining excellent self-protection capability, demonstrating great potential for high-field applications. Existing models that couple the T-A formulation with equivalent circuits have demonstrated high accuracy in electromagnetic analysis of PW-NI coils. However, eliminating the computational overhead caused by frequent variable mapping and data exchange between electromagnetic and circuit modules is important for improving computational efficiency, particularly in long-duration transient simulations of large-scale magnets. To address this issue, an extended T-A formulation based on potential-chain recursion, termed PCR-TA, is proposed. By directly embedding inter-tape current sharing and radial current bypass behaviors into the finite-element framework, this method computes the transient electromagnetic response of PW-NI coils without requiring an explicit equivalent circuit model. Building upon it, a multi-scale approach is further developed for large-scale PW-NI coils. The validity of the proposed method and its multi-scale extension is verified through comparisons with experimental measurements and field-circuit coupled modeling results. Comparative analyses demonstrate that the PCR-TA method achieves a speedup of approximately 2.4 over the field-circuit coupled method, whereas its multi-scale extension further increases this speedup to roughly 5.8. Furthermore, the PCR-TA method is extended to model the continuous transition of PW-NI coils from power-supply charging to closed-loop operation. This work provides an efficient method and tool for the electromagnetic modeling of PW-NI coils under both driven and closed-loop operating conditions.
Optimal Control for Steady Circulation of a Diffusion Process via Spectral Decomposition of Fokker-Planck Equation
We present a formulation of an optimal control problem for a two-dimensional diffusion process governed by a Fokker-Planck equation to achieve a nonequilibrium steady state with a desired circulation while accelerating convergence toward the stationary distribution. To achieve the control objective, we add costs on both the probability density function and the flux rotation to the objective functional. We formulate the optimal control problem through dimensionality reduction of the Fokker-Planck equation via eigenfunction expansion, which incurs a low computational cost. Numerical simulations demonstrate that the proposed optimal control achieves the desired circulation while accelerating convergence to the stationary distribution.
comment: 6 pages, 5 figures. Submitted to IEEE Control Systems Letters (L-CSS) and CDC 2026
Distributed Unknown Input Observer Design: A Geometric Approach
We present a geometric approach to designing distributed unknown input observers (DUIOs) for linear time-invariant systems, where measurements are distributed across nodes and each node is influenced by unknown inputs through distinct channels. The proposed distributed estimation scheme consists of a network of observers, each tasked with reconstructing the entire system state despite having access only to local input-output signals that are individually insufficient for full state observation. Unlike existing methods that impose stringent rank conditions on the input and output matrices at each node, our approach leverages the $(C,A)$-invariant (conditioned invariant) subspace at each node from a geometric perspective. This enables the design of DUIOs in both continuous- and discrete-time settings under relaxed conditions, for which we establish sufficiency and necessity. The effectiveness of our methodology is demonstrated through extensive simulations, including a practical case study on a power grid system.
Trajectory Landscapes for Therapeutic Strategy Design in Agent-Based Tumor Microenvironment Models
Multiplex tissue imaging (MTI) enables high-dimensional, spatially resolved measurements of the tumor microenvironment (TME), but most clinical datasets are temporally undersampled and longitudinally limited, restricting direct inference of underlying spatiotemporal dynamics and effective intervention timing. Agent-based models (ABMs) provide mechanistic, stochastic simulators of TME evolution; yet their high-dimensional state space and uncertain parameterization make direct control design challenging. This work presents a reduced-order, simulation-driven framework for therapeutic strategy design using ABM-derived trajectory ensembles. Starting from a nominal ABM, we systematically perturb biologically plausible parameters to generate a set of simulated trajectories and construct a low-dimensional trajectory landscape describing TME evolution. From time series of spatial summary statistics extracted from the simulations, we learn a probabilistic Markov State Model (MSM) that captures metastable states and the transitions between them. To connect simulation dynamics with clinical observations, we map patient MTI snapshots onto the landscape and assess concordance with observed spatial phenotypes and clinical outcomes. We further show that conditioning the MSM on dominant governing parameters yields group-specific transition models to formulate a finite-horizon Markov Decision Process (MDP) for treatment scheduling. The resulting framework enables simulation-grounded therapeutic policy design for partially observed biological systems without requiring longitudinal patient measurements.
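As a rough illustration of the MSM step described above, the sketch below estimates a row-stochastic transition matrix from discretized state sequences by counting observed transitions. It assumes the trajectories have already been mapped to integer state labels; how the abstract's spatial summary statistics are discretized is not stated, so the function name `estimate_transition_matrix` and the toy sequences are hypothetical.

```python
import numpy as np

def estimate_transition_matrix(trajectories, n_states):
    """Count-based (maximum-likelihood) estimate of a Markov transition matrix.

    trajectories: iterable of integer state sequences (assumed pre-discretized).
    Rows with no observed outgoing transitions are left as all zeros.
    """
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        # Tally each consecutive pair (state at t, state at t+1).
        for i, j in zip(traj[:-1], traj[1:]):
            counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Normalize each row by its total, skipping rows that were never visited.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy example with two states and two short simulated trajectories.
toy_trajectories = [[0, 0, 1, 1, 0], [1, 0, 0]]
P = estimate_transition_matrix(toy_trajectories, n_states=2)
```

Each row of `P` then gives the estimated probability of moving from one state to every other state in a single time step.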
Offload or Overload: A Platform Measurement Study of Mobile Robotic Manipulation Workloads
Mobile robotic manipulation--the ability of robots to navigate spaces and interact with objects--is a core capability of physical AI. Foundation models have led to breakthroughs in their performance, but at a significant computational cost. We present the first measurement study of mobile robotic manipulation workloads across onboard, edge, and cloud GPU platforms. We find that the full workload stack is infeasible to run on smaller onboard GPUs, while larger onboard GPUs drain robot batteries several hours faster. Offloading alleviates these constraints but introduces its own challenges, as additional network latency degrades task accuracy, and the bandwidth requirement makes naive cloud offloading impractical. Finally, we quantify opportunities and pitfalls of sharing compute across robot fleets. We believe our measurement study will be crucial to designing inference systems for mobile robots.
comment: 15 pages, 17 figures
Delay-Robust Primal-Dual Dynamics for Distributed Optimization
Continuous-time primal-dual gradient dynamics (PDGD) is a ubiquitous approach for dynamically solving constrained distributed optimization problems. Yet, the distributed nature of the dynamics makes it prone to communication uncertainties, especially time delays. To mitigate this effect, we propose a delay-robust continuous-time PDGD. The dynamics is obtained by augmenting the standard PDGD with an auxiliary state coupled through a gain matrix, while preserving the optimal solution. Then, we present sufficient tuning conditions for this gain matrix in the form of linear matrix inequalities, which ensure uniform asymptotic stability in the presence of bounded, time-varying delays. The criterion is derived via the Lyapunov-Krasovskii method. A numerical example illustrates the improved delay robustness of our approach compared to the standard PDGD under large, time-varying delays.
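For orientation, here is a minimal sketch of the standard, delay-free PDGD that the abstract uses as its baseline, integrated with forward Euler on an equality-constrained quadratic program. The problem data, step size, iteration count, and function name `pdgd` are illustrative assumptions. The delay-robust augmentation with an auxiliary state and gain matrix is not reproduced here.

```python
import numpy as np

def pdgd(Q, c, A, b, step=0.01, iters=20000):
    """Forward-Euler integration of primal-dual gradient dynamics for
    min 0.5 * x^T Q x + c^T x  subject to  A x = b  (no delays modeled)."""
    x = np.zeros(Q.shape[0])
    lam = np.zeros(A.shape[0])
    for _ in range(iters):
        dx = -(Q @ x + c) - A.T @ lam  # primal variable descends the Lagrangian
        dlam = A @ x - b               # dual variable ascends the Lagrangian
        x = x + step * dx
        lam = lam + step * dlam
    return x, lam

# Toy problem: minimize 0.5 * ||x||^2 subject to x1 + x2 = 1.
Q = np.eye(2)
c = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x_star, lam_star = pdgd(Q, c, A, b)
```

For this toy problem the KKT conditions give x = (0.5, 0.5) with multiplier -0.5, which the iteration approaches.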
Convergence of Payoff-Based Higher-Order Replicator Dynamics in Contractive Games
We study the convergence properties of a payoff-based higher-order version of replicator dynamics, a widely studied model in evolutionary dynamics and game-theoretic learning, in contractive games. Recent work has introduced a control-theoretic perspective for analyzing the convergence of learning dynamics through passivity theory, leading to a classification of learning dynamics based on the passivity notion they satisfy, such as δ-passivity, equilibrium-independent passivity, and incremental passivity. We leverage this framework for the study of higher-order replicator dynamics for contractive games, which form the complement of passive learning dynamics. Standard replicator dynamics can be represented as a cascade interconnection between an integrator and the softmax mapping. Payoff-based higher-order replicator dynamics include a linear time-invariant (LTI) system in parallel with the existing integrator. First, we show that if this added system is strictly passive and asymptotically stable, then the resulting learning dynamics converge locally to the Nash equilibrium in contractive games. Second, we establish global convergence properties using incremental stability analysis for the special case of symmetric matrix contractive games.
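The cascade view of standard replicator dynamics mentioned above can be sketched as follows: payoffs are integrated into a score vector, and the strategy is the softmax of that score. This is a minimal forward-Euler illustration only. The payoff matrix, step size, and function names are assumptions, and the higher-order variant with a parallel LTI system is not implemented.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax mapping scores to the probability simplex."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def replicator_cascade(payoff_matrix, x0, step=0.01, iters=5000):
    """Replicator dynamics as integrator -> softmax (forward-Euler sketch)."""
    z = np.log(x0)      # softmax(log x0) equals x0 for x0 in the simplex interior
    x = softmax(z)
    for _ in range(iters):
        p = payoff_matrix @ x  # payoff vector at the current strategy
        z = z + step * p       # integrator block: z_dot = p
        x = softmax(z)         # static softmax block
    return x

# Toy game (assumed): strategy 0 earns payoff 2 and strategy 1 earns 1, always.
payoffs = np.array([[2.0, 2.0], [1.0, 1.0]])
x_final = replicator_cascade(payoffs, np.array([0.5, 0.5]))
```

In this toy game strategy 0 strictly dominates, so the strategy mass concentrates on it as the integrated payoff gap grows.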
Minimum Energy Cruise of All-Electric Aircraft with Applications to Advanced Air Mobility
Electrified propulsion is expected to play an important role in the sustainable development of Advanced Air Mobility (AAM). However, the limited energy density of batteries motivates the need to minimize energy consumption during flight. This paper studies the minimum total energy problem for an all-electric aircraft in steady cruise flight. The problem is formulated as an optimal control problem in which the cruise airspeed and final cruise time are optimization variables. The battery supply voltage is modeled as an affine function of the battery charge. Pontryagin's Minimum Principle is used to derive the necessary and sufficient conditions for optimality, from which closed-form expressions for the optimal cruise airspeed and optimal final cruise time are obtained. Additional analytical conditions are derived that determine when all-electric operation is feasible, one of which is that sufficient electric charge must be available. Numerical simulations based on the BETA Technologies CX300 all-electric aircraft and a representative AAM scenario illustrate how the aircraft weight, cruising altitude, electrical system efficiency, and initial battery charge influence the optimal airspeed and the feasibility of all-electric cruise.
comment: 17 pages, 3 figures, submitted to Aerospace Systems special issue on Low-altitude Economy
Don't Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows
Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less technical or non-technical users. It supports incremental, interactive notebook-style development, and each step is converted to code with a required set of functions and behavior to enable incremental building of workflows. Agents are invoked only for code generation and error recovery, not orchestration or task execution. This agent-supported but code-first approach to workflows, along with the context engineering used in Skele-Code, can help reduce token costs compared to the multi-agent system approach to executing workflows. Skele-Code produces modular, easily extensible, and shareable workflows. The generated workflows can also be used as skills by agents, or as steps in other workflows.
comment: Main paper 9 pages. Topics: Agentic Coding, HCI, LLMs, Workflows
Token Economy for Fair and Efficient Dynamic Resource Allocation in Congestion Games
Self-interested behavior in sharing economies often leads to inefficient aggregate outcomes compared to a centrally coordinated allocation, ultimately harming users. Yet, centralized coordination removes individual decision power. This issue can be addressed by designing rules that align individual preferences with system-level objectives. Unfortunately, rules based on conventional monetary mechanisms introduce unfairness by discriminating among users based on their wealth. To solve this problem, in this paper, we propose a token-based mechanism for congestion games that achieves efficient and fair dynamic resource allocation. Specifically, we model the token economy as a continuous-time dynamic game with finitely many boundedly rational agents, explicitly capturing their evolutionary policy-revision dynamics. We derive a mean-field approximation of the finite-population game and establish strong approximation guarantees between the mean-field and the finite-population games. This approximation enables the design of integer tolls in closed form that provably steer the aggregate dynamics toward an optimal efficient and fair allocation from any initial condition.
Robust Global Position and Heading Tracking on SE(3) via Saturated Hybrid Feedback
This letter presents a novel control solution to the robust global position and heading tracking problem for underactuated vehicles, equipped with single-axis thrust and full torque actuation, operating under strict, user-defined actuation limits. The architecture features a saturated position tracking controller augmented with two first-order filters. This formulation ensures the boundedness of the first and second derivatives, yielding less conservative bounds and systematically generating bounded attitude references whose limits are easily tuned via design parameters. To track these dynamic references, the inner loop comprises a saturated, modified Rodrigues parameter (MRP)-based controller paired with a hybrid dynamic path-lifting mechanism. This approach allows the attitude tracking law to be designed on a covering space of the configuration manifold. By leveraging a stability equivalence framework, the methodology establishes that the resulting interconnected system achieves robust global asymptotic and semi-global exponential tracking on SE(3), while complying with user-defined input saturation bounds. Numerical simulations validate the proposed solution.
Joint Deployment and Beamforming Design of Aerial STAR-RIS Aided Networks with Reinforcement Learning
Aerial simultaneous transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) enable full-space coverage in dynamic wireless networks. However, most existing works assume fixed user grouping, overlooking the fact that STAR-RIS deployment inherently determines whether users are served via transmission or reflection. To address this, we propose a joint deployment and beamforming framework, where an aerial STAR-RIS dynamically adjusts its location and orientation to adaptively control user grouping and enhance hybrid beamforming. We formulate a Markov decision process (MDP) capturing the coupling among deployment, grouping, and signal design. To solve the resulting non-convex and time-varying problem, we develop a PPO-based reinforcement learning algorithm that adaptively balances user grouping and beamforming resources through online policy learning. Simulation results show 57.1% and 285% sum-rate gains over fixed-deployment and RIS-free baselines, respectively, demonstrating the benefit of user-grouping-aware control in STAR-RIS-aided systems.
comment: 6 pages, 7 figures
Data-Driven Robust Predictive Control with Interval Matrix Uncertainty Propagation
This paper presents a new data-driven robust predictive control law, for linear systems affected by unknown-but-bounded process disturbances. A sequence of input-state data is used to construct a suitable uncertainty representation based on interval matrices. Then, the effect of uncertainty along the prediction horizon is bounded through an operator leveraging matrix zonotopes. This yields a tube that is exploited within a variable-horizon optimal control problem, to guarantee robust satisfaction of state and input constraints. The resulting data-driven predictive control scheme is proven to be recursively feasible and practically stable. A numerical example shows that the proposed approach compares favorably to existing methods based on zonotopic tubes.
Rethinking Static Line Rating for Economic and Efficient Power Operation in South Korea
In South Korea, the power grid is currently operated based on the static line rating (SLR) method, where the transmission line capacity is determined based on extreme weather conditions. However, with global warming, there is a concern that summer temperatures may exceed the SLR criteria, posing safety risks. On the other hand, the conservative estimates used for winter conditions limit the utilization of renewable energy. Proposals to install new lines face significant financial and environmental hurdles, complicating efforts to adapt to these changing conditions. Dynamic Line Rating (DLR) offers a real-time solution but requires extensive weather monitoring and complex integration. This paper proposes a novel method that improves on SLR by analyzing historical data to refine line rating criteria on a monthly, seasonal, and semi-annual basis. Through simulations, we show that our approach significantly enhances the cost effectiveness and reliability of the power system, achieving efficiencies close to DLR with existing infrastructure. This method offers a practical alternative to overcome the limitations of SLR and the implementation challenges of DLR.
Motion Planning with Precedence Specifications via Augmented Graphs of Convex Sets
We present an algorithm for planning trajectories that avoid obstacles and satisfy key-door precedence specifications expressed with a fragment of signal temporal logic. Our method includes a novel exact convex partitioning of the obstacle-free space that encodes connectivity among convex free-space sets, key sets, and door sets. We then construct an augmented graph of convex sets that exactly encodes the key-door precedence specifications. By solving a shortest path problem in this augmented graph of convex sets, our pipeline provides an exact solution up to a finite parameterization of the trajectory. To illustrate the effectiveness of our approach, we present a method to generate key-door mazes that provide challenging problem instances, and we perform numerical experiments to evaluate the proposed pipeline. Our pipeline is faster by several orders of magnitude than recent state-of-the-art methods that use general-purpose temporal logic tools.
Predicting power grid frequency dynamics with invertible Koopman-based architectures
The system frequency is a critical measure of power system stability, and understanding and modeling it are key to ensuring reliable power system operations. Koopman-based autoencoders are effective at approximating complex nonlinear data patterns, with potential applications in the frequency dynamics of power systems. However, their non-invertibility can result in a distorted latent representation, leading to significant prediction errors. Invertible neural networks (INNs) in combination with the Koopman operator framework provide a promising approach to address these limitations. In this study, we analyze different INN architectures and train them on simulation datasets. We further apply extensions to the networks to address inherent limitations of INNs and evaluate their impact. We find that coupling-layer INNs achieve the best performance when used in isolation. In addition, we demonstrate that hybrid approaches can improve the performance when combined with suitable INNs, while reducing the generalization capabilities in combination with disadvantageous architectures. Overall, our results provide a clearer overview of how architectural choices influence INN performance, offering guidance for selecting and designing INNs for modeling power system frequency dynamics.
comment: Submitted to OSMSES 2026
A System-Theoretic Approach to Hawkes Process Identification with Guaranteed Positivity and Stability
The Hawkes process models self-exciting event streams, requiring a strictly non-negative and stable stochastic intensity. Standard identification methods enforce these properties using non-negative causal bases, yielding conservative parameter constraints and severely ill-conditioned least-squares Gram matrices at higher model orders. To overcome this, we introduce a system-theoretic identification framework utilizing the sign-indefinite orthonormal Laguerre basis, which guarantees a well-conditioned asymptotic Gram matrix independent of model order. We formulate a constrained least-squares problem enforcing the necessary and sufficient conditions for positivity and stability. By constructing the empirical Gram matrix via a Lyapunov equation and representing the constraints through a sum-of-squares trace equivalence, the proposed estimator is efficiently computed via semidefinite programming.
comment: 7 pages, 2 figures
Federated Causal Representation Learning in State-Space Systems for Decentralized Counterfactual Reasoning
Networks of interdependent industrial assets (clients) are tightly coupled through physical processes and control inputs, raising a key question: how would the output of one client change if another client were operated differently? This is difficult to answer because client-specific data are high-dimensional and private, making centralization of raw data infeasible. Each client also maintains proprietary local models that cannot be modified. We propose a federated framework for causal representation learning in state-space systems that captures interdependencies among clients under these constraints. Each client maps high-dimensional observations into low-dimensional latent states that disentangle intrinsic dynamics from control-driven influences. A central server estimates the global state-transition and control structure. This enables decentralized counterfactual reasoning, where clients predict how their outputs would change under alternative control inputs at other clients while exchanging only compact latent states. We prove convergence to a centralized oracle and provide privacy guarantees. Our experiments demonstrate scalability and accurate cross-client counterfactual inference on synthetic and real-world industrial control system datasets.
comment: Manuscript under review
Bridging the Sim-to-real Gap: A Control Framework for Imitation Learning of Model Predictive Control
To address the computational challenges of Model Predictive Control (MPC), recent research has studied using imitation learning to approximate MPC with a computationally efficient Deep Neural Network (DNN). However, this introduces a common issue in learning-based control, the simulation-to-reality (sim-to-real) gap. Inspired by Robust Tube MPC, this study proposes a new control framework that addresses this issue from a control perspective. The framework ensures the DNN operates in the same environment as the source domain, addressing the sim-to-real gap with great data collection efficiency. Moreover, an input refinement governor is introduced to address the DNN's inability to adapt to variations in model parameters, enabling the system to satisfy MPC constraints more robustly under parameter-changing conditions. The proposed framework was validated through two case studies: cart-pole control and vehicle collision avoidance control, which analyzed the principles of the proposed framework in detail and demonstrated its application to a vehicle control case.
comment: Published in International Journal of Control, Automation, and Systems, 2026. DOI: 10.1007/s12555-026-00040-7
A Control-Theoretic Foundation for Agentic Systems
This paper develops a control-theoretic framework for analyzing agentic systems embedded within feedback control loops, where an AI agent may adapt controller parameters, select among control strategies, invoke external tools, reconfigure decision architectures, and modify control objectives during operation. These capabilities are formalized by interpreting agency as hierarchical runtime decision authority over elements of the control architecture, leading to an augmented closed-loop representation in which physical states, internal memory, tool outputs, interaction signals, and design variables evolve as a coupled dynamical system. A five-level hierarchy of agency is defined, ranging from fixed control laws to runtime synthesis of control architectures and objectives. The analysis shows that increasing agency introduces interacting dynamical mechanisms such as time-varying adaptation, endogenous switching, decision-induced delays, and structural reconfiguration. The framework is developed in both nonlinear and linear settings, providing explicit design constraints for AI-enabled control systems in safety-critical applications.
RIS-Aided E2E Multi-Path Uplink Transmission Optimization for 6G Time-Sensitive Services
The Access Traffic Steering, Switching, and Splitting (ATSSS) feature defined in the latest 3GPP Release 19 enables traffic to flow over multiple access paths to achieve lower-latency end-to-end (E2E) delivery for 6G time-sensitive services. However, existing E2E multi-path operation often falls short of the more stringent QoS requirements of these services. This work proposes a Reconfigurable Intelligent Surface (RIS)-aided E2E multi-path uplink (UL) transmission architecture that explicitly accounts for both radio link latency and N3 backhaul latency, jointly designing the UL traffic-splitting ratio, transmit power, receive combining, and RIS phase shift under practical constraints to minimize the average E2E latency. We develop an alternating optimization framework that updates these parameters alternately. Simulations show that the proposed E2E optimization framework lowers the average E2E latency by up to 43% for a single user and 32% for the whole system compared with the baselines in our prior work [1].
comment: This work has been submitted to the IEEE for possible publication. 5 pages, 2 figures, journal paper
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions ICRA 2026
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety and are traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time rollouts. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy, both enforcing safer actions and biasing towards safer rewards, enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: To appear at ICRA 2026; sample code for the navigation example with CBF-RL reward core construction can be found at https://github.com/lzyang2000/cbf-rl-navigation-demo
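To make the idea of a closed-form safety filter concrete, the following is a minimal sketch for a single-integrator system avoiding a circular obstacle. The dynamics, the barrier function h, the gain `alpha`, and the function name `cbf_filter` are assumptions chosen for illustration. This is not the paper's formulation or the code in the linked repository.

```python
import numpy as np

def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Closed-form solution of
        min ||u - u_nom||^2  s.t.  grad_h(x) . u + alpha * h(x) >= 0
    for the assumed dynamics x_dot = u and barrier
    h(x) = ||x - center||^2 - radius^2 (h >= 0 means outside the obstacle).
    Assumes x != center so that grad_h is nonzero."""
    h = np.dot(x - center, x - center) - radius**2
    grad_h = 2.0 * (x - center)
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom  # nominal action already satisfies the CBF condition
    # Otherwise project u_nom onto the boundary of the safe half-space.
    return u_nom - slack * grad_h / (grad_h @ grad_h)

# Toy example: robot at (2, 0) heading straight at a unit disc at the origin.
x = np.array([2.0, 0.0])
center = np.zeros(2)
u_safe = cbf_filter(x, np.array([-5.0, 0.0]), center, radius=1.0)
u_kept = cbf_filter(x, np.array([1.0, 0.0]), center, radius=1.0)
```

The filter leaves the action untouched whenever the nominal command satisfies the barrier condition, and otherwise applies the smallest correction that makes the condition hold with equality.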
Constraint Learning in Multi-Agent Dynamic Games from Demonstrations of Local Nash Interactions
We present an inverse dynamic game-based algorithm to learn parametric constraints from a given dataset of local Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed-integer linear programs (MILP) encoding the Karush-Kuhn-Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the local Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods accurately inferred constraints and designed safe interactive motion plans for various classes of constraints, both convex and non-convex, from interaction demonstrations of agents with nonlinear dynamics.
NashOpt - A Python Library for Computing Generalized Nash Equilibria
NashOpt is an open-source Python library for computing and designing generalized Nash equilibria (GNEs) in noncooperative games with shared constraints and real-valued decision variables. The library exploits the joint Karush-Kuhn-Tucker (KKT) conditions of all players to handle both general nonlinear GNEs and linear-quadratic games, including their variational versions. Nonlinear games are solved via nonlinear least-squares formulations, relying on JAX for automatic differentiation. Linear-quadratic GNEs are reformulated as mixed-integer linear programs, enabling efficient computation of multiple equilibria. The framework also supports inverse-game and Stackelberg game-design problems. The capabilities of NashOpt are demonstrated through several examples, including noncooperative game-theoretic control problems of linear quadratic regulation and model predictive control. The library is available at https://github.com/bemporad/nashopt
comment: 24 pages, 7 figures
Physics-Informed Evolution: An Evolutionary Framework for Solving Quantum Control Problems Involving the Schrödinger Equation
Physics-Informed Neural Networks (PINNs) have demonstrated that embedding physical laws directly into the learning objective can significantly enhance the efficiency and physical consistency of neural network solutions. Inspired by this principle, we ask a natural question: can physical information be similarly embedded into the fitness function of evolutionary algorithms? In this work, we propose Physics-Informed Evolution (PIE), a novel framework that incorporates physical information derived from governing physical laws into the evolutionary fitness landscape, bridging the long-standing connection between learning and evolution in artificial intelligence. As a concrete instantiation, we apply PIE to quantum control problems governed by the Schrödinger equation, where the goal is to find optimal control fields that drive quantum systems from initial states to desired target states. We validate PIE on three representative quantum control benchmarks: state preparation in V-type three-level systems, entangled state generation in superconducting quantum circuits, and two-atom cavity QED systems, under varying levels of system uncertainty. Extensive comparisons against ten single-objective and five multi-objective evolutionary baselines demonstrate that PIE consistently achieves higher fidelity, lower state deviation, and improved robustness. Our results suggest that the physics-informed principle extends naturally beyond neural network training to the broader domain of evolutionary computation.
comment: 22 pages, 2 figures
Offline Reinforcement Learning via Inverse Optimization
Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called "sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and reliably recovers teacher behavior in MuJoCo benchmarks. The method achieves competitive results compared to widely-used baselines in sample-constrained settings, despite using orders of magnitude fewer parameters. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments. The code is available at https://github.com/TolgaOk/offlineRLviaIO.
comment: preprint
Learning Transferable Friction Models and LuGre Identification Via Physics-Informed Neural Networks
Accurately modeling friction in robotics remains a core challenge: robotics simulators like MuJoCo and PyBullet use simplified friction models or heuristics to balance computational efficiency with accuracy, and these simplifications can lead to substantial differences between simulated and physical performance. In this paper, we present a physics-informed friction estimation framework that enables the integration of well-established friction models with learnable components, requiring only minimal, generic measurement data. Our approach enforces physical consistency yet retains the flexibility to capture complex friction phenomena. We demonstrate, on an underactuated and nonlinear system, that the learned friction models, trained solely on small and noisy datasets, accurately reproduce dynamic friction properties with significantly higher fidelity than the simplified models commonly used in robotics simulators. Crucially, we show that our approach enables the learned models to be transferable to systems they are not trained on. This ability to generalize across multiple systems streamlines friction modeling for complex, underactuated tasks, offering a scalable and interpretable path toward improving friction model accuracy in robotics and control.
comment: 7 pages, 8 figures, Accepted to 2026 American Control Conference (ACC)
PGLib-CO2: A Power Grid Library for Real-Time Computation and Optimization of Carbon Emissions
Achieving a sustainable electricity infrastructure requires the explicit integration of carbon emissions into power system modeling and optimization. However, existing open-source test cases for power system research lack generator-level carbon profiling, preventing the benchmarking of carbon-aware operational strategies. To address this gap, this work introduces PGLib-CO2, an open-source extension to the PGLib-OPF test case library. The proposed PGLib-CO2 enriches standard grid test cases with CO2 and CO2-equivalent emission intensity factors to achieve realistic, generator-level carbon profiling with an expanded list of fuel types. Using the standardized data, PGLib-CO2 allows us to enhance the algorithms for computing key carbon emission metrics. We first utilize the differentiable programming paradigm for computing LMCE by treating the OPF-based grid dispatch as a differentiable layer. This method provides a rigorous marginal sensitivity for general convex cost functions, eliminating the need for numerical perturbation with a small incremental change. Moreover, to accelerate the real-time LMCE computation, we develop an MPP-based approach that shifts the optimization burden to an offline phase of identifying the OPF critical regions. Since each critical region is characterized by a pre-computed affine dispatch function, the online phase reduces to identifying the region and then efficiently evaluating the region-specific LMCE values. Numerical evaluations on IEEE test systems demonstrate that the differentiable LMCE computation attains precise sensitivity information, and the MPP-based approach retrieves the LMCE signals faster than the direct optimization approach. By bridging high-fidelity data with advanced parametric computation, PGLib-CO2 provides a reproducible and computationally efficient foundation for future research in sustainable power system operations.
EDMD-Based Robust Observer Synthesis for Nonlinear Systems
This paper presents a data-driven approach for designing state observers for continuous-time nonlinear systems, where an extended dynamic mode decomposition (EDMD) procedure is used to identify an approximate linear lifted model. Since such a model on a finite-dimensional space spanned by the dictionary functions has an inevitable mismatch, we first establish, based on our theory of reproducing kernel Hilbert space with a linear-radial kernel, that the nonlinear error magnitude in the approximate linear model is sectorially bounded by the lifted state. The sector bound comprises a deterministic part due to the finite dictionary and a stochastic part due to the random data samples, and the observer design needs to account for both of these errors in a robust formulation. Hence, the observer synthesis is performed using linear matrix inequalities (LMIs), specified by the desired exponential decay rate of the observation error (when the system is asymptotically stable) or the L2-gain from the modeling error to the observation error. Numerical studies demonstrate the effectiveness and flexibility of the proposed method. As such, this work entails an explicit elementary use of linear systems theory for nonlinear state observation in a Koopman operator-theoretic framework.
comment: 8 pages, 4 figures. Submitted to the 2026 65th IEEE Conference on Decision and Control (CDC) to be held in Honolulu, HI, USA
Quantifying resilience for distribution system customers with SALEDI
The impact of routine smaller outages on distribution system customers in terms of customer minutes interrupted can be tracked using conventional reliability indices. However, the customer minutes interrupted in large blackout events are extremely variable, and this makes it difficult to quantify the customer impact of these extreme events with resilience metrics. We solve this problem with the System Average Large Event Duration Index (SALEDI), which logarithmically transforms the customer minutes interrupted. We explain how this new resilience metric works, compare it with alternatives, quantify its statistical accuracy, and illustrate its practical use with standard outage data from five utilities.
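To illustrate why a logarithmic transform helps with highly variable event sizes, the following is a minimal Python sketch. The abstract does not give the exact SALEDI formula, so the function below is only an assumption: it averages log10 of customer minutes interrupted over a set of events. The function name and the event values are hypothetical and are not taken from the paper or from any utility data.

```python
import math


def mean_log_customer_minutes(customer_minutes):
    """Average of log10(customer minutes interrupted) over a list of events.

    Assumed form of a log-transformed index; not the paper's definition.
    """
    if not customer_minutes:
        raise ValueError("need at least one event")
    return sum(math.log10(cm) for cm in customer_minutes) / len(customer_minutes)


# Hypothetical large events (customer minutes interrupted), spanning two decades.
events = [1e5, 1e6, 1e7]

# The log-domain average weights each event equally (mean of 5, 6 and 7).
log_index = mean_log_customer_minutes(events)

# The plain arithmetic mean is dominated by the single largest event.
arithmetic_mean = sum(events) / len(events)
```

In this toy example the arithmetic mean sits close to the largest event, while the log-domain average reflects all three events, which is the kind of stabilizing effect a logarithmic transform is generally expected to provide.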
Switching-Reference Voltage Control for Distribution Systems with AI-Training Data Centers
Large-scale AI training workloads in modern data centers exhibit rapid and periodic power fluctuations, which may induce significant voltage deviations in power distribution systems. Existing voltage regulation methods, such as droop control, are primarily designed for slowly varying loads and may therefore be ineffective in mitigating these fast fluctuations. In addition, repeated control actions can incur substantial cost. To address this challenge, this paper proposes a decentralized switching-reference voltage control framework that exploits the structured behavior of AI training workloads. We establish conditions for voltage convergence and characterize an effective reference design that aligns with the two dominant operating levels of the AI training workload. The switching rule for voltage references is implemented solely using local voltage measurements, enabling simple local implementation while significantly reducing control effort. Simulation studies demonstrate that the proposed method substantially reduces both voltage deviations and reactive control effort, while remaining compatible with internal data center control strategies without requiring extensive coordination.
Bridging Earth and Space: A Survey on HAPS for Non-Terrestrial Networks
HAPS are emerging as key enablers in the evolution of 6G wireless networks, bridging terrestrial and non-terrestrial infrastructures. Operating in the stratosphere, HAPS can provide wide-area coverage, low-latency, energy-efficient broadband communications with flexible deployment options for diverse applications. This survey delivers a comprehensive overview of HAPS use cases, technologies, and integration strategies within the 6G ecosystem. The roles of HAPS in extending connectivity to underserved regions, supporting dynamic backhauling, enabling massive IoT, and delivering reliable low-latency communications for autonomous and immersive services are discussed. The paper reviews state-of-the-art architectures for terrestrial and non-terrestrial network integration and highlights recent field trials. Furthermore, key enabling technologies such as channel modeling, AI-driven resource allocation, interference control, mobility management, and energy-efficient communications are examined. The paper also outlines open research challenges. By addressing existing gaps in the literature, this survey positions HAPS as a foundational component of globally integrated, resilient, and sustainable 6G networks.
comment: 40 pages. This work has been submitted to IEEE Communications Surveys & Tutorials (under review)
Robotics
Early-Terminable Energy-Safe Iterative Coupling for Parallel Simulation of Port-Hamiltonian Systems
Parallel simulation and control of large-scale robotic systems often rely on partitioned time stepping, yet finite-iteration coupling can inject spurious energy by violating power consistency, even when each subsystem is passive. This letter proposes a novel energy-safe, early-terminable iterative coupling for port-Hamiltonian subsystems by embedding a Douglas-Rachford (DR) splitting scheme in scattering (wave) coordinates. The lossless interconnection is enforced as an orthogonal constraint in the wave domain, while each subsystem contributes a discrete-time scattering port map induced by its one-step integrator. Under a discrete passivity condition on the subsystem time steps and a mild impedance-tuning condition, we prove an augmented-storage inequality certifying discrete passivity of the coupled macro-step for any finite inner-iteration budget, with the remaining mismatch captured by an explicit residual. As the inner budget increases, the partitioned update converges to the monolithic discrete-time update induced by the same integrators, yielding a principled, adaptive accuracy-compute trade-off, supporting energy-consistent real-time parallel simulation under varying computational budgets. Experiments on a coupled-oscillator benchmark validate the passivity certificates at numerical roundoff (on the order of 10e-14 in double precision) and show that the reported RMS state error decays monotonically with increasing inner-iteration budgets, consistent with the hard-coupling limit.
Onboard MuJoCo-based Model Predictive Control for Shipboard Crane with Double-Pendulum Sway Suppression
Transferring heavy payloads in maritime settings relies on efficient crane operation, limited by hazardous double-pendulum payload sway. This sway motion is further exacerbated in offshore environments by external perturbations from wind and ocean waves. Manual suppression of these oscillations on an underactuated crane system by human operators is challenging. Existing control methods struggle in such settings, often relying on simplified analytical models, while deep reinforcement learning (RL) approaches tend to generalise poorly to unseen conditions. Deploying a predictive controller onto compute-constrained, highly non-linear physical systems without relying on extensive offline training or complex analytical models remains a significant challenge. Here we show a complete real-time control pipeline centered on the MuJoCo MPC framework that leverages a cross-entropy method planner to evaluate candidate action sequences directly within a physics simulator. By using simulated rollouts, this sampling-based approach successfully reconciles the conflicting objectives of dynamic target tracking and sway damping without relying on complex analytical models. We demonstrate that the controller can run effectively on resource-constrained embedded hardware, while outperforming traditional PID and RL baselines in counteracting external base perturbations. Furthermore, our system demonstrates robustness even when subjected to unmodeled physical discrepancies like the introduction of a second payload.
comment: 8 pages, 5 figures
Controlling Fish Schools via Reinforcement Learning of Virtual Fish Movement
This study investigates a method to guide and control fish schools using virtual fish trained with reinforcement learning. We utilize 2D virtual fish displayed on a screen to overcome technical challenges such as durability and movement constraints inherent in physical robotic agents. To address the lack of detailed behavioral models for real fish, we adopt a model-free reinforcement learning approach. First, simulation results show that reinforcement learning can acquire effective movement policies even when simulated real fish frequently ignore the virtual stimulus. Second, real-world experiments with live fish confirm that the learned policy successfully guides fish schools toward specified target directions. Statistical analysis reveals that the proposed method significantly outperforms baseline conditions, including the absence of stimulus and a heuristic "stay-at-edge" strategy. This study provides an early demonstration of how reinforcement learning can be used to influence collective animal behavior through artificial agents.
comment: English translation of the author's 2018 bachelor's thesis. Keywords: fish schooling, reinforcement learning, collective behavior, artificial agents, swarm-machine interaction
Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy
Striking a balance between efficiency and transparent motion is a core challenge in human-robot collaboration, as highly expressive movements often incur unnecessary time and energy costs. In collaborative environments, legibility allows a human observer a better understanding of the robot's actions, increasing safety and trust. However, these behaviors result in sub-optimal and exaggerated trajectories that are redundant in low-ambiguity scenarios where the robot's goal is already obvious. To address this trade-off, we propose Style-Conditioned Diffusion Policy (SCDP), a modular framework that constrains the trajectory generation of a pre-trained diffusion model toward either legibility or efficiency based on the environment's configuration. Our method utilizes a post-training pipeline that freezes the base policy and trains a lightweight scene encoder and conditioning predictor to modulate the diffusion process. At inference time, an ambiguity detection module activates the appropriate conditioning, prioritizing expressive motion only for ambiguous goals and reverting to efficient paths otherwise. We evaluate SCDP on manipulation and navigation tasks, and results show that it enhances legibility in ambiguous settings while preserving optimal efficiency when legibility is unnecessary, all without retraining the base policy.
comment: Submitted to the 18th International Conference on Social Robotics (ICSR 2026)
Faulty Coffees: Barriers to Adoption of an In-the-wild Robo-Barista
We set out to study whether task-based narratives could influence long-term engagement with a service robot. To do so, we deployed a Robo-Barista for five weeks in an over-50s housing complex in Stockton, England. Residents received a free daily coffee by interacting with a Furhat robot assigned to either a narrative or non-narrative dialogue condition. Despite designing for sustained engagement, repeat interaction was low, and we encountered curiosity trials without retention, technical breakdowns, accessibility barriers, and the social dynamics of a housing complex setting. Rather than treating these as peripheral issues, we foreground them in this paper. We reflect on the in-the-wild realities of our experiment and offer lessons for conducting longitudinal Human-Robot Interaction research when studies unravel in practice.
comment: Accepted for publication in Failing Forward, Design and Deployment Lessons from Real-World Human-Robot Interaction Workshop at HRI 2026, March 16, 2026, Edinburgh, Scotland
ADAPT: Adaptive Dual-projection Architecture for Perceptive Traversal
Agile humanoid locomotion in complex 3D environments requires balancing perceptual fidelity with computational efficiency, yet existing methods typically rely on rigid sensing configurations. We propose ADAPT (Adaptive dual-projection architecture for perceptive traversal), which represents the environment using a horizontal elevation map for terrain geometry and a vertical distance map for traversable-space constraints. ADAPT further treats its spatial sensing range as a learnable action, enabling the policy to expand its perceptual horizon during fast motion and contract it in cluttered scenes for finer local resolution. Compared with voxel-based baselines, ADAPT drastically reduces observation dimensionality and computational overhead while substantially accelerating training. Experimentally, it achieves successful zero-shot transfer to a Unitree G1 Humanoid and significantly outperforms fixed-range baselines, yielding highly robust traversal across diverse 3D environmental challenges.
Toward Deep Representation Learning for Event-Enhanced Visual Autonomous Perception: the eAP Dataset
Recent visual autonomous perception systems achieve remarkable performances with deep representation learning. However, they fail in scenarios with challenging illumination. While event cameras can mitigate this problem, there is a lack of a large-scale dataset to develop event-enhanced deep visual perception models in autonomous driving scenes. To address the gap, we present the eAP (event-enhanced Autonomous Perception) dataset, the largest dataset with event cameras for autonomous perception. We demonstrate how eAP can facilitate the study of different autonomous perception tasks, including 3D vehicle detection and object time-to-contact (TTC) estimation, through deep representation learning. Based on eAP, we demonstrate the first successful use of events to improve a popular 3D vehicle detection network in challenging illumination scenarios. eAP also enables a devoted study of the representation learning problem of object TTC estimation. We show how a geometry-aware representation learning framework leads to the best event-based object TTC estimation network that operates at 200 FPS. The dataset, code, and pre-trained models will be made publicly available for future research.
OGScene3D: Incremental Open-Vocabulary 3D Gaussian Scene Graph Mapping for Scene Understanding
Open-vocabulary scene understanding is crucial for robotic applications, enabling robots to comprehend complex 3D environmental contexts and supporting various downstream tasks such as navigation and manipulation. However, existing methods require pre-built complete 3D semantic maps to construct scene graphs for scene understanding, which limits their applicability in robotic scenarios where environments are explored incrementally. To address this challenge, we propose OGScene3D, an open-vocabulary scene understanding system that achieves accurate 3D semantic mapping and scene graph construction incrementally. Our system employs a confidence-based Gaussian semantic representation that jointly models semantic predictions and their reliability, enabling robust scene modeling. Building on this representation, we introduce a hierarchical 3D semantic optimization strategy that achieves semantic consistency through local correspondence establishment and global refinement, thereby constructing globally consistent semantic maps. Moreover, we design a long-term global optimization method that leverages temporal memory of historical observations to enhance semantic predictions. By integrating 2D-3D semantic consistency with Gaussian rendering contribution, this method continuously refines the semantic understanding of the entire scene. Furthermore, we develop a progressive graph construction approach that dynamically creates and updates both nodes and semantic relationships, allowing continuous updating of the 3D scene graphs. Extensive experiments on widely used datasets and real-world scenes demonstrate the effectiveness of our OGScene3D on open-vocabulary scene understanding.
Agile Interception of a Flying Target using Competitive Reinforcement Learning
This article presents a solution to intercept an agile drone by another agile drone carrying a catching net. We formulate the interception as a Competitive Reinforcement Learning problem, where the interceptor and the target drone are controlled by separate policies trained with Proximal Policy Optimization (PPO). We introduce a high-fidelity simulation environment that integrates a realistic quadrotor dynamics model and a low-level control architecture implemented in JAX, which allows for fast parallelized execution on GPUs. We train the agents using low-level control, collective thrust and body rates, to achieve agile flights both for the interceptor and the target. We compare the performance of the trained policies in terms of catch rate, time to catch, and crash rate, against common heuristic baselines and show that our solution outperforms these baselines for interception of agile targets. Finally, we demonstrate the performance of the trained policies in a scaled real-world scenario using agile drones inside an indoor flight arena.
GenZ-LIO: Generalizable LiDAR-Inertial Odometry Beyond Indoor-Outdoor Boundaries
Light detection and ranging (LiDAR)-inertial odometry (LIO) enables accurate localization and mapping for autonomous navigation in various scenes. However, its performance remains sensitive to variations in spatial scale, which refers to the spatial extent of the scene reflected in the distribution of point ranges in a LiDAR scan. Transitions between confined indoor and expansive outdoor spaces induce substantial variations in point density, which may reduce robustness and computational efficiency. To address this issue, we propose GenZ-LIO, a LIO framework generalizable across both indoor and outdoor environments. GenZ-LIO comprises three key components. First, inspired by the principle of the proportional-integral-derivative (PID) controller, it adaptively regulates the voxel size for downsampling via feedback control, driving the voxelized point count toward a scale-informed setpoint while enabling stable and efficient processing across varying scene scales. Second, we formulate a hybrid-metric state update that jointly leverages point-to-plane and point-to-point residuals to mitigate LiDAR degeneracy arising from directionally insufficient geometric constraints. Third, to alleviate the computational burden introduced by point-to-point matching, we introduce a voxel-pruned correspondence search strategy that discards non-promising voxel candidates and reduces unnecessary computations. Experimental results demonstrate that GenZ-LIO achieves robust odometry estimation and improved computational efficiency across confined indoor, open outdoor, and transitional environments. Our code will be made publicly available upon publication.
comment: 19 pages, 11 figures
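The feedback idea behind the adaptive voxel size can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the GenZ-LIO implementation: the class name `VoxelSizePID`, the gains, the normalized error, the additive update, and the size bounds are all hypothetical choices made for the example.

```python
class VoxelSizePID:
    """PID-style regulator for a downsampling voxel size (illustrative only)."""

    def __init__(self, kp, ki, kd, setpoint, min_size=0.05, max_size=2.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # desired point count after voxelization
        self.min_size = min_size      # hypothetical bounds, in metres
        self.max_size = max_size
        self.integral = 0.0
        self.prev_error = None

    def update(self, voxel_size, point_count):
        # Normalized error: positive when too many points survive downsampling,
        # in which case a larger voxel is needed to thin the cloud.
        error = (point_count - self.setpoint) / self.setpoint
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        correction = (self.kp * error
                      + self.ki * self.integral
                      + self.kd * derivative)
        # Clamp so the voxel size stays within the allowed range.
        return min(self.max_size, max(self.min_size, voxel_size + correction))


# Dense indoor-like scan: more points than the setpoint, so the voxel grows.
grown = VoxelSizePID(0.2, 0.05, 0.0, setpoint=1000).update(0.5, point_count=2000)

# Sparse outdoor-like scan: fewer points than the setpoint, so the voxel shrinks.
shrunk = VoxelSizePID(0.2, 0.05, 0.0, setpoint=1000).update(0.5, point_count=500)
```

In a real pipeline the controller would be called once per scan with the point count measured after downsampling, so the voxel size tracks changes in scene scale gradually instead of jumping between fixed presets.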
MG-Grasp: Metric-Scale Geometric 6-DoF Grasping Framework with Sparse RGB Observations
Single-view RGB-D grasp detection remains a common choice in 6-DoF robotic grasping systems, which typically requires a depth sensor. While RGB-only 6-DoF grasp methods have been studied recently, their inaccurate geometric representation is not directly suitable for physically reliable robotic manipulation, thereby hindering reliable grasp generation. To address these limitations, we propose MG-Grasp, a novel depth-free 6-DoF grasping framework that achieves high-quality object grasping. Leveraging a two-view 3D foundation model with camera intrinsics and extrinsics, our method reconstructs metric-scale and multi-view consistent dense point clouds from sparse RGB images and generates stable 6-DoF grasps. Experiments on the GraspNet-1Billion dataset and in the real world demonstrate that MG-Grasp achieves state-of-the-art (SOTA) grasp performance among RGB-based 6-DoF grasping methods.
comment: 8 pages, 5 figures
Industrial cuVSLAM Benchmark & Integration
This work presents a comprehensive benchmark evaluation of visual odometry (VO) and visual SLAM (VSLAM) systems for mobile robot navigation in real-world logistical environments. We compare multiple visual odometry approaches across controlled trajectories covering translational, rotational, and mixed motion patterns, as well as a large-scale production facility dataset spanning approximately 1.7 km. Performance is evaluated using Absolute Pose Error (APE) against ground truth from a Vicon motion capture system and a LiDAR-based SLAM reference. Our results show that a hybrid stack combining the cuVSLAM front-end with a custom SLAM back-end achieves the strongest mapping accuracy, motivating a deeper integration of cuVSLAM as the core VO component in our robotics stack. We further validate this integration by deploying and testing the cuVSLAM-based VO stack on an NVIDIA Jetson platform.
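For readers unfamiliar with the metric, the translational part of Absolute Pose Error is commonly summarized as a root-mean-square error over positions. The Python sketch below assumes the estimated and reference trajectories are already time-synchronized and expressed in the same frame; it omits the alignment step and the rotational component, and the function name and sample values are hypothetical rather than taken from the benchmark.

```python
import math


def ape_translation_rmse(estimated, reference):
    """RMSE of per-pose position error between two aligned trajectories.

    Each trajectory is a sequence of (x, y, z) positions with matching indices.
    """
    if not estimated or len(estimated) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - r) ** 2 for e, r in zip(est_pose, ref_pose))
        for est_pose, ref_pose in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical positions in metres: the middle pose is off by 0.5 m.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.4), (2.0, 0.0, 0.0)]
rmse = ape_translation_rmse(estimated, reference)
```

Because the errors are squared before averaging, a single large deviation raises the score more than several small ones, which is worth keeping in mind when comparing systems over long trajectories.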
Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors
We propose Ground Reaction Inertial Poser (GRIP), a method that reconstructs physically plausible human motion using four wearable devices. Unlike conventional IMU-only approaches, GRIP combines IMU signals with foot pressure data to capture both body dynamics and ground interactions. Furthermore, rather than relying solely on kinematic estimation, GRIP uses a digital twin of a person, in the form of a synthetic humanoid in a physics simulator, to reconstruct realistic and physically plausible motion. At its core, GRIP consists of two modules: KinematicsNet, which estimates body poses and velocities from sensor data, and DynamicsNet, which controls the humanoid in the simulator using the residual between the KinematicsNet prediction and the simulated humanoid state. To enable robust training and fair evaluation, we introduce a large-scale dataset, Pressure and Inertial Sensing for Human Motion and Interaction (PRISM), that captures diverse human motions with synchronized IMUs and insole pressure sensors. Experimental results show that GRIP outperforms existing IMU-only and IMU-pressure fusion methods across all evaluated datasets, achieving higher global pose accuracy and improved physical consistency.
Featurized Occupation Measures for Structured Global Search in Numerical Optimal Control
Numerical optimal control is commonly divided between globally structured but dimensionally intractable Hamilton-Jacobi-Bellman (HJB) methods and scalable but local trajectory optimization. We introduce the Featurized Occupation Measure (FOM), a finite-dimensional primal-dual interface for the occupation-measure formulation that unifies trajectory search and global HJB-type certification. FOM is broad yet numerically tractable, covering both explicit weak-form schemes and implicit simulator- or rollout-based sampling methods. Within this framework, approximate HJB subsolutions serve as intrinsic numerical certificates to directly evaluate and guide the primal search. We prove asymptotic consistency with the exact infinite-dimensional occupation-measure problem, and show that for block-organized feasible certificates, finite-dimensional approximation preserves certified lower bounds with blockwise error and complexity control. We also establish persistence of these lower bounds under time shifts and bounded model perturbations. Consequently, these structural properties render global certificates into flexible, reusable computational objects, establishing a systematic basis for certificate-guided optimization in nonlinear control.
PA-LVIO: Real-Time LiDAR-Visual-Inertial Odometry and Mapping with Pose-Only Bundle Adjustment
Real-time LiDAR-visual-inertial odometry and mapping is crucial for navigation and planning tasks in intelligent transportation systems. This study presents a pose-only bundle adjustment (PA) LiDAR-visual-inertial odometry (LVIO), named PA-LVIO, to meet the urgent need for real-time navigation and mapping. The proposed PA framework for LiDAR and visual measurements is highly accurate and efficient, and it can derive reliable frame-to-frame constraints within multiple frames. A marginalization-free and frame-to-map (F2M) LiDAR measurement model is integrated into the state estimator to eliminate odometry drifts. Meanwhile, an IMU-centric online spatial-temporal calibration is employed to obtain a pixel-wise LiDAR-camera alignment. With accurately estimated odometry and extrinsics, a high-quality and RGB-rendered point-cloud map can be built. Comprehensive experiments are conducted on both public and private datasets collected with a wheeled robot, an unmanned aerial vehicle (UAV), and handheld devices, comprising 28 sequences and more than 50 km of trajectories. The results demonstrate that the proposed PA-LVIO yields superior or comparable performance to state-of-the-art LVIO methods in terms of odometry accuracy and mapping quality. In addition, PA-LVIO can run in real time on both a desktop PC and an onboard ARM computer.
comment: 14 pages, 10 figures
Enabling Dynamic Tracking in Vision-Language-Action Models via Time-Discrete and Time-Continuous Velocity Feedforward
While vision-language-action (VLA) models have shown great promise for robot manipulation, their deployment on rigid industrial robots remains challenging due to the inherent trade-off between compliance and responsiveness. Standard Behavior Cloning (BC) approaches predict discrete poses at low frequencies, omitting the velocity and acceleration feedforward terms typically used by low-level compliant controllers. This forces reliance on high stiffness for accurate tracking, thereby sacrificing safe contact dynamics. In this paper, we demonstrate the importance of integrating velocity feedforward terms into VLA policies to resolve this trade-off. We propose two methods for extracting velocity targets from VLAs: a time-discrete finite-difference approximation that serves as a highly effective bridge for existing models, and a continuous Cubic B-Spline action space that natively yields $C^2$ continuous trajectories for high-frequency control. Crucially, both approaches are strictly model-agnostic and compatible with any standard action-chunking architecture, requiring modifications only to teleoperation, data processing, and the low-level controller. We fine-tune the $π_{0.5}$ model and evaluate both of our approaches on a demanding, contact-rich cube-in-hole task. Our results indicate that incorporating the velocity feedforward term via finite differences significantly improves task execution speed, while the continuous B-Spline approach maintains high overall success rates and provides a foundation for smoother higher-order derivatives without compromising compliance.
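The time-discrete variant can be illustrated with a short Python sketch. It is a generic finite-difference example, not the authors' code: the function names, the forward-difference scheme, the choice to repeat the last velocity at the end of the chunk, and the scalar impedance-style law with gains `stiffness` and `damping` are all assumptions made for illustration.

```python
def finite_difference_velocities(positions, dt):
    """Forward-difference velocity targets for a chunk of predicted positions.

    The last velocity is repeated so the output has one entry per position
    (an assumption; holding or zeroing the final velocity are alternatives).
    """
    if len(positions) < 2 or dt <= 0:
        raise ValueError("need at least two positions and a positive time step")
    velocities = [
        [(nxt - cur) / dt for cur, nxt in zip(current, following)]
        for current, following in zip(positions, positions[1:])
    ]
    velocities.append(list(velocities[-1]))
    return velocities


def compliant_command(x_des, v_des, x, v, stiffness, damping):
    """Generic 1-D impedance-style law with a velocity feedforward target."""
    return stiffness * (x_des - x) + damping * (v_des - v)


# Hypothetical 2-D action chunk predicted at 2 Hz (dt = 0.5 s).
chunk = [[0.0, 0.0], [0.5, 0.0], [1.5, 0.25]]
velocity_targets = finite_difference_velocities(chunk, dt=0.5)

# With a velocity target, the damping term no longer resists the desired motion.
force = compliant_command(x_des=1.0, v_des=2.0, x=0.9, v=1.5,
                          stiffness=100.0, damping=20.0)
```

Without the feedforward velocity, the damping term would act against the full measured velocity, so accurate tracking would have to come from a larger stiffness gain, which is the trade-off the abstract describes.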
PanguMotion: Continuous Driving Motion Forecasting with Pangu Transformers
Motion forecasting is a core task in autonomous driving systems, aiming to accurately predict the future trajectories of surrounding agents to ensure driving safety. Existing methods typically process discrete driving scenes independently, neglecting the temporal continuity and historical context correlations inherent in real-world driving environments. This paper proposes PanguMotion, a motion forecasting framework for continuous driving scenarios that integrates Transformer blocks from the Pangu-1B large language model as feature enhancement modules into autonomous driving motion prediction architectures. We conduct experiments on the Argoverse 2 datasets processed by the RealMotion data reorganization strategy, transforming each independent scene into a continuous sequence to mimic real-world driving scenarios.
S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight
Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model's own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/
Enforcing Task-Specified Compliance Bounds for Humanoids via Anisotropic Lipschitz-Constrained Policies
Reinforcement learning (RL) has demonstrated substantial potential for humanoid bipedal locomotion and the control of complex motions. To cope with oscillations and impacts induced by environmental interactions, compliant control is widely regarded as an effective remedy. However, the model-free nature of RL makes it difficult to impose task-specified and quantitatively verifiable compliance objectives, and classical model-based stiffness designs are not directly applicable. Lipschitz-Constrained Policies (LCP), which regularize the local sensitivity of a policy via gradient penalties, have recently been used to smooth humanoid motions. Nevertheless, existing LCP-based methods typically employ a single scalar Lipschitz budget and lack an explicit connection to physically meaningful compliance specifications in real-world systems. In this study, we propose an anisotropic Lipschitz-constrained policy (ALCP) that maps a task-space stiffness upper bound to a state-dependent Lipschitz-style constraint on the policy Jacobian. The resulting constraint is enforced during RL training via a hinge-squared spectral-norm penalty, preserving physical interpretability while enabling direction-dependent compliance. Experiments on humanoid robots show that ALCP improves locomotion stability and impact robustness, while reducing oscillations and energy usage.
comment: Submitted to IEEE for possible publication, under review
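A hinge-squared spectral-norm penalty of the kind named in the ALCP abstract can be sketched as below. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the per-output-direction `budgets`, the row scaling that normalizes the bound to 1, and the power-iteration estimate of the largest singular value.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def spectral_norm(a: Matrix, iters: int = 200) -> float:
    """Estimate the largest singular value of `a` by power iteration on a^T a.

    A sketch only: it assumes the start vector is not orthogonal to the
    dominant right singular vector.
    """
    rows, cols = len(a), len(a[0])
    x = [1.0 / math.sqrt(cols)] * cols  # unit-norm start vector
    sigma = 0.0
    for _ in range(iters):
        y = [sum(a[i][j] * x[j] for j in range(cols)) for i in range(rows)]  # y = a x
        z = [sum(a[i][j] * y[i] for i in range(rows)) for j in range(cols)]  # z = a^T y
        norm_z = math.sqrt(sum(v * v for v in z))
        if norm_z == 0.0:
            return 0.0
        sigma = math.sqrt(sum(v * v for v in y))  # ||a x|| with ||x|| = 1
        x = [v / norm_z for v in z]
    return sigma


def hinge_squared_penalty(jacobian: Matrix, budgets: Sequence[float]) -> float:
    """Penalty max(0, ||W J||_2 - 1)^2 with W = diag(1 / budgets).

    Row i of the Jacobian is divided by its own budget, so each output
    direction can carry a different (anisotropic) sensitivity bound.
    """
    if any(b <= 0.0 for b in budgets):
        raise ValueError("budgets must be positive")
    scaled = [[v / b for v in row] for row, b in zip(jacobian, budgets)]
    excess = max(0.0, spectral_norm(scaled) - 1.0)
    return excess * excess


if __name__ == "__main__":
    jac = [[3.0, 0.0], [0.0, 1.0]]  # made-up policy Jacobian
    print(hinge_squared_penalty(jac, budgets=[2.0, 2.0]))
    print(hinge_squared_penalty(jac, budgets=[4.0, 2.0]))
```

The hinge leaves the policy unpenalized whenever the scaled Jacobian is inside the bound, so the regularizer only acts on violations.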
SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments
Humans routinely leverage semantic hints provided by signage to navigate to destinations within novel Large-Scale Indoor (LSI) environments, such as hospitals and airport terminals. However, this capability remains underexplored within the field of embodied navigation. This paper introduces a novel embodied navigation task, SignNav, which requires the agent to interpret semantic hints from signage and reason about the subsequent action based on the current observation. To facilitate research in this domain, we construct the LSI-Dataset for the training and evaluation of various SignNav agents. Dynamically changing semantic hints and sparse placement of signage in LSI environments present significant challenges to the SignNav task. To address these challenges, we propose the Spatial-Temporal Aware Transformer (START) model for end-to-end decision-making. The spatial-aware module grounds the semantic hint of signage into the physical world, while the temporal-aware module captures long-range dependencies between historical states and the current observation. Leveraging a two-stage training strategy with Dataset Aggregation (DAgger), our approach achieves state-of-the-art performance, recording an 80% Success Rate (SR) and 0.74 NDTW on the val-unseen split. Real-world deployment further demonstrates the practicality of our method in physical environments without a pre-built map.
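For readers unfamiliar with the NDTW score reported above, the sketch below computes normalized dynamic time warping in its commonly used form, exp(-DTW(R, Q) / (|R| * d_th)), where R is the reference path and Q the agent's path. The 2-D points and the success threshold `d_th = 3.0` are made-up values; the abstract does not state which threshold SignNav uses.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def dtw(reference: Sequence[Point], query: Sequence[Point]) -> float:
    """Dynamic time warping cost between two paths (Euclidean point distance)."""
    n, m = len(reference), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(reference[i - 1], query[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference: Sequence[Point], query: Sequence[Point], d_th: float) -> float:
    """Normalized DTW in (0, 1]; 1.0 means the query follows the reference exactly."""
    return math.exp(-dtw(reference, query) / (len(reference) * d_th))


if __name__ == "__main__":
    ref = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    agent = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]  # same shape, offset by 1 m
    print(ndtw(ref, agent, d_th=3.0))
```

Unlike Success Rate, this score rewards following the intended route and not only reaching the goal.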
SE(3)-LIO: Smooth IMU Propagation With Jointly Distributed Poses on SE(3) Manifold for Accurate and Robust LiDAR-Inertial Odometry
In estimating odometry accurately, an inertial measurement unit (IMU) is widely used owing to its high-rate measurements, which can be utilized to obtain motion information through IMU propagation. In this paper, we address the limitations of existing IMU propagation methods in terms of motion prediction and motion compensation. In motion prediction, the existing methods typically represent a 6-DoF pose by separating rotation and translation and propagate them on their respective manifolds, so that the rotational variation is not effectively incorporated into translation propagation. During motion compensation, the relative transformation between predicted poses is used to compensate for motion-induced distortion in other measurements, while inherent errors in the predicted poses introduce uncertainty in the relative transformation. To tackle these challenges, we represent and propagate the pose on the SE(3) manifold, where propagated translation properly accounts for rotational variation. Furthermore, we precisely characterize the relative transformation uncertainty by considering the correlation between predicted poses, and incorporate this uncertainty into the measurement noise during motion compensation. To this end, we propose a LiDAR-inertial odometry (LIO), referred to as SE(3)-LIO, that integrates the proposed IMU propagation and uncertainty-aware motion compensation (UAMC). We validate the effectiveness of SE(3)-LIO on diverse datasets. Our source code and additional material are available at: https://se3-lio.github.io/.
Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation
While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.
Large Reward Models: Generalizable Online Robot Reward Generation with Vision-Language Models
Reinforcement Learning (RL) has shown great potential in refining robotic manipulation policies, yet its efficacy remains strongly bottlenecked by the difficulty of designing generalizable reward functions. In this paper, we propose a framework for online policy refinement by adapting foundation VLMs into online reward generators. We develop a robust, scalable reward model based on a state-of-the-art VLM, trained on a large-scale, multi-source dataset encompassing real-world robot trajectories, human-object interactions, and diverse simulated environments. Unlike prior approaches that evaluate entire trajectories post-hoc, our method leverages the VLM to formulate a multifaceted reward signal comprising process, completion, and temporal contrastive rewards based on current visual observations. Initializing with a base policy trained via Imitation Learning (IL), we employ these VLM rewards to guide the model to correct sub-optimal behaviors in a closed-loop manner. We evaluate our framework on challenging long-horizon manipulation benchmarks requiring sequential execution and precise control. Crucially, our reward model operates in a purely zero-shot manner within these test environments. Experimental results demonstrate that our method significantly improves the success rate of the initial IL policy within just 30 RL iterations, demonstrating remarkable sample efficiency. This empirical evidence highlights that VLM-generated signals can provide reliable feedback to resolve execution errors, effectively eliminating the need for manual reward engineering and facilitating efficient online refinement for robot learning.
Ultrafast Sampling-based Kinodynamic Planning via Differential Flatness
Motion planning under dynamics constraints, i.e., kinodynamic planning, enables safe robot operation by generating dynamically feasible trajectories that the robot can accurately track. For high-DoF robots such as manipulators, sampling-based motion planners are commonly used, especially for complex tasks in cluttered environments. However, enforcing constraints on robot dynamics in such planners requires solving either challenging two-point boundary value problems (BVPs) or propagating robot dynamics over time, both of which are computational bottlenecks that drastically increase planning times. Meanwhile, recent efforts have shown that sampling-based motion planners can generate plans in microseconds using parallelization, but are limited to geometric paths. This paper develops AkinoPDF, a fast parallelized sampling-based kinodynamic motion planning technique for a broad class of differentially flat robot systems, including manipulators, ground and aerial vehicles, and more. Differential flatness allows us to transform the motion planning problem from the original state space to a flat output space, where an analytical time-parameterized solution of the BVP and dynamics integration can be obtained. A trajectory in the flat output space is then converted back to a closed-form dynamically feasible trajectory in the original state space, enabling fast validation via "single instruction, multiple data" parallelism. Our method is fast, exact, and compatible with any sampling-based motion planner. We extensively verify the effectiveness of our approach in both simulated benchmarks and real experiments with cluttered and dynamic environments, requiring mere microseconds to milliseconds of planning time.
comment: 16 pages, 9 figures, under review
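To illustrate why a flat output space admits an analytical BVP solution, the sketch below solves the two-point boundary value problem for a single flat output with a cubic polynomial and checks an acceleration limit in closed form. It is a toy example under stated assumptions, not the AkinoPDF algorithm: the double-integrator model, the cubic order, the fixed duration, and all function names are hypothetical.

```python
from typing import Tuple

Coeffs = Tuple[float, float, float, float]


def cubic_bvp(p0: float, v0: float, p1: float, v1: float, duration: float) -> Coeffs:
    """Coefficients of p(t) = a0 + a1 t + a2 t^2 + a3 t^3 matching both end states."""
    if duration <= 0.0:
        raise ValueError("duration must be positive")
    delta = p1 - p0
    a2 = (3.0 * delta - (2.0 * v0 + v1) * duration) / duration**2
    a3 = (-2.0 * delta + (v0 + v1) * duration) / duration**3
    return (p0, v0, a2, a3)


def evaluate(coeffs: Coeffs, t: float) -> Tuple[float, float, float]:
    """Position, velocity, and acceleration of the polynomial at time t."""
    a0, a1, a2, a3 = coeffs
    p = a0 + a1 * t + a2 * t**2 + a3 * t**3
    v = a1 + 2.0 * a2 * t + 3.0 * a3 * t**2
    a = 2.0 * a2 + 6.0 * a3 * t
    return p, v, a


def within_accel_limit(coeffs: Coeffs, duration: float, a_max: float) -> bool:
    """Acceleration is linear in t, so its extremes sit at the two endpoints."""
    _, _, a_start = evaluate(coeffs, 0.0)
    _, _, a_end = evaluate(coeffs, duration)
    return max(abs(a_start), abs(a_end)) <= a_max


if __name__ == "__main__":
    c = cubic_bvp(p0=0.0, v0=0.0, p1=1.0, v1=0.0, duration=2.0)
    print(evaluate(c, 2.0), within_accel_limit(c, 2.0, a_max=2.0))
```

Because both the connection and the limit check are a handful of arithmetic operations with no iteration or integration, many candidate edges can be evaluated in lockstep, which is the property that makes SIMD-style validation attractive.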
The Era of End-to-End Autonomy: Transitioning from Rule-Based Driving to Large Driving Models
Autonomous driving is undergoing a shift from modular rule-based pipelines toward end-to-end (E2E) learning systems. This paper examines this transition by tracing the evolution from classical sense-perceive-plan-control architectures to large driving models (LDMs) capable of mapping raw sensor input directly to driving actions. We analyze recent developments including Tesla's Full Self Driving (FSD) V12 V14, Rivian's Unified Intelligence platform, NVIDIA Cosmos, and emerging commercial robotaxi deployments, focusing on architectural design, deployment strategies, safety considerations and industry implications. A key emerging product category is supervised E2E driving, often referred to as FSD (Supervised) or L2 plus plus, which several manufacturers plan to deploy from 2026 onwards. These systems can perform most of the Dynamic Driving Task (DDT) in complex environments while requiring human supervision, shifting the driver's role to safety oversight. Early operational evidence suggests E2E learning handles the long-tail distribution of real-world driving scenarios and is becoming a dominant commercial strategy. We also discuss how similar architectural advances may extend beyond autonomous vehicles (AV) to other embodied AI systems, including humanoid robotics.
Compact Optical Single-axis Joint Torque Sensor Using Redundant Photo-Reflectors and Quadratic-Programming Calibration
This study proposes a non-contact photo-reflector-based joint torque sensor for precise joint-level torque control and safe physical interaction. Current-sensor-based torque estimation in many collaborative robots suffers from poor low-torque accuracy due to gearbox stiction/friction and current-torque nonlinearity, especially near static conditions. The proposed sensor optically measures micro-deformation of an elastic structure and employs a redundant array of photo-reflectors arranged in four directions to improve sensitivity and signal-to-noise ratio. We further present a quadratic-programming-based calibration method that exploits redundancy to suppress noise and enhance resolution compared to least-squares calibration. The sensor is implemented in a compact form factor (96 mm diameter, 12 mm thickness). Experiments demonstrate a maximum error of 0.083%FS and an RMS error of 0.0266 Nm for z-axis torque measurement. Calibration tests show that the proposed calibration achieves a 3-sigma resolution of 0.0224 Nm at 1 kHz without filtering, corresponding to a 2.14 times improvement over the least-squares baseline. Temperature chamber characterization and rational-fitting-based compensation mitigate zero drift induced by MCU self-heating and motor heat. Motor-level validation via torque control and admittance control confirms improved low-torque tracking and disturbance robustness relative to current-sensor-based control.
comment: 10 pages
Geometry-Aligned LLM Fine-Tuning for Sequential Narrow-Opening Planning
We study rigid-body motion planning through multiple sequential narrow openings, which requires long-horizon geometric reasoning because the configuration used to traverse an early opening constrains the set of reachable configurations for subsequent ones. To achieve this, we propose a geometry-aligned large language model (LLM) fine-tuning framework that generates fixed-length, machine-readable waypoint sequences that are both geometrically feasible and coordinated across openings. Our approach uses a bi-level training pipeline. First, we perform failure-driven LoRA supervised fine-tuning (SFT) on human demonstrations, which incorporates structured failure feedback to teach the model common failure modes and enforce the output format. Second, we refine the same LoRA adapters using Group Relative Policy Optimization (GRPO) with geometric verification: each sampled waypoint sequence is densified by a model-based planner and scored with a deterministic geometry-derived reward to achieve continuous-motion feasibility. To validate the effectiveness of our proposed method, we provide both quantitative and qualitative results from simulations. Our method achieves the highest success rate in both in-distribution and out-of-distribution environments and qualitatively exhibits long-horizon geometric reasoning by selecting exit poses that facilitate entry into subsequent openings.
comment: 8 pages, 3 figures
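The group-relative scoring step at the core of GRPO can be sketched as follows. This shows only the standard advantage normalization within one group of sampled outputs; the reward values are made up, and the population standard deviation and the `eps` stabilizer are assumptions, since the abstract does not describe the authors' geometric reward or training details.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Every sampled waypoint sequence for the same prompt forms one group;
    an output is reinforced only to the extent that it beats its siblings.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two samples")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical verifier scores for four sampled waypoint sequences.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all samples in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is why a reward with graded values tends to be more informative than a pure pass/fail check.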
MessyKitchens: Contact-rich object-level 3D scene reconstruction
Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate that MessyKitchens significantly improves on previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.
ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
Learning in simulation provides a useful foundation for scaling robotic manipulation capabilities. However, this paradigm often suffers from a lack of data-generation-ready digital assets, in both scale and diversity. In this work, we present ManiTwin, an automated and efficient pipeline for generating data-generation-ready digital object twins. Our pipeline transforms a single image into a simulation-ready and semantically annotated 3D asset, enabling large-scale robotic manipulation data generation. Using this pipeline, we construct ManiTwin-100K, a dataset containing 100K high-quality annotated 3D assets. Each asset is equipped with physical properties, language descriptions, functional annotations, and verified manipulation proposals. Experiments demonstrate that ManiTwin provides an efficient asset synthesis and annotation workflow, and that ManiTwin-100K offers high-quality and diverse assets for manipulation data generation, random scene synthesis, and VQA data generation, establishing a strong foundation for scalable simulation data synthesis and policy learning. Our webpage is available at https://manitwin.github.io/.
comment: Website: https://manitwin.github.io/
MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation
A prevailing view in robot learning is that simulation alone is not enough; effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated synthetic training data, we show that zero-shot transfer to the real world is not only possible, but effective for both static and mobile manipulation. We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.8 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot, a Molmo2-based multi-frame vision-language model with a flow-matching action head; MolmoBot-Pi0, which replicates the $π_0$ architecture to enable direct comparison; and MolmoBot-SPOC, a lightweight policy suitable for edge deployment and amenable to RL fine-tuning. We evaluate on two robotic platforms: the Franka FR3 for tabletop manipulation tasks and the Rainbow Robotics RB-Y1 mobile manipulator for door opening, drawer manipulation, cabinet interaction, and mobile pick-and-place. Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments. On tabletop pick-and-place, MolmoBot achieves a success rate of 79.2% in real-world evaluations across 4 settings, outperforming $π_{0.5}$ at 39.2%. Our results demonstrate that procedural environment generation combined with diverse articulated assets can produce robust manipulation policies that generalize broadly to the real world. Technical Blog: https://allenai.org/blog/molmobot-robot-manipulation
DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models
Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.
BrickSim: A Physics-Based Simulator for Manipulating Interlocking Brick Assemblies
Interlocking brick assemblies provide a standardized yet challenging testbed for contact-rich and long-horizon robotic manipulation, but existing rigid-body simulators do not faithfully capture snap-fit mechanics. We present BrickSim, the first real-time physics-based simulator for interlocking brick assemblies. BrickSim introduces a compact force-based mechanics model for snap-fit connections and solves the resulting internal force distribution using a structured convex quadratic program. Combined with a hybrid architecture that delegates rigid-body dynamics to the underlying physics engine while handling snap-fit mechanics separately, BrickSim enables real-time, high-fidelity simulation of assembly, disassembly, and structural collapse. On 150 real-world assemblies, BrickSim achieves 100% accuracy in static stability prediction with an average solve time of 5 ms. In dynamic drop tests, it also faithfully reproduces real-world structural collapse, precisely mirroring both the occurrence of breakage and the specific breakage locations. Built on Isaac Sim, BrickSim further supports seamless integration with a wide variety of robots and existing pipelines. We demonstrate robotic construction of brick assemblies using BrickSim, highlighting its potential as a foundation for research in dexterous, long-horizon robotic manipulation. BrickSim is open-source, and the code is available at https://github.com/intelligent-control-lab/BrickSim.
comment: 9 pages, 9 figures
Real-Time Decoding of Movement Onset and Offset for Brain-Controlled Rehabilitation Exoskeleton ICRA 2026
Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level, engaging the impaired neural circuits only indirectly, which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from non-invasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start-stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p = 0.0117; offset +34%, p = 0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start-stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.
comment: Accepted to ICRA 2026. 8 pages, 5 figures. Project page available at https://mitrakanishka.github.io/projects/startstop-bci/
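The threshold-free separability measure cited in this abstract, AUC, can be computed directly from decoder scores as the probability that a randomly chosen positive trial scores higher than a randomly chosen negative one. The sketch below uses that pairwise (Mann-Whitney) definition; the scores are invented and the function name is hypothetical, so it says nothing about the authors' decoder or data.

```python
from typing import Sequence


def pairwise_auc(positive: Sequence[float], negative: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    if not positive or not negative:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive:
        for n in negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive) * len(negative))


if __name__ == "__main__":
    onset_scores = [0.9, 0.8]   # made-up decoder outputs for "move" trials
    rest_scores = [0.1, 0.85]   # made-up decoder outputs for "rest" trials
    print(pairwise_auc(onset_scores, rest_scores))
```

Because the measure depends only on the ordering of scores, a constant drift that shifts both classes equally leaves it unchanged, whereas a class-dependent bias of the kind the abstract describes can alter the ordering and lower it.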
CABTO: Context-Aware Behavior Tree Grounding for Robot Manipulation
Behavior Trees (BTs) offer a powerful paradigm for designing modular and reactive robot controllers. BT planning, an emerging field, provides theoretical guarantees for the automated generation of reliable BTs. However, BT planning typically assumes that a well-designed BT system is already grounded -- comprising high-level action models and low-level control policies -- which often requires extensive expert knowledge and manual effort. In this paper, we formalize the BT Grounding problem: the automated construction of a complete and consistent BT system. We analyze its complexity and introduce CABTO (Context-Aware Behavior Tree grOunding), the first framework to efficiently solve this challenge. CABTO leverages pre-trained Large Models (LMs) to heuristically search the space of action models and control policies, guided by contextual feedback from BT planners and environmental observations. Experiments spanning seven task sets across three distinct robotic manipulation scenarios demonstrate CABTO's effectiveness and efficiency in generating complete and consistent behavior tree systems.
DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping
To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose DexGrasp-Zero, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand's kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning. MAGCN incorporates a Physical Property Injection mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82% average success rate on unseen objects.
Development of Low-Cost and Bidirectional Syringe Pumps for Soft Robotics Applications
Soft robotics leverages deformable materials to develop robots capable of navigating unstructured and dynamic environments. Silicone Voxel-Based Soft Robots (Silibots) are a type of pneumatically actuated soft robot that relies on the inflation and deflation of its voxels for shape-shifting behaviors. However, traditional pneumatic actuation methods (high-pressure solenoids, medical diaphragm pumps, micro compressors, compressed fluid) pose significant challenges due to their limited efficacy, cost, complexity, or lack of precision. This work introduces a low-cost and modular syringe pump system, constructed with off-the-shelf and 3D-printed parts, designed to overcome these limitations. The syringe pump system also enhances actuation with the unique ability to pull a vacuum as well as pump air into the soft robot. Furthermore, the syringe pump features modular hardware and customizable software, allowing researchers to tailor the syringe pump to their requirements or operate multiple pumps simultaneously with unique pump parameters. This flexibility makes the syringe pump an accessible and scalable tool that paves the way for broader adoption of soft robotic technologies in research and education.
Beyond Cybathlon: On-demand Quadrupedal Assistance for People with Limited Mobility
Background: Assistance robots have the potential to increase the independence of people who need daily care due to limited mobility or being wheelchair-bound. Current solutions of attaching robotic arms to motorized wheelchairs offer limited additional mobility at the cost of increased size and reduced wheelchair maneuverability. Methods: We present an on-demand quadrupedal assistance robot system controlled via a shared autonomy approach, which combines semi-autonomous task execution with human teleoperation. Due to the mobile nature of the system, it can assist the operator whenever needed and perform autonomous tasks independently, without otherwise restricting their mobility. We automate pick-and-place tasks, as well as robot movement through the environment with semantic, collision-aware navigation. For teleoperation, we present a mouth-level joystick interface that enables an operator with reduced mobility to control the robot's end effector for precision manipulation. Results: We showcase our system in the Cybathlon 2024 Assistance Robot Race, and validate it in an at-home experimental setup, where we measure task completion times and user satisfaction. We find our system capable of assisting in a broad variety of tasks, including those that require dexterous manipulation. The user study confirms the intuition that increased robot autonomy alleviates the operator's mental load. Conclusions: We present a flexible system that has the potential to help people in wheelchairs maintain independence in everyday life by enabling them to solve mobile manipulation problems without external support. We achieve results comparable to previous state-of-the-art on subjective metrics while allowing for more autonomy of the operator and greater agility for manipulation.
Thermopneumatic Pixels for Fast, Localized, Low-Voltage Touch Feedback
We present thermopneumatic pixels (TPPs), which are tactile actuators designed for rapid fabrication and straightforward integration into compact wearable and surface-based haptic systems. Each TPP converts low-voltage ($\sim$10 V) electrical pulses into transient pressure increases within a sealed cavity, producing out-of-plane forces and displacements suitable for tactile stimulation. The architecture enables scalable fabrication and spatially distributed actuation while maintaining simple electrical interfacing. The TPPs are constructed from inexpensive, readily available materials using straightforward layer-based assembly, facilitating rapid prototyping and integration into interactive devices. Mechanical characterization demonstrates peak forces exceeding 1 N and millimeter displacements. We further present driving electronics for operating multiple TPP modules concurrently and report perceptual study results demonstrating the effectiveness of the resulting tactile feedback. Together, these results establish low-voltage thermopneumatic actuation as an accessible and high-performance approach for embedding tactile feedback into experimental and consumer-facing interfaces.
vAccSOL: Efficient and Transparent AI Vision Offloading for Mobile Robots
Mobile robots are increasingly deployed for inspection, patrol, and search-and-rescue operations, relying on computer vision for perception, navigation, and autonomous decision-making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user-defined workloads to run on resource-constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI-based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware-optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real-world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot-side power consumption by up to 80% and edge-side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery-powered robots.
Learning Whole-Body Control for a Salamander Robot
Amphibious legged robots inspired by salamanders are promising for applications in complex amphibious environments. However, despite the significant success of training controllers that achieve diverse locomotion behaviors in conventional quadrupedal robots, most salamander robots have relied on central-pattern-generator (CPG)-based and model-based coordination strategies for locomotion control. Learning unified joint-level whole-body control that reliably transfers from simulation to highly articulated physical salamander robots remains relatively underexplored. In addition, few legged robots have tried learning-based controllers in amphibious environments. In this work, we employ Reinforcement Learning to map proprioceptive observations and commanded velocities to joint-level actions, allowing coordinated locomotor behaviors to emerge. To deploy these policies on hardware, we adopt a system-level real-to-sim matching and sim-to-real transfer strategy. The learned controller achieves stable and coordinated walking on both flat and uneven terrains in the real world. Beyond terrestrial locomotion, the framework enables transitions between walking and swimming in simulation, highlighting a phenomenon of interest for understanding locomotion across distinct physical modes.
When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generation to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments' reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.
comment: Project page: https://mutianxu.github.io/Kinema4D-project-page/
Reconciling distributed compliance with high-performance control in continuum soft robotics
High-performance closed-loop control of truly soft continuum manipulators has remained elusive. Experimental demonstrations have largely relied on sufficiently stiff, piecewise architectures in which each actuated segment behaves as a distributed yet effectively rigid element, while deformation modes beyond simple bending are suppressed. This strategy simplifies modeling and control, but sidesteps the intrinsic complexity of a fully compliant body and makes the system behave as a serial kinematic chain, much like a conventional articulated robot. An implicit conclusion has consequently emerged within the community: distributed softness and dynamic precision are incompatible. Here we show this trade-off is not fundamental. We present a highly compliant, fully continuum robotic arm - without hardware discretization or stiffness-based mode suppression - that achieves fast, precise task-space convergence under dynamic conditions. The platform integrates direct-drive actuation, a tendon routing scheme enabling coupled bending and twisting, and a structured nonlinear control architecture grounded in reduced-order strain modeling of underactuated systems. Modeling, actuation, and control are co-designed to preserve essential mechanical complexity while enabling high-bandwidth loop closure. Experiments demonstrate accurate, repeatable execution of dynamic Cartesian tasks, including fast positioning and interaction. The proposed system achieves the fastest reported task-execution speed among soft robots. At millimetric precision, execution speed increases nearly fourfold compared with prior approaches, while operating on a fully compliant continuum body. These results show that distributed compliance and high-performance dynamic control can coexist, opening a path toward truly soft manipulators approaching the operational capabilities of rigid robots without sacrificing morphological richness.
Routing and Control for Marine Oil-Spill Cleanup with a Boom-Towing Vessel Fleet
Marine oil spills damage ecosystems, contaminate coastlines, and disrupt food webs, while imposing substantial economic losses on fisheries and coastal communities. Prior work has demonstrated the feasibility of containing and cleaning individual spills using a duo of autonomous surface vehicles (ASVs) equipped with a towed boom and skimmers. However, existing algorithmic approaches primarily address isolated slicks and individual ASV duos, lacking scalable methods for coordinating large robotic fleets across multiple spills representative of realistic oil-spill incidents. In this work, we propose an integrated multi-robot framework for coordinated oil-spill confinement and cleanup using autonomous ASV duos. We formulate multi-spill response as a risk-weighted minimum-latency problem, where spill-specific risk factors and service times jointly determine cumulative environmental damage. To solve this problem, we develop a hybrid optimization approach combining mixed-integer linear programming with a tailored warm-start heuristic, enabling near-optimal routing plans for scenarios with tens of spills within minutes on commodity hardware. For physical execution, we design and analyze two tracking controllers for boom-towing ASV duos: a feedback-linearization controller with proven asymptotic stability, and a baseline PID controller. Simulation results under coupled vessel-boom dynamics demonstrate accurate path tracking for both controllers. Together, these components provide a scalable, holistic framework for rapid, risk-aware multi-robot response to large-scale oil spill disasters.
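To make the risk-weighted minimum-latency idea concrete, the following is a minimal sketch for a single ASV duo, not the paper's MILP formulation. It assumes (hypothetically) that damage for each spill is its risk factor multiplied by the time at which its cleanup finishes, that travel is straight-line at constant speed, and it uses brute-force enumeration, which only works for a handful of spills. The names `route_cost` and `best_order` are illustrative.

```python
from itertools import permutations
import math


def route_cost(order, pos, risk, service, start=(0.0, 0.0), speed=1.0):
    """Risk-weighted latency of visiting spills in the given order.

    Assumption: each spill contributes risk[i] * (time its service completes).
    """
    t, here, cost = 0.0, start, 0.0
    for i in order:
        t += math.dist(here, pos[i]) / speed  # travel to the spill
        t += service[i]                       # contain and clean it
        cost += risk[i] * t                   # damage accrued until completion
        here = pos[i]
    return cost


def best_order(pos, risk, service, **kw):
    """Exhaustive search over visit orders (tiny instances only)."""
    return min(
        permutations(range(len(pos))),
        key=lambda order: route_cost(order, pos, risk, service, **kw),
    )
```

With two equidistant spills on opposite sides of the start point, the higher-risk spill is served first under this objective, even though the total travel distance is the same in both orders.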
Dexterous grasp data augmentation based on grasp synthesis with fingertip workspace cloud and contact-aware sampling
Robotic grasping is a fundamental yet crucial component of robotic applications, as effective grasping often serves as the starting point for various tasks. With the rapid advancement of neural networks, data-driven approaches for robotic grasping have become mainstream. However, efficiently generating grasp datasets for training remains a bottleneck. This is compounded by the diverse structures of robotic hands, making the design of generalizable grasp generation methods even more complex. In this work, we propose a teleoperation-based framework to collect a small set of grasp pose demonstrations, which are augmented using FSG--a Fingertip-contact-aware Sampling-based Grasp generator. Based on the demonstrated grasp poses, we propose AutoWS, which automatically generates structured workspace clouds of robotic fingertips, embedding the hand structure information directly into the clouds to eliminate the need for inverse kinematics calculations. Experiments on grasping the YCB objects show that our method significantly outperforms existing approaches in both speed and valid pose generation rate. Our framework enables real-time grasp generation for hands with arbitrary structures and produces human-like grasps when combined with demonstrations, providing an efficient and robust data augmentation tool for data-driven grasp training.
comment: Accepted to Advanced Robotics, GitHub: https://github.com/W567/FSG, YouTube: https://youtu.be/rFCDl9SxSSA
Scalable Inspection Planning via Flow-based Mixed Integer Linear Programming
Inspection planning is concerned with computing the shortest robot path to inspect a given set of points of interest (POIs) using the robot's sensors. This problem arises in a wide range of applications from manufacturing to medical robotics. To alleviate the problem's complexity, recent methods rely on sampling-based methods to obtain a more manageable (discrete) graph inspection planning (GIP) problem. Unfortunately, GIP still remains highly difficult to solve at scale as it requires simultaneously satisfying POI-coverage and path-connectivity constraints, giving rise to a challenging optimization problem, particularly at scales encountered in real-world scenarios. In this work, we present highly scalable Mixed Integer Linear Programming (MILP) solutions for GIP that significantly advance the state-of-the-art in both runtime and solution quality. Our key insight is a reformulation of the problem's core constraints as a network flow, which enables effective MILP models and a specialized Branch-and-Cut solver that exploits the combinatorial structure of flows. We evaluate our approach on medical and infrastructure benchmarks alongside large-scale synthetic instances. Across all scenarios, our method produces substantially tighter lower bounds than existing formulations, reducing optimality gaps by 30-50% on large instances. Furthermore, our solver demonstrates unprecedented scalability: it provides non-trivial solutions for problems with up to 15,000 vertices and thousands of POIs, where prior state-of-the-art methods typically exhaust memory or fail to provide any meaningful optimality guarantees.
ASCENT: Transformer-Based Aircraft Trajectory Prediction in Non-Towered Terminal Airspace ICRA 2026
Accurate trajectory prediction can improve General Aviation safety in non-towered terminal airspace, where high traffic density increases accident risk. We present ASCENT, a lightweight transformer-based model for multi-modal 3D aircraft trajectory forecasting, which integrates domain-aware 3D coordinate normalization and parameterized predictions. ASCENT employs a transformer-based motion encoder and a query-based decoder, enabling the generation of diverse maneuver hypotheses with low latency. Experiments on the TrajAir and TartanAviation datasets demonstrate that our model outperforms prior baselines, as the encoder effectively captures motion dynamics and the decoder aligns with structured aircraft traffic patterns. Furthermore, ablation studies confirm the contributions of the decoder design, coordinate-frame modeling, and parameterized outputs. These results establish ASCENT as an effective approach for real-time aircraft trajectory prediction in non-towered terminal airspace.
comment: ICRA 2026. Project Page at https://a-pru.github.io/ascent/
A Pin-Array Structured Climbing Robot for Stable Locomotion on Steep Rocky Terrain ICRA
Climbing robots face significant challenges when navigating unstructured environments, where reliable attachment to irregular surfaces is critical. We present a novel mobile climbing robot equipped with compliant pin-array structured grippers that passively conform to surface irregularities, ensuring stable ground gripping without the need for complicated sensing or control. Each pin features a vertically split design, combining an elastic element with a metal spine to enable mechanical interlocking with microscale surface features. Statistical modeling and experimental validation indicate that variability in individual pin forces and contact numbers are the primary sources of grasping uncertainty. The robot demonstrated robust and stable locomotion in indoor tests on inclined walls (10-30 degrees) and in outdoor tests on natural rocky terrain. This work highlights that a design emphasizing passive compliance and mechanical redundancy provides a practical and robust solution for real-world climbing robots while minimizing control complexity.
comment: Author's version of a manuscript accepted at the 2026 IEEE International Conference on Robotics and Automation (ICRA). (c) IEEE
Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting
Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.
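The weighting pipeline described above (softmax identification posterior, posterior-to-uniform ratio, clipped-and-mixed weight, self-normalized weighted regression) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the clip threshold `clip_max`, the mixing coefficient `mix`, and the exact form of the mixing with a uniform weight are assumptions, and the scorer logits are taken as given.

```python
import numpy as np


def ptr_weights(scores, true_idx, clip_max=2.0, mix=0.5):
    """Per-sample weights from transition-scorer logits.

    scores:   (B, K) logits over a candidate pool of K targets per sample
    true_idx: (B,) position of each sample's own target in its pool
    clip_max, mix: hypothetical hyperparameters (not from the source)
    """
    scores = np.asarray(scores, dtype=float)
    # Softmax identification posterior over target indices.
    z = scores - scores.max(axis=1, keepdims=True)
    post = np.exp(z)
    post /= post.sum(axis=1, keepdims=True)
    k = scores.shape[1]
    p_true = post[np.arange(scores.shape[0]), true_idx]
    # Posterior-to-uniform ratio: p_true / (1 / K).
    ratio = p_true * k
    # Clip from above, then mix with the uniform weight 1 (assumed form).
    w = mix * 1.0 + (1.0 - mix) * np.minimum(ratio, clip_max)
    # Self-normalize so the weights sum to one over the batch.
    return w / w.sum()


def weighted_regression_loss(per_sample_loss, weights):
    """Self-normalized weighted version of the original action objective."""
    return float(np.sum(weights * np.asarray(per_sample_loss, dtype=float)))
```

Because the weights multiply a per-sample loss rather than a likelihood, the same wrapper applies whether that loss comes from a diffusion or a flow-matching action head. An uninformative scorer (all logits equal) yields a ratio of 1 and recovers uniform post-training.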
Designing for Disagreement: Front-End Guardrails for Assistance Allocation in LLM-Enabled Robots
LLM-enabled robots prioritizing scarce assistance in social settings face pluralistic values and LLM behavioral variability: reasonable people can disagree about who is helped first, while LLM-mediated interaction policies vary across prompts, contexts, and groups in ways that are difficult to anticipate or verify at contact point. Yet user-facing guardrails for real-time, multi-user assistance allocation remain under-specified. We propose bounded calibration with contestability, a procedural front-end pattern that (i) constrains prioritization to a governance-approved menu of admissible modes, (ii) keeps the active mode legible in interaction-relevant terms at the point of deferral, and (iii) provides an outcome-specific contest pathway without renegotiating the global rule. Treating pluralism and LLM uncertainty as standing conditions, the pattern avoids both silent defaults that hide implicit value skews and wide-open user-configurable "value settings" that shift burden under time pressure. We illustrate the pattern with a public-concourse robot vignette and outline an evaluation agenda centered on legibility, procedural legitimacy, and actionability, including risks of automation bias and uneven usability of contest channels.
comment: Accepted at the Proceedings of the CHI 2026 Workshop: Ethics at the Front-End
Kamino: GPU-based Massively Parallel Simulation of Multi-Body Systems with Challenging Topologies
We present Kamino, a GPU-based physics solver for massively parallel simulations of heterogeneous highly-coupled mechanical systems. Implemented in Python using NVIDIA Warp and integrated into the Newton framework, it enables the application of data-driven methods, such as large-scale reinforcement learning, to complex robotic systems that exhibit strongly coupled kinematic and dynamic constraints such as kinematic loops. The latter are often circumvented by practitioners, who approximate the system topology as a kinematic tree and incorporate explicit loop-closure constraints or so-called mimic joints. Kamino aims at alleviating this burden by natively supporting these types of coupling. This capability facilitates high-throughput parallelized simulations that capture the true nature of mechanical systems that exploit closed kinematic chains for mechanical advantage. Moreover, Kamino supports heterogeneous worlds, allowing for batched simulation of structurally diverse robots on a single GPU. At its core lies a state-of-the-art constrained optimization algorithm that computes constraint forces by solving the constrained rigid multi-body forward dynamics transcribed as a nonlinear complementarity problem. This leads to high-fidelity simulations that can resolve contact dynamics without resorting to approximate models that simplify and/or convexify the problem. We demonstrate RL policy training on DR Legs, a biped with six nested kinematic loops, generating a feasible walking policy while simulating 4096 parallel environments on a single GPU.
LIMBERO: A Limbed Climbing Exploration Robot Toward Traveling on Rocky Cliffs ICRA
In lunar and planetary exploration, legged robots have attracted significant attention as an alternative to conventional wheeled robots, which struggle to traverse rough and uneven terrain. To enable locomotion over highly irregular and steeply inclined surfaces, limbed climbing robots equipped with grippers on their feet have emerged as a promising solution. In this paper, we present LIMBERO, a 10 kg-class quadrupedal climbing robot that employs spine-type grippers for stable locomotion and climbing on rugged and steep terrain. We first introduce a novel gripper design featuring coupled finger-closing and spine-hooking motions, tightly actuated by a single motor, which achieves exceptional grasping performance (>150 N) despite its lightweight design (525 g). Furthermore, we develop an efficient algorithm to visualize a geometry-based graspability index on continuous rough terrain. Finally, we integrate these components into LIMBERO and demonstrate its ability to ascend steep rocky surfaces under a 1 G gravity condition, a performance not previously achieved for limbed climbing robots of this scale.
comment: Author's version of a manuscript accepted at the 2026 IEEE International Conference on Robotics and Automation (ICRA). (c) IEEE
When Rolling Gets Weird: A Curved-Link Tensegrity Robot for Non-Intuitive Behavior ICRA
Conventional mobile tensegrity robots constructed with straight links offer mobility at the cost of locomotion speed. While spherical robots provide highly effective rolling behavior, they often lack the stability required for navigating unstructured terrain common in many space exploration environments. This research presents a solution with a semi-circular, curved-link tensegrity robot that strikes a balance between efficient rolling locomotion and controlled stability, enabled by discontinuities present at the arc endpoints. Building upon an existing geometric static modeling framework [1], this work presents the system design of an improved Tensegrity eXploratory Robot 2 (TeXploR2). Internal shifting masses instantaneously roll along each curved-link, dynamically altering the two points of contact with the ground plane. Simulations of quasistatic, piecewise continuous locomotion sequences reveal new insights into the positional displacement between inertial and body frames. Non-intuitive rolling behaviors are identified and experimentally validated using a tetherless prototype, demonstrating successful dynamic locomotion. A preliminary impact test highlights the tensegrity structure's inherent shock absorption capabilities and conformability. Future work will focus on finalizing a dynamic model that is experimentally validated with extended testing in real-world environments as well as further refinement of the prototype to incorporate additional curved-links and subsequent ground contact points for increased controllability.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Coverage First Next Best View for Inspection of Cluttered Pipe Networks Using Mobile Manipulators
Robotic inspection of radioactive areas enables operators to be removed from hazardous environments; however, planning and operating in confined, cluttered environments remain challenging. These systems must autonomously reconstruct the unknown environment and cover its surfaces, whilst estimating and avoiding collisions with objects in the environment. In this paper, we propose a new planning approach based on next-best-view that enables simultaneous exploration and exploitation of the environment by reformulating the coverage path planning problem in terms of information gain. To handle obstacle avoidance under uncertainty, we extend the vector-field-inequalities framework to explicitly account for stochastic measurements of geometric primitives in the environment via chance constraints in a constrained optimal control law. The stochastic constraints were evaluated experimentally alongside the planner on a mobile manipulator in a confined environment to inspect a pipe network. These experiments demonstrate that the system can autonomously plan and execute inspection and coverage paths to reconstruct and fully cover the simplified pipe network. Moreover, the system successfully estimated geometric primitives online and avoided collisions during motion between viewpoints.
comment: 8 pages, 9 figures, 1 table. Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems 2026
FastLoop: Parallel Loop Closing with GPU-Acceleration in Visual SLAM
Visual SLAM systems combine visual tracking with global loop closure to maintain a consistent map and accurate localization. Loop closure is a computationally expensive process as we need to search across the whole map for matches. This paper presents FastLoop, a GPU-accelerated loop closing module to alleviate this computational complexity. We identify key performance bottlenecks in the loop closing pipeline of visual SLAM and address them through parallel optimizations on the GPU. Specifically, we use task-level and data-level parallelism and integrate a GPU-accelerated pose graph optimization. Our implementation is built on top of ORB-SLAM3 and leverages CUDA for GPU programming. Experimental results show that FastLoop achieves an average speedup of 1.4x and 1.3x on the EuRoC dataset and 3.0x and 2.4x on the TUM-VI dataset for the loop closing module on desktop and embedded platforms, respectively, while maintaining the accuracy of the original system.
Influence of Gripper Design on Human Demonstration Quality for Robot Learning
Opening sterile medical packaging is routine for healthcare workers but remains challenging for robots. Learning from demonstration enables robots to acquire manipulation skills directly from humans, and handheld gripper tools such as the Universal Manipulation Interface (UMI) offer a pathway for efficient data collection. However, the effectiveness of these tools depends heavily on their usability. We evaluated UMI in demonstrating a bandage opening task, a common manipulation task in hospital settings, by testing three conditions: distributed load grippers, concentrated load grippers, and bare hands. Eight participants performed timed trials, with task performance assessed by success rate, completion time, and damage, alongside perceived workload using the NASA-TLX questionnaire. Concentrated load grippers improved performance relative to distributed load grippers but remained substantially slower and less effective than hands. These results underscore the importance of ergonomic and mechanical refinements in handheld grippers to reduce user burden and improve demonstration quality, especially for applications in healthcare robotics.
comment: To be published in proceedings of 2026 IEEE International Conference on Robotics & Automation
SLAM Adversarial Lab: An Extensible Framework for Visual SLAM Robustness Evaluation under Adverse Conditions
We present SAL (SLAM Adversarial Lab), a modular framework for evaluating visual SLAM systems under adversarial conditions such as fog and rain. SAL represents each adversarial condition as a perturbation that transforms an existing dataset into an adversarial dataset. When transforming a dataset, SAL supports severity levels using easily-interpretable real-world units such as meters for fog visibility. SAL's extensible architecture decouples datasets, perturbations, and SLAM algorithms through common interfaces, so users can add new components without rewriting integration code. Moreover, SAL includes a search procedure that finds the severity level of a perturbation at which a SLAM system fails. To showcase the capabilities of SAL, our evaluation integrates seven SLAM algorithms and evaluates them across three datasets under weather, camera, and video transport perturbations.
comment: 8 pages, 4 figures
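The abstract does not say how the severity search works. One plausible sketch, assuming that failure is monotone in severity (once the system fails at some level, it fails at every harsher level), is a bisection over a scalar severity. The callable `fails` is a hypothetical stand-in for "perturb the dataset at this severity, run the SLAM system, and check a failure criterion"; none of these names come from SAL itself.

```python
from typing import Callable, Optional


def find_failure_severity(
    fails: Callable[[float], bool],
    lo: float,
    hi: float,
    tol: float = 1e-2,
) -> Optional[float]:
    """Bisect for the mildest severity in [lo, hi] at which `fails` is True.

    Assumes `fails` is monotone: False below some threshold, True at or above.
    Returns None if the system survives even the harshest level `hi`.
    """
    if fails(lo):
        return lo
    if not fails(hi):
        return None
    # Invariant: fails(lo) is False and fails(hi) is True.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

For a unit where smaller values are harsher, such as fog visibility in meters, the severity axis would need to be flipped (for example, by searching over negative visibility) before applying this routine. Each call to `fails` is a full SLAM run, so the logarithmic number of evaluations is the main appeal of bisection over a linear sweep.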
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images CVPR 2026
We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our self-supervised approach leverages bird's-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns learnable global landmark coordinates with per-frame heatmaps, yielding consistent landmark detections across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and achieves strong performance compared to state-of-the-art methods.
comment: Accepted to CVPR 2026
Shielded Reinforcement Learning Under Dynamic Temporal Logic Constraints
Reinforcement Learning (RL) has shown promise in various robotics applications, yet its deployment on real systems is still limited due to safety and operational constraints. The field of safe RL, which focuses on imposing safety constraints throughout the learning process, has gained considerable attention in recent years. However, real systems often require more complex constraints than just safety, such as periodic recharging or time-bounded visits to specific regions. Imposing such spatio-temporal tasks during learning remains a challenge. Signal Temporal Logic (STL) is a formal language for specifying temporal properties of real-valued signals and provides a way to express such complex tasks. In this paper, we propose a framework that leverages sequential control barrier functions and model-free RL to ensure that the given STL tasks are satisfied throughout the learning process. Our method extends beyond traditional safety constraints by enforcing rich STL specifications, which can involve visits to dynamic targets with unknown trajectories. We also demonstrate the effectiveness of our framework through various simulations.
comment: 7 pages, 3 figures, 2026 IEEE American Control Conference (ACC)
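As background for the STL tasks mentioned above, the standard quantitative (robustness) semantics of the two basic temporal operators on a discrete-time signal can be sketched as below. This illustrates STL itself, not the paper's control-barrier-function shield; the predicate margin (here, target radius minus distance to the target) and the function names are illustrative assumptions.

```python
import numpy as np


def rob_always(margin, a, b):
    """Robustness of G_[a,b] (margin > 0): the worst margin in the window."""
    return float(np.min(np.asarray(margin, dtype=float)[a : b + 1]))


def rob_eventually(margin, a, b):
    """Robustness of F_[a,b] (margin > 0): the best margin in the window."""
    return float(np.max(np.asarray(margin, dtype=float)[a : b + 1]))


# Example predicate: "robot is within radius 1 of the target".
dist_to_target = np.array([3.0, 2.0, 0.5, 2.0])
margin = 1.0 - dist_to_target  # positive when inside the region
```

A positive robustness value means the specification is satisfied with that much slack; here a time-bounded visit over steps 0 to 3 is satisfied, while "always inside the region" over the same window is not.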
SLowRL: Safe Low-Rank Adaptation Reinforcement Learning for Locomotion
Sim-to-real transfer of locomotion policies often leads to performance degradation due to the inevitable sim-to-real gap. Naively fine-tuning these policies directly on hardware is problematic, as it poses risks of mechanical failure and suffers from high sample inefficiency. In this paper, we address the challenge of safely and efficiently fine-tuning reinforcement learning (RL) policies for dynamic locomotion tasks. Specifically, we focus on fine-tuning policies learned in simulation directly on hardware, while explicitly enforcing safety constraints. In doing so, we introduce SLowRL, a framework that combines Low-Rank Adaptation (LoRA) with training-time safety enforcement via a recovery policy. We evaluate our method both in simulation and on a real Unitree Go2 quadruped robot for jump and trot tasks. Experimental results show that our method achieves a $46.5\%$ reduction in fine-tuning time and near-zero safety violations compared to standard proximal policy optimization (PPO) baselines. Notably, we find that a rank-1 adaptation alone is sufficient to recover pre-trained performance in the real world, while maintaining stable and safe real-world fine-tuning. These results demonstrate the practicality of safe, efficient fine-tuning for dynamic real-world robotic applications.
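To illustrate why a rank-1 adaptation is so cheap to fine-tune, here is a minimal NumPy sketch of a LoRA-style linear layer. It shows the generic low-rank update W + (alpha / r) * B A with the pretrained weight frozen, not SLowRL's actual network or its safety mechanism; the class name, initialization scale, and `alpha` default are assumptions.

```python
import numpy as np


class LoRALinear:
    """Linear layer with a frozen weight plus a trainable low-rank update."""

    def __init__(self, weight, rank=1, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                              # frozen, pretrained
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))    # trainable
        self.B = np.zeros((out_dim, rank))                # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the low-rank correction x A^T B^T.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained policy exactly, and with rank 1 only `in_dim + out_dim` parameters per layer are updated on hardware instead of `in_dim * out_dim`.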
TrackDeform3D: Markerless and Autonomous 3D Keypoint Tracking and Dataset Collection for Deformable Objects
Structured 3D representations such as keypoints and meshes offer compact, expressive descriptions of deformable objects, jointly capturing geometric and topological information useful for downstream tasks such as dynamics modeling and motion planning. However, robustly extracting such representations remains challenging, as current perception methods struggle to handle complex deformations. Moreover, large-scale 3D data collection remains a bottleneck: existing approaches either require prohibitive data collection efforts, such as labor-intensive annotation or expensive motion capture setups, or rely on simplifying assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects remain scarce. To address these challenges, this paper presents an affordable and autonomous framework for collecting 3D datasets of deformable objects using only RGB-D cameras. The proposed method identifies 3D keypoints and robustly tracks their trajectories, incorporating motion consistency constraints to produce temporally smooth and geometrically coherent data. TrackDeform3D is evaluated against several state-of-the-art tracking methods across diverse object categories and demonstrates consistent improvements in both geometric and tracking accuracy. Using this framework, this paper presents a high-quality, large-scale dataset consisting of 6 deformable objects, totaling 110 minutes of trajectory data.
TeleDex: Accessible Dexterous Teleoperation
Despite increasing dataset scale and model capacity, robot manipulation policies still struggle to generalize beyond their training distributions. As a result, deploying state-of-the-art policies in new environments, tasks, or robot embodiments often requires collecting additional demonstrations. Enabling this in real-world deployment settings requires tools that allow users to collect demonstrations quickly, affordably, and with minimal setup. We present TeleDex, an open-source system for intuitive teleoperation of dexterous hands and robotic manipulators using any readily available phone. The system streams low-latency 6-DoF wrist poses and articulated 21-DoF hand state estimates from the phone, which are retargeted to robot arms and multi-fingered hands without requiring external tracking infrastructure. TeleDex supports both a handheld phone-only mode and an optional 3D-printable hand-mounted interface for finger-level teleoperation. By lowering the hardware and setup barriers to dexterous teleoperation, TeleDex enables users to quickly collect demonstrations during deployment to support policy fine-tuning. We evaluate the system across simulation and real-world manipulation tasks, demonstrating its effectiveness as a unified scalable interface for robot teleoperation. All software and hardware designs, along with demonstration videos, are open-source and available at orayyan.com/teledex.
comment: For project website and videos, see https://www.orayyan.com/teledex
Asymmetric Nash Seeking via Best Response Maps: Global Linear Convergence and Robustness to Inexact Reaction Models
Nash equilibria provide a principled framework for modeling interactions in multi-agent decision-making and control. However, many equilibrium-seeking methods implicitly assume that each agent has access to the other agents' objectives and constraints, an assumption that is often unrealistic in practice. This letter studies a class of asymmetric-information two-player constrained games with decoupled feasible sets, in which Player 1 knows its own objective and constraints while Player 2 is available only through a best-response map. For this class of games, we propose an asymmetric projected gradient descent-best response iteration that does not require full mutual knowledge of both players' optimization problems. Under suitable regularity conditions, we establish the existence and uniqueness of the Nash equilibrium and prove global linear convergence of the proposed iteration when the best-response map is exact. Recognizing that best-response maps are often learned or estimated, we further analyze the inexact case and show that, when the approximation error is uniformly bounded by $\varepsilon$, the iterates enter an explicit $O(\varepsilon)$ neighborhood of the true Nash equilibrium. Numerical results on a benchmark game corroborate the predicted convergence behavior and error scaling.
comment: 6 Pages, 2 Figures, Preprint submitted to IEEE L-CSS and CDC 2026
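The iteration described above can be sketched on a toy scalar game. The abstract does not give the exact update rule, so this is one plausible reading under stated assumptions: Player 1 minimizes the assumed cost f1(x, y) = 0.5 x^2 + x y - x over [0, 1], and Player 2 is visible only through an assumed best-response map y = 0.5 x. The names, step size, and error level `eps` are all hypothetical.

```python
def clip(value, low, high):
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def pgd_best_response(best_response, x0=0.0, step=0.5, iters=60):
    """Alternate a best-response query with a projected gradient step for Player 1."""
    x = x0
    for _ in range(iters):
        y = best_response(x)           # Player 2 is only accessible through this map
        grad = x + y - 1.0             # d/dx of the assumed cost 0.5*x**2 + x*y - x
        x = clip(x - step * grad, 0.0, 1.0)
    return x, best_response(x)


def exact_br(x):
    return clip(0.5 * x, 0.0, 1.0)


def inexact_br(x, eps=0.03):
    # Learned or estimated map with a uniformly bounded error of eps.
    return clip(0.5 * x + eps, 0.0, 1.0)


x_star, y_star = pgd_best_response(exact_br)
x_eps, _ = pgd_best_response(inexact_br)
```

With the exact map, each step contracts the error by a factor of 0.25 and the iterate settles at the equilibrium x = 2/3, y = 1/3. With the perturbed map, it settles at a point whose distance from the equilibrium is eps / 1.5, consistent with an O(ε) neighborhood in this toy instance.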
Contingency-Aware Planning via Certified Neural Hamilton-Jacobi Reachability
Hamilton-Jacobi (HJ) reachability provides formal safety guarantees for dynamical systems, but solving high-dimensional HJ partial differential equations limits its use in real-time planning. This paper presents a contingency-aware multi-goal navigation framework that integrates learning-based reachability with sampling-based planning in unknown environments. We use a Fourier Neural Operator (FNO) to approximate the solution operator of the Hamilton-Jacobi-Isaacs variational inequality under varying obstacle configurations. We first provide a theoretical under-approximation guarantee on the safe backward reach-avoid set, which enables formal safety certification of the learned reachable sets. Then, we integrate the certified reachable sets with an incremental multi-goal planner, which enforces reachable-set constraints and a recovery policy that guarantees finite-time return to a safe region. Overall, we demonstrate that the proposed framework achieves asymptotically optimal navigation with provable contingency behavior, and validate its performance through real-time deployment on KUKA's youBot in Webots simulation.
comment: 9 pages, 4 figures
Efficient and Reliable Teleoperation through Real-to-Sim-to-Real Shared Autonomy
Fine-grained, contact-rich teleoperation remains slow, error-prone, and unreliable in real-world manipulation tasks, even for experienced operators. Shared autonomy offers a promising way to improve performance by combining human intent with automated assistance, but learning effective assistance in simulation requires a faithful model of human behavior, which is difficult to obtain in practice. We propose a real-to-sim-to-real shared autonomy framework that augments human teleoperation with learned corrective behaviors, using a simple yet effective k-nearest-neighbor (kNN) human surrogate to model operator actions in simulation. The surrogate is fit from less than five minutes of real-world teleoperation data and enables stable training of a residual copilot policy with model-free reinforcement learning. The resulting copilot is deployed to assist human operators in real-world fine-grained manipulation tasks. Through simulation experiments and a user study with sixteen participants on industry-relevant tasks, including nut threading, gear meshing, and peg insertion, we show that our system improves task success for novice operators and execution efficiency for experienced operators compared to direct teleoperation and shared-autonomy baselines that rely on expert priors or behavioral-cloning pilots. In addition, copilot-assisted teleoperation produces higher-quality demonstrations for downstream imitation learning.
comment: Project Page: https://residual-copilot.github.io/
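A k-nearest-neighbor surrogate of the kind mentioned above can be sketched in a few lines. This is a generic kNN regressor over (state, action) pairs, not the authors' code: the function name `knn_action`, the plain averaging rule, and the toy demonstrations are assumptions made for illustration.

```python
import math


def knn_action(demos, state, k=2):
    """Predict an operator action as the mean action of the k closest recorded states."""
    nearest = sorted(demos, key=lambda pair: math.dist(pair[0], state))[:k]
    dim = len(nearest[0][1])
    return tuple(sum(action[i] for _, action in nearest) / len(nearest)
                 for i in range(dim))


# Hypothetical teleoperation log: (state, action) pairs.
demos = [
    ((0.0, 0.0), (1.0, 0.0)),
    ((1.0, 0.0), (3.0, 2.0)),
    ((10.0, 10.0), (100.0, 100.0)),
]

predicted = knn_action(demos, (0.4, 0.0), k=2)
```

Because the surrogate only interpolates recorded data, it needs no training loop. That fits the abstract's point that a few minutes of teleoperation data are enough to fit it.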
Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models
Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Generally, designing dense rewards is challenging and usually requires access to privileged state information available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering. However, this biases the resulting reward function towards a specific solution and leaves it undefined in states not covered by the demonstrations. In this work, we introduce Rewarding DINO, a method for language-conditioned reward modeling that learns actual reward functions rather than specific trajectories. The model's compact size allows it to serve as a direct replacement for analytical reward functions with comparatively low computational overhead. We train our model on data sampled from 24 Meta-World+ tasks using a rank-based loss and evaluate pairwise accuracy, rank correlation, and calibration. Rewarding DINO achieves competitive performance in tasks from the training set and generalizes to new settings in simulation and the real world, indicating that it learns task semantics. We also test the model with off-the-shelf reinforcement learning algorithms to solve tasks from our Meta-World+ training set.
comment: 10 pages, 5 figures, submitted to IEEE
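Pairwise accuracy, one of the metrics named above, can be read as the fraction of state pairs whose predicted reward ordering matches the ground-truth ordering. The sketch below uses that common definition. The abstract does not specify the exact protocol, so skipping ground-truth ties and counting predicted ties as errors are assumptions, as are the function name and toy data.

```python
from itertools import combinations


def pairwise_accuracy(true_rewards, predicted_rewards):
    """Fraction of pairs ranked in the same order by prediction and ground truth."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(true_rewards)), 2):
        true_diff = true_rewards[i] - true_rewards[j]
        if true_diff == 0:
            continue  # assumed: pairs tied in the ground truth carry no ordering signal
        pred_diff = predicted_rewards[i] - predicted_rewards[j]
        total += 1
        if true_diff * pred_diff > 0:
            correct += 1
    return correct / total if total else 0.0


true_rewards = [0.0, 1.0, 2.0, 3.0]
predicted = [0.1, 0.5, 0.4, 0.9]   # swaps the order of the two middle states

accuracy = pairwise_accuracy(true_rewards, predicted)
```

Here five of the six pairs are ordered correctly. A metric of this form depends only on the ordering of predictions, not their scale, which suits a model trained with a rank-based loss.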
Crowd-FM: Learned Optimal Selection of Conditional Flow Matching-generated Trajectories for Crowd Navigation ICRA 2026
Safe and computationally efficient local planning for mobile robots in dense, unstructured human crowds remains a fundamental challenge. Moreover, ensuring that robot trajectories are similar to how a human moves will increase the acceptance of the robot in human environments. In this paper, we present Crowd-FM, a learning-based approach to address both safety and human-likeness challenges. Our approach has two novel components. First, we train a Conditional Flow-Matching (CFM) policy over a dataset of optimally controlled trajectories to learn a set of collision-free primitives that a robot can choose at any given scenario. The chosen optimal control solver can generate multi-modal collision-free trajectories, allowing the CFM policy to learn a diverse set of maneuvers. Secondly, we learn a score function over a dataset of human demonstration trajectories that provides a human-likeness score for the flow primitives. At inference time, computing the optimal trajectory requires selecting the one with the highest score. Our approach improves the state-of-the-art by showing that our CFM policy alone can produce collision-free navigation with a higher success rate than existing learning-based baselines. Furthermore, when augmented with inference-time refinement, our approach can outperform even expensive optimisation-based planning approaches. Finally, we validate that our scoring network can select trajectories closer to the expert data than a manually designed cost function.
comment: Accepted at IEEE ICRA 2026. Authors Antareep Singha and Laksh Nanwani have equal contributions
Stein Variational Ergodic Surface Coverage with SE(3) Constraints
Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods demonstrate success in coverage tasks, while struggling with point-cloud targets due to the nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our proposed approach comprises multiple innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates, and, third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that our method achieves superior coverage quality with tractable computation in our setting relative to existing TO and SAO approaches, and is validated in real-world robot experiments.
DreamFlow: Local Navigation Beyond Observation via Conditional Flow Matching in the Latent Space
Local navigation in cluttered environments often suffers from dense obstacles and frequent local minima. Conventional local planners rely on heuristics and are prone to failure, while deep reinforcement learning (DRL)-based approaches provide adaptability but are constrained by limited onboard sensing. These limitations lead to navigation failures because the robot cannot perceive structures outside its field of view. In this paper, we propose DreamFlow, a DRL-based local navigation framework that extends the robot's perceptual horizon through conditional flow matching (CFM). The proposed CFM-based prediction module learns a probabilistic mapping between a local height-map latent representation and a broader spatial representation, conditioned on navigation context. This enables the navigation policy to predict unobserved environmental features and proactively avoid potential local minima. Experimental results demonstrate that DreamFlow outperforms existing methods in terms of latent prediction accuracy and navigation performance in simulation. The proposed method was further validated in cluttered real-world environments with a quadrupedal robot. The project page is available at https://dreamflow-icra.github.io.
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation CVPR 2026
Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we identify the last-mile problem in zero-shot navigation, namely determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the challenging GOAT-Bench and HM3D-ObjNav benchmarks. The code will be publicly available at https://github.com/ylwhxht/MSGNav.
comment: 18 pages, Accepted by CVPR 2026
CloSE: A Geometric Shape-Agnostic Cloth State Representation ICRA 2026
Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/
comment: Accepted at ICRA 2026 (8 pages, 11 figures, 1 table). Project page: https://close-representation.github.io/
DefVINS: Visual-Inertial Odometry for Deformable Scenes
Deformable scenes violate the rigidity assumptions underpinning classical visual-inertial odometry (VIO), often leading to over-fitting to local non-rigid motion or to severe camera pose drift when deformation dominates visual parallax. In this paper, we introduce DefVINS, the first visual-inertial odometry pipeline designed to operate in deformable environments. Our approach models the odometry state by decomposing it into a rigid, IMU-anchored component and a non-rigid scene warp represented by an embedded deformation graph. As a second contribution, we present VIMandala, the first benchmark containing real images and ground-truth camera poses for visual-inertial odometry in deformable scenes. In addition, we augment the synthetic Drunkard's benchmark with simulated inertial measurements to further evaluate our pipeline under controlled conditions. We also provide an observability analysis of the visual-inertial deformable odometry problem, characterizing how inertial measurements constrain camera motion and render otherwise unobservable modes identifiable in the presence of deformation. This analysis motivates the use of IMU anchoring and leads to a conditioning-based activation strategy that avoids ill-posed updates under poor excitation. Experimental results on both the synthetic Drunkard's and our real VIMandala benchmarks show that DefVINS outperforms rigid visual-inertial and non-rigid visual odometry baselines. Our source code and data will be released upon acceptance.
comment: 4 figures, 2 tables. Submitted to RA-L
Traj2Action: A Co-Denoising Framework for Trajectory-Guided Human-to-Robot Skill Transfer
Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Our work centers on two core objectives: first, the systematic verification of the Traj2Action framework's effectiveness, spanning architectural design, cross-task generalization, and data efficiency; and second, the revelation of key laws that govern robot policy learning during the integration of human hand demonstration data. This research focus enables us to provide a scalable paradigm tailored to address human-to-robot skill transfer across morphological gaps. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action boosts performance by up to 27% and 22.25% over the $π_0$ baseline on short- and long-horizon real-world tasks, respectively, and achieves significant gains as human data scales in robot policy learning.
$χ_{0}$: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies
High-reliability long-horizon robotic manipulation has traditionally relied on large-scale data and compute to understand complex real-world dynamics. However, we identify that the primary bottleneck to real-world robustness is not resource scale alone, but the distributional shift among the human demonstration distribution, the inductive bias learned by the policy, and the test-time execution distribution -- a systematic inconsistency that causes compounding errors in multi-stage tasks. To mitigate these inconsistencies, we propose $χ_{0}$, a resource-efficient framework with effective modules designed to achieve production-level robustness in robotic manipulation. Our approach builds on three technical pillars: (i) Model Arithmetic, a weight-space merging strategy that efficiently soaks up diverse distributions of different demonstrations, varying from object appearance to state variations; (ii) Stage Advantage, a stage-aware advantage estimator that provides stable, dense progress signals, overcoming the numerical instability of prior non-stage approaches; and (iii) Train-Deploy Alignment, which bridges the distribution gap via spatio-temporal augmentation, heuristic DAgger corrections, and temporal chunk-wise smoothing. $χ_{0}$ enables two sets of dual-arm robots to collaboratively orchestrate long-horizon garment manipulation, spanning tasks from flattening and folding to hanging different clothes. Our method exhibits high-reliability autonomy; we are able to run the system from an arbitrary initial state for 24 consecutive hours. Experiments validate that $χ_{0}$ surpasses the state-of-the-art $π_{0.5}$ in success rate by nearly 250%, with only 20 hours of data and 8 A100 GPUs. Code, data and models will be released to facilitate the community.
UGotMe: An Embodied System for Affective Human-Robot Interaction ICRA
Equipping humanoid robots with the capability to understand emotional states of human interactants and express emotions appropriately according to situations is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing the environmental noise issue and meeting real-time requirements. First, in multiparty conversation scenarios, the noise inherent in the visual observations of the robot, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the field of view of the robot, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://lipzh5.github.io/HumanoidVLE/.
comment: Accepted to the 2025 IEEE International Conference on Robotics and Automation (ICRA)
EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval NeurIPS 2025
Object-goal navigation (ObjNav) tasks an agent with navigating to the location of a specific object in an unseen environment. Embodied agents equipped with large language models (LLMs) and online constructed navigation maps can perform ObjNav in a zero-shot manner. However, existing agents rely heavily on giant LLMs in the cloud, e.g., GPT-4, while directly switching to small LLMs, e.g., LLaMA3.2-11b, causes significant success rate drops due to limited model capacity for understanding complex navigation maps, which prevents deploying ObjNav on local devices. At the same time, the long prompt introduced by the navigation map description causes high planning latency on local devices. In this paper, we propose EfficientNav to enable efficient on-device LLM-based zero-shot ObjNav. To help the smaller LLMs better understand the environment, we propose semantics-aware memory retrieval to prune redundant information in navigation maps. To reduce planning latency, we propose discrete memory caching and attention-based memory clustering to efficiently save and re-use the KV cache. Extensive experimental results demonstrate that EfficientNav achieves an 11.1% improvement in success rate on the HM3D benchmark over GPT-4-based baselines, and demonstrates a 6.7x real-time latency reduction and a 4.7x end-to-end latency reduction over the GPT-4 planner. Our code is available at https://github.com/PKU-SEC-Lab/EfficientNav.
comment: NeurIPS 2025
WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning
Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WMs) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM, a hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.
comment: 8 pages, 6 figures, 5 tables; submitted to IEEE
KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. While it is possible to store and reuse the KV caches, the efficiency benefits are greatly undermined by frequent KV cache updates. In this paper, we propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. KEEP features 3 key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation through mixed-granularity memory groups; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among different memory groups and reconstructs memory interactions iteratively; (3) a Layer-balanced Memory Loading scheme that eliminates unbalanced KV cache loading and cross-attention computation across different layers. Extensive experimental results demonstrate that KEEP achieves a 2.68x speedup with negligible accuracy loss compared with text-based memory methods on the ALFRED dataset. Compared with the KV re-computation method CacheBlend (EuroSys'25), KEEP shows a 4.13% success rate improvement and a 1.90x time-to-first-token (TTFT) reduction. Our code is available at https://github.com/PKU-SEC-Lab/KEEP_Embodied_Memory.
comment: DAC 2026
DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while simultaneously reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available on https://github.com/PKU-SEC-Lab/DYSL_VLA.
comment: DAC 2026
CLAIM: Camera-LiDAR Alignment with Intensity and Monodepth IROS 2025
In this paper, we unleash the potential of the powerful monodepth model in camera-LiDAR calibration and propose CLAIM, a novel method of aligning data from the camera and LiDAR. Given the initial guess and pairs of images and LiDAR point clouds, CLAIM utilizes a coarse-to-fine searching method to find the optimal transformation minimizing a patched Pearson correlation-based structure loss and a mutual information-based texture loss. These two losses serve as good metrics for camera-LiDAR alignment results and require no complicated steps of data processing, feature extraction, or feature matching like most methods, rendering our method simple and adaptive to most scenes. We validate CLAIM on public KITTI, Waymo, and MIAS-LCEC datasets, and the experimental results demonstrate its superior performance compared with the state-of-the-art methods. The code is available at https://github.com/Tompson11/claim.
comment: Accepted by IROS 2025
An Intention-driven Lane Change Framework Considering Heterogeneous Dynamic Cooperation in Mixed-traffic Environment
In mixed-traffic environments, autonomous vehicles (AVs) must interact with heterogeneous human-driven vehicles (HVs) whose intentions and driving styles vary across individuals and scenarios. Such variability introduces uncertainty into lane change interactions, where safety and efficiency critically depend on accurately anticipating surrounding drivers' cooperative responses. Existing methods often oversimplify these interactions by assuming uniform or fixed behavioral patterns. To address this limitation, we propose an intention-driven lane change framework that integrates driving-style recognition with cooperation-aware decision-making and motion-planning. A deep learning-based classifier identifies distinct human driving styles in real time. We then introduce a dual-perspective cooperation score composed of intrinsic style-dependent tendencies and interactive dynamic components, enabling interpretable and adaptive intention prediction and quantitative inference. A decision-making module combines behavior cloning (BC) and inverse reinforcement learning (IRL) to determine lane change feasibility. Later, a coordinated motion-planning architecture integrating IRL-based intention inference with model predictive control (MPC) is established to generate collision-free and socially compliant trajectories. Experiments on the NGSIM dataset show that the proposed decision-making model outperforms representative rule-based and learning-based baselines, achieving 96.98% accuracy in lane change classification. Motion-planning evaluations further demonstrate improved maneuver success and execution stability in mixed-traffic environments. These results validate the effectiveness of structured cooperation modeling for intention-driven autonomous lane changes.
Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/
Haptic Light-Emitting Diodes: Miniature, Luminous Tactile Actuators
We present Haptic Light-Emitting Diodes (HLEDs), luminous thermopneumatic actuators that directly convert pulsed light into mechanical forces and displacements. Each device packages a miniature surface-mount LED in a gas-filled cavity that contains a low-inertia graphite photoabsorber. The cavity is sealed by an elastic membrane, which functions as a working diaphragm. Brief optical pulses heat the photoabsorber, which heats the gas. The resulting rapid pressure increases generate forces and displacements at the working diaphragm. Millimeter-scale HLEDs produce forces exceeding 0.4 N and displacements of 0.9 mm at low voltages, with 5 to 100 ms response times, making them attractive as actuators providing tactile feedback in human-machine interfaces. Unusually, these actuators are also light-emitting, as a fraction of optical energy is transmitted through the membrane. These photomechanical actuators have many potential applications in tactile displays, human interface engineering, wearable computing, and other areas.
Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description
Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.
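One possible reading of "centroid prompt ensembling" in the VLM-IRIS abstract above is to average the text embeddings of several prompts per class into a single centroid, then classify an image by cosine similarity to each centroid. The sketch below follows that reading using synthetic vectors in place of real CLIP embeddings. The class names, embedding dimension, and helper functions are all assumptions for illustration, and the paper's actual procedure may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def class_centroids(prompt_embeddings):
    """Map class name -> unit-norm mean of that class's unit-norm prompt embeddings."""
    return {name: l2_normalize(l2_normalize(emb).mean(axis=0))
            for name, emb in prompt_embeddings.items()}

def classify(image_embedding, centroids):
    """Return the class whose centroid has the highest cosine similarity."""
    img = l2_normalize(image_embedding)
    scores = {name: float(img @ c) for name, c in centroids.items()}
    return max(scores, key=scores.get), scores

# Synthetic stand-ins for text and image embeddings (no CLIP model is called here).
rng = np.random.default_rng(0)
dim = 16
present_dir = np.eye(dim)[0]
empty_dir = np.eye(dim)[1]
prompts = {
    "workpiece present": present_dir + 0.1 * rng.standard_normal((5, dim)),
    "empty build plate": empty_dir + 0.1 * rng.standard_normal((5, dim)),
}
centroids = class_centroids(prompts)
image = present_dir + 0.1 * rng.standard_normal(dim)
label, scores = classify(image, centroids)
```

Averaging several phrasings tends to reduce sensitivity to the wording of any single prompt, which is the usual motivation for prompt ensembling.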
SHaRe-RL: Structured, Interactive Reinforcement Learning for Contact-Rich Industrial Assembly Tasks ICRA
High-mix low-volume (HMLV) industrial assembly, common in small and medium-sized enterprises (SMEs), requires the same precision, safety, and reliability as high-volume automation while remaining flexible to product variation and environmental uncertainty. Current robotic systems struggle to meet these demands. Manual programming is brittle and costly to adapt, while learning-based methods suffer from poor sample efficiency and unsafe exploration in contact-rich tasks. To address this, we present SHaRe-RL, a reinforcement learning framework that leverages multiple sources of prior knowledge. By (i) structuring skills into manipulation primitives, (ii) incorporating human demonstrations and online corrections, and (iii) bounding interaction forces with per-axis compliance, SHaRe-RL enables efficient and safe online learning for long-horizon, contact-rich industrial assembly tasks. Experiments on the insertion of industrial Harting connector modules with 0.2-0.4 mm clearance demonstrate that SHaRe-RL achieves reliable performance within practical time budgets. Our results show that process expertise, without requiring robotics or RL knowledge, can meaningfully contribute to learning, enabling safer, more robust, and more economically viable deployment of RL for industrial assembly.
comment: 8 pages, 8 figures, accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Ontological foundations for contrastive explanatory narration of robot plans
Mutual understanding of artificial agents' decisions is key to ensuring a trustworthy and successful human-robot interaction. Hence, robots are expected to make reasonable decisions and communicate them to humans when needed. In this article, the focus is on an approach to modeling and reasoning about the comparison of two competing plans, so that robots can later explain the divergent result. First, a novel ontological model is proposed to formalize and reason about the differences between competing plans, enabling the classification of the most appropriate one (e.g., the shortest, the safest, the closest to human preferences, etc.). This work also investigates the limitations of a baseline algorithm for ontology-based explanatory narration. To address these limitations, a novel algorithm is presented, leveraging divergent knowledge between plans and facilitating the construction of contrastive narratives. Through empirical evaluation, it is observed that the resulting explanations outperform those of the baseline method.
CompliantVLA-adaptor: VLM-Guided Variable Impedance Action for Safe Contact-Rich Manipulation
We propose a CompliantVLA-adaptor that augments the state-of-the-art Vision-Language-Action (VLA) models with vision-language model (VLM)-informed context-aware variable impedance control (VIC) to improve the safety and effectiveness of contact-rich robotic manipulation tasks. Existing VLA systems (e.g., RDT, Pi0.5, OpenVLA-oft) typically output position, but lack force-aware adaptation, leading to unsafe or failed interactions in physical tasks involving contact, compliance, or uncertainty. In the proposed CompliantVLA-adaptor, a VLM interprets task context from images and natural language to adapt the stiffness and damping parameters of a VIC controller. These parameters are further regulated using real-time force/torque feedback to ensure interaction forces remain within safe thresholds. We demonstrate that our method outperforms the VLA baselines on a suite of complex contact-rich tasks, both in simulation and the real world, with improved success rates and reduced force violations. This work presents a promising path towards a safe foundation model for physical contact-rich manipulation. We release our code, prompts, and force-torque-impedance-scenario context datasets at https://sites.google.com/view/compliantvla.
comment: under review
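The variable impedance idea in the CompliantVLA-adaptor abstract above can be illustrated with a standard spring-damper impedance law plus a simple force-based stiffness reduction. This is a minimal sketch under stated assumptions: the proportional scaling rule, the `k_min` floor, and all numeric values are hypothetical and are not the paper's regulation scheme.

```python
import numpy as np

def impedance_force(x, v, x_des, v_des, stiffness, damping):
    """Spring-damper impedance law: F = K (x_des - x) + D (v_des - v)."""
    return stiffness * (x_des - x) + damping * (v_des - v)

def regulate_stiffness(stiffness, measured_force, force_limit, k_min=50.0):
    """Hypothetical rule: scale stiffness down in proportion to the force overshoot."""
    magnitude = float(np.linalg.norm(measured_force))
    if magnitude <= force_limit:
        return stiffness
    return max(k_min, stiffness * force_limit / magnitude)

# Measured contact force (N) is twice the limit, so stiffness is halved.
k_nominal = 800.0                                   # N/m, illustrative value
k_safe = regulate_stiffness(k_nominal, np.array([0.0, 0.0, 40.0]), force_limit=20.0)
k_same = regulate_stiffness(k_nominal, np.array([0.0, 0.0, 10.0]), force_limit=20.0)

# Commanded force for a 1 cm position error at rest with the reduced stiffness.
force = impedance_force(np.zeros(3), np.zeros(3),
                        np.array([0.01, 0.0, 0.0]), np.zeros(3),
                        stiffness=k_safe, damping=40.0)
```

In the abstract's setup, a VLM would supply the nominal stiffness and damping from task context. Here they are fixed constants.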
Real-Time Quasi-Static Modeling of UAV Tether Aerodynamics
One of the main limitations of multirotor UAVs is their short flight time due to battery constraints. A practical solution for continuous operation is to power the drone from the ground via a tether. While this approach has been demonstrated for stationary systems, scenarios with a fast-moving base vehicle or strong wind conditions require modeling the tether forces, including aerodynamic effects. In this work, we propose two complementary approaches for real-time quasi-static tether modeling with aerodynamics. The first is an analytical method based on catenary theory with a uniform drag assumption, achieving very fast solve times below 1 ms. The second is a numerical method that discretizes the tether into segments and lumped masses, solving the equilibrium equations using CasADi and IPOPT. By leveraging initialization strategies, such as warm starting and analytical initialization, real-time performance was achieved with a solve time of 5 ms, while allowing for flexible force formulations. Both approaches were validated in real-world tests using a load cell to measure the tether force. The results show that the analytical method provides sufficient accuracy for most tethered UAV applications with minimal computational cost, while the numerical method offers higher flexibility and physical accuracy when required. These approaches form a lightweight and extensible framework for real-time tether simulation, applicable to both offline optimization and online tasks such as simulation, control, and trajectory planning.
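To make the catenary-based approach in the tether abstract above more concrete, the sketch below evaluates the classical catenary relations for a cable under a constant load per unit length. Treating uniform drag as an extra constant load that combines with weight into a single effective load is an assumption made for this example. The paper's exact formulation may differ, and all numeric values are hypothetical.

```python
import numpy as np

def catenary(x, horizontal_tension, load_per_m):
    """Sag, arc length, and tension at horizontal offset x from the lowest point."""
    a = horizontal_tension / load_per_m            # catenary parameter [m]
    sag = a * (np.cosh(x / a) - 1.0)               # height above the lowest point [m]
    arc = a * np.sinh(x / a)                       # cable length from the lowest point [m]
    tension = horizontal_tension * np.cosh(x / a)  # tension magnitude [N]
    return sag, arc, tension

# Hypothetical tether: 50 g/m, with an assumed uniform drag of 0.3 N/m.
weight_per_m = 0.05 * 9.81
drag_per_m = 0.3
# Assumption: weight and drag act at right angles and combine into one constant load.
load = float(np.hypot(weight_per_m, drag_per_m))

h_tension = 5.0
sag, arc, tension = catenary(10.0, horizontal_tension=h_tension, load_per_m=load)
```

A useful sanity check is that the tension satisfies T^2 = H^2 + (load * arc)^2. That is, the component along the load direction balances the total load carried between the lowest point and x.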
System Design of the Ultra Mobility Vehicle: A Driving, Balancing, and Jumping Bicycle Robot
Trials cyclists and mountain bike riders can hop, jump, balance, and drive on one or both wheels. This versatility allows them to achieve speed and energy-efficiency on smooth terrain and agility over rough terrain. Inspired by these athletes, we present the design and control of a robotic platform, Ultra Mobility Vehicle (UMV), which combines a bicycle and a reaction mass to move dynamically with minimal actuated degrees of freedom. We employ a simulation-driven design optimization process to synthesize a spatial linkage topology with a focus on vertical jump height and momentum-based balancing on a single wheel contact. Using a constrained Reinforcement Learning (RL) framework, we demonstrate zero-shot transfer of diverse athletic behaviors, including track-stands, jumps, wheelies, rear wheel hopping, and front flips. This 23.5 kg robot is capable of high speeds (8 m/s) and jumping on and over large obstacles (1 m tall, or 130% of the robot's nominal height).
comment: 17 Pages, 11 figures, 3 movies, 2 tables
Contraction Theory for Nonlinear Stability Analysis and Learning-based Control: A Tutorial Overview
Contraction theory is an analytical tool to study differential dynamics of a non-autonomous (i.e., time-varying) nonlinear system under a contraction metric defined with a uniformly positive definite matrix, the existence of which results in a necessary and sufficient characterization of incremental exponential stability of multiple solution trajectories with respect to each other. By using a squared differential length as a Lyapunov-like function, its nonlinear stability analysis boils down to finding a suitable contraction metric that satisfies a stability condition expressed as a linear matrix inequality, indicating that many parallels can be drawn between well-known linear systems theory and contraction theory for nonlinear systems. Furthermore, contraction theory takes advantage of a superior robustness property of exponential stability used in conjunction with the comparison lemma. This yields much-needed safety and stability guarantees for neural network-based control and estimation schemes, without resorting to a more involved method of using uniform asymptotic stability for input-to-state stability. Such distinctive features permit the systematic construction of a contraction metric via convex optimization, thereby obtaining an explicit exponential bound on the distance between a time-varying target trajectory and solution trajectories perturbed externally due to disturbances and learning errors. The objective of this paper is, therefore, to present a tutorial overview of contraction theory and its advantages in nonlinear stability analysis of deterministic and stochastic systems, with an emphasis on deriving formal robustness and stability guarantees for various learning-based and data-driven automatic control methods. In particular, we provide a detailed review of techniques for finding contraction metrics and associated control and estimation laws using deep neural networks.
comment: Annual Reviews in Control, Preprint Version, Accepted, Oct. 1st
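For readers unfamiliar with the condition the tutorial abstract above refers to, the standard textbook form of the contraction condition is sketched below for a system $\dot{x} = f(x,t)$ with virtual displacement $\delta x$. The symbols $M$, $\alpha$, $\underline{m}$, and $\overline{m}$ are notation chosen for this note and may not match the paper's.

```latex
% Differential dynamics and metric bounds
\delta\dot{x} = \frac{\partial f}{\partial x}(x,t)\,\delta x,
\qquad
\underline{m}\, I \preceq M(x,t) \preceq \overline{m}\, I .

% Contraction condition with rate \alpha > 0 (linear in M for a given f)
\dot{M} + M \frac{\partial f}{\partial x}
        + \frac{\partial f}{\partial x}^{\top} M \preceq -2\alpha M .

% Consequence for the Lyapunov-like function V = \delta x^{\top} M\, \delta x
\dot{V} \le -2\alpha V
\quad\Longrightarrow\quad
\|\delta x(t)\| \le \sqrt{\overline{m}/\underline{m}}\;
                    \|\delta x(0)\|\, e^{-\alpha t} .
```

The last line is the incremental exponential stability statement: any two solution trajectories converge to each other exponentially at rate $\alpha$.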
BiGraspFormer: End-to-End Bimanual Grasp Transformer
Bimanual grasping is essential for robots to handle large and complex objects. However, existing methods either focus solely on single-arm grasping or employ separate grasp generation and bimanual evaluation stages, leading to coordination problems including collision risks and unbalanced force distribution. To address these limitations, we propose BiGraspFormer, a unified end-to-end transformer framework that directly generates coordinated bimanual grasps from object point clouds. Our key idea is the Single-Guided Bimanual (SGB) strategy, which first generates diverse single grasp candidates using a transformer decoder, then leverages their learned features through specialized attention mechanisms to jointly predict bimanual poses and quality scores. This conditioning strategy reduces the complexity of the 12-DoF search space while ensuring coordinated bimanual manipulation. Comprehensive simulation experiments and real-world validation demonstrate that BiGraspFormer consistently outperforms existing methods while maintaining efficient inference speed (<0.05s), confirming the effectiveness of our framework. Code and supplementary materials are available at https://sites.google.com/view/bigraspformer
comment: 8 pages, 5 figures
Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy ICRA2026
Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp
comment: Accepted by ICRA2026
Minimal Intervention Shared Control with Guaranteed Safety under Non-Convex Constraints ICRA
Shared control combines human intention with autonomous decision-making. At the low level, the primary goal is to maintain safety regardless of the user's input to the system. However, existing shared control methods (based on, e.g., Model Predictive Control, Control Barrier Functions, or learning-based control) often face challenges with feasibility, scalability, and mixed constraints. To address these challenges, we propose a Constraint-Aware Assistive Controller that computes control actions online while ensuring recursive feasibility, strict constraint satisfaction, and minimal deviation from the user's intent. It also accommodates a structured class of non-convex constraints common in real-world settings. We leverage Robust Controlled Invariant Sets for recursive feasibility and a Mixed-Integer Quadratic Programming formulation to handle non-convex constraints. We validate the approach through a large-scale user study with 66 participants, one of the most extensive in shared control research, using a simulated environment to assess task load, trust, and perceived control, in addition to performance. The results show consistent improvements across all these aspects without compromising safety and user intent. Additionally, a real-world experiment on a robotic manipulator demonstrates the framework's applicability under bounded disturbances, ensuring safety and collision-free operation.
comment: Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA)
When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
Learning from demonstrations enables experts to teach robots complex tasks using interfaces such as kinesthetic teaching, joystick control, and sim-to-real transfer. However, these interfaces often constrain the expert's ability to demonstrate optimal behavior due to indirect control, setup restrictions, and hardware safety. For example, a joystick can move a robotic arm only in a 2D plane, even though the robot operates in a higher-dimensional space. As a result, the demonstrations collected by constrained experts lead to suboptimal performance of the learned policies. This raises a key question: Can a robot learn a better policy than the one demonstrated by a constrained expert? We address this by allowing the agent to go beyond direct imitation of expert actions and explore shorter and more efficient trajectories. We use the demonstrations to infer a state-only reward signal that measures task progress, and self-label reward for unknown states using temporal interpolation. Our approach outperforms common imitation learning in both sample efficiency and task completion time. On a real WidowX robotic arm, it completes the task in 12 seconds, 10x faster than behavioral cloning, as shown in real-robot videos on https://sites.google.com/view/constrainedexpert .
One-Shot Badminton Shuttle Detection for Mobile Robots
This paper presents a robust one-shot badminton shuttlecock detection framework for non-stationary robots. To address the lack of egocentric shuttlecock detection datasets, we introduce a dataset of 20,510 semi-automatically annotated frames captured across 11 distinct backgrounds in diverse indoor and outdoor environments, and categorize each frame into one of three difficulty levels. For labeling, we present a novel semi-automatic annotation pipeline that enables efficient labeling from stationary camera footage. We propose a metric suited to our downstream use case and fine-tune a YOLOv8 network optimized for real-time shuttlecock detection, achieving an F1-score of 0.86 under our metric in test environments similar to training, and 0.70 in entirely unseen environments. Our analysis reveals that detection performance is critically dependent on shuttlecock size and background texture complexity. Qualitative experiments confirm the detector's applicability to robots with moving cameras. Unlike prior work with stationary camera setups, our detector is specifically designed for the egocentric, dynamic viewpoints of mobile robots, providing a foundational building block for downstream tasks, including tracking, trajectory estimation, and system (re)-initialization.
Metamorphic Testing of Vision-Language Action-Enabled Robots
Vision-Language-Action (VLA) models are multimodal robotic task controllers that, given an instruction and visual inputs, produce a sequence of low-level control actions (or motor commands) enabling a robot to execute the requested task in the physical environment. These systems face the test oracle problem from multiple perspectives. On the one hand, a test oracle must be defined for each instruction prompt, which is a complex and non-generalizable approach. On the other hand, current state-of-the-art oracles typically capture symbolic representations of the world (e.g., robot and object states), enabling the correctness evaluation of a task, but fail to assess other critical aspects, such as the quality with which VLA-enabled robots perform a task. In this paper, we explore whether Metamorphic Testing (MT) can alleviate the test oracle problem in this context. To do so, we propose two metamorphic relation patterns and five metamorphic relations to assess whether changes to the test inputs impact the original trajectory of the VLA-enabled robots. An empirical study involving five VLA models, two simulated robots, and four robotic tasks shows that MT can effectively alleviate the test oracle problem by automatically detecting diverse types of failures, including, but not limited to, uncompleted tasks. More importantly, the proposed MRs are generalizable, making the proposed approach applicable across different VLA models, robots, and tasks, even in the absence of test oracles.
Trust in Autonomous Human-Robot Collaboration: Effects of Responsive Interaction Policies
Trust plays a central role in human-robot collaboration, yet its formation is rarely examined under the constraints of fully autonomous interaction. This pilot study investigated how interaction policy influences trust during in-person collaboration with a social robot operating without Wizard-of-Oz control or scripted repair. Participants completed a multi-stage collaborative task with a mobile robot that autonomously managed spoken-language dialogue, affect inference, and task progression. Two interaction policies were compared: a responsive policy, in which the robot proactively adapted its dialogue and assistance based on inferred interaction state, and a neutral, reactive policy, in which the robot provided only direct, task-relevant responses when prompted. Responsive interaction was associated with significantly higher post-interaction trust under viable communication conditions, despite no reliable differences in overall task accuracy. Sensitivity analyses indicated that affective and experiential components of trust were more sensitive to communication breakdown than evaluative judgments of reliability, and that as language-mediated interaction degraded, the trust advantage associated with responsiveness attenuated and ratings became less clearly interpretable as calibrated evaluations of collaborative competence. These findings suggest that trust in autonomous human-robot interaction emerges from process-level interaction dynamics and operates within constraints imposed by communication viability, highlighting the importance of evaluating trust under real autonomy conditions when designing interactive robotic systems.
Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry CVPR 2026
Visual-Inertial Odometry (VIO) is a critical component for robust ego-motion estimation, enabling foundational capabilities such as autonomous navigation in robotics and real-time 6-DoF tracking for augmented reality. Existing methods face a well-known trade-off: filter-based approaches are efficient but prone to drift, while optimization-based methods, though accurate, rely on computationally prohibitive Visual-Inertial Bundle Adjustment (VIBA) that is difficult to run on resource-constrained platforms. Rather than removing VIBA altogether, we aim to reduce how often and how heavily it must be invoked. To this end, we cast two key design choices in modern VIO, when to run the visual frontend and how strongly to trust its output, as sequential decision problems, and solve them with lightweight reinforcement learning (RL) agents. Our framework introduces a lightweight, dual-pronged RL policy that serves as our core contribution: (1) a Select Agent intelligently gates the entire VO pipeline based only on high-frequency IMU data; and (2) a composite Fusion Agent that first estimates a robust velocity state via a supervised network, before an RL policy adaptively fuses the full (p, v, q) state. Experiments on the EuRoC MAV and TUM-VI datasets show that, in our unified evaluation, the proposed method achieves a more favorable accuracy-efficiency-memory trade-off than prior GPU-based VO/VIO systems: it attains the best average ATE while running up to 1.77 times faster and using less GPU memory. Compared to classical optimization-based VIO systems, our approach maintains competitive trajectory accuracy while substantially reducing computational load.
comment: Accepted to the CVPR 2026 Main Track
Optimal Solutions for the Moving Target Vehicle Routing Problem via Branch-and-Price with Relaxed Continuity ICAPS 2026
The Moving Target Vehicle Routing Problem (MT-VRP) seeks trajectories for several agents that intercept a set of moving targets, subject to speed, time window, and capacity constraints. We introduce an exact algorithm, Branch-and-Price with Relaxed Continuity (BPRC), for the MT-VRP. The main challenge in a branch-and-price approach for the MT-VRP is the pricing subproblem, which is complicated by moving targets and time-dependent travel costs between targets. Our key contribution is a new labeling algorithm that solves this subproblem by means of a novel dominance criterion tailored for problems with moving targets. Numerical results on instances with up to 25 targets show that our algorithm finds optimal solutions more than an order of magnitude faster than a baseline based on previous work, showing particular strength in scenarios with limited agent capacities.
comment: Accepted to ICAPS 2026
REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning
Humanoid loco-manipulation requires coordinated high-level motion plans with stable, low-level whole-body execution under complex robot-environment dynamics and long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high-dimensional humanoid systems. To address this challenge, we present REFINE-DP (REinforcement learning FINE-tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high-level planner and an RL-based low-level loco-manipulation controller. The DP is fine-tuned via a PPO-based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE-DP on a humanoid robot performing loco-manipulation tasks, including door traversal and long-horizon object transport. REFINE-DP achieves a success rate above 90% in simulation, even in out-of-distribution cases not seen in the pre-trained data, and enables smooth autonomous task execution in real-world dynamic environments. Our proposed method substantially outperforms pre-trained DP baselines and demonstrates that RL fine-tuning is key to reliable humanoid loco-manipulation. https://refine-dp.github.io/REFINE-DP/
Volumetric Ergodic Control ICRA
Ergodic control synthesizes optimal coverage behaviors over spatial distributions for nonlinear systems. However, existing formulations model the robot as a non-volumetric point, whereas in practice a robot interacts with the environment through its body and sensors with physical volume. In this work, we introduce a new ergodic control formulation that optimizes spatial coverage using a volumetric state representation. Our method preserves the asymptotic coverage guarantees of ergodic control, adds minimal computational overhead for real-time control, and supports arbitrary sample-based volumetric models. We evaluate our method across search and manipulation tasks -- with multiple robot dynamics and end-effector geometries or sensor models -- and show that it improves coverage efficiency by more than a factor of two while maintaining a 100% task completion rate across all experiments, outperforming the standard ergodic control method. Finally, we demonstrate the effectiveness of our method on a robot arm performing mechanical erasing tasks. Project website: https://murpheylab.github.io/vec/
comment: 8 pages, 8 figures; Accepted to 2026 IEEE International Conference on Robotics and Automation (ICRA); Project website: https://murpheylab.github.io/vec/
SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We make the benchmark and evaluation publicly available at https://github.com/apple/ml-sobench
comment: v3 preprint. Added the link to the public benchmark
Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking
Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. There has been significant work on incorporating Lyapunov-based stability guarantees into RL algorithms, with key challenges being the selection of a candidate Lyapunov function, the computational complexity introduced by excessive function approximators, and the overly conservative policies that result from incorporating the stability criterion into the learning process. In this work, we propose a novel Lyapunov-constrained Soft Actor-Critic (LC-SAC) algorithm using Koopman operator theory. We propose the use of extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system and use this approximation to derive a closed-form solution for a candidate Lyapunov function. This derived Lyapunov function is incorporated into the SAC algorithm to further provide guarantees for a policy that stabilizes the nonlinear system. The results are evaluated on trajectory tracking in a 2D quadrotor environment based on safe-control-gym. The proposed algorithm shows training convergence and decaying violations of the Lyapunov stability criterion compared to the baseline vanilla SAC algorithm. GitHub Repository: https://github.com/DhruvKushwaha/LC-SAC-Quadrotor-Trajectory-Tracking
comment: 11 pages, 7 Figures, submitted to IEEE RA-L
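The EDMD-then-Lyapunov pipeline described in the LC-SAC abstract above can be sketched on a toy system. The example below fits a Koopman matrix by least squares on lifted snapshots, then obtains a quadratic Lyapunov candidate from the discrete Lyapunov equation. The toy dynamics, the dictionary `lift`, and the series-based solver are assumptions chosen so that the result can be checked exactly. They are not taken from the paper or its repository.

```python
import numpy as np

A_COEF, B_COEF, C_COEF = 0.9, 0.5, 0.2   # toy parameters (illustrative)

def step(x):
    """Toy nonlinear map with a finite-dimensional Koopman-invariant subspace."""
    return np.array([A_COEF * x[0], B_COEF * x[1] + C_COEF * x[0] ** 2])

def lift(x):
    """Dictionary of observables: [x1, x2, x1^2]."""
    return np.array([x[0], x[1], x[0] ** 2])

def edmd(X, Y, lift_fn):
    """Least-squares Koopman matrix K such that lift(y) is approximately K @ lift(x)."""
    PX = np.array([lift_fn(x) for x in X])
    PY = np.array([lift_fn(y) for y in Y])
    K_T, *_ = np.linalg.lstsq(PX, PY, rcond=None)   # solves PX @ K.T = PY
    return K_T.T

def discrete_lyapunov(A, Q, iters=500):
    """Solve P = A.T P A + Q via the series sum_k (A.T)^k Q A^k (A must be Schur stable)."""
    P = np.zeros_like(Q)
    term = Q.copy()
    for _ in range(iters):
        P += term
        term = A.T @ term @ A
    return P

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))    # sampled states
Y = np.array([step(x) for x in X])          # their one-step successors

K = edmd(X, Y, lift)
K_true = np.array([[A_COEF, 0.0, 0.0],
                   [0.0, B_COEF, C_COEF],
                   [0.0, 0.0, A_COEF ** 2]])
Q = np.eye(3)
P = discrete_lyapunov(K, Q)                 # candidate V(z) = z.T P z in lifted coordinates
```

For this toy system the lifted dynamics are exactly linear, so EDMD recovers `K_true` and `V` decreases by `z.T Q z` at every step. For a real quadrotor model the linear approximation would only hold approximately.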
AgriChrono: A Multi-modal Dataset Capturing Crop Growth and Lighting Variability with a Field Robot
Advances in AI and Robotics have accelerated significant initiatives in agriculture, particularly in the areas of robot navigation and 3D digital twin creation. A significant bottleneck impeding this progress is the critical lack of "in-the-wild" datasets that capture the full complexities of real farmland, including non-rigid motion from wind, drastic illumination variance, and morphological changes resulting from growth. This data gap fundamentally limits research on robust AI models for autonomous field navigation and scene-level dynamic 3D reconstruction. In this paper, we present AgriChrono, a modular robotic data collection platform and multi-modal dataset designed to capture these dynamic farmland conditions. Our platform integrates multiple sensors, enabling remote, time-synchronized acquisition of RGB, Depth, LiDAR, IMU, and Pose data for efficient and repeatable long-term data collection in real-world agricultural environments. We successfully collected 18TB of data over one month, documenting the entire growth cycle of Canola under diverse illumination conditions. We benchmark state-of-the-art 3D reconstruction methods on AgriChrono, revealing the profound challenge of reconstructing high-fidelity, dynamic non-rigid scenes in such farmland settings. This benchmark validates AgriChrono as a critical asset for advancing model generalization, and its public release is expected to significantly accelerate research and development in precision agriculture. The code and dataset are publicly available at: https://github.com/StructuresComp/agri-chrono
comment: Keywords: Agricultural Robotics, In-the-wild Dataset, 3D Reconstruction
Push, Press, Slide: Mode-Aware Planar Contact Manipulation via Reduced-Order Models IROS 2026
Non-prehensile planar manipulation, including pushing and press-and-slide, is critical for diverse robotic tasks, but notoriously challenging due to hybrid contact mechanics, under-actuation, and asymmetric friction limits that traditionally necessitate computationally expensive iterative control. In this paper, we propose a mode-aware framework for planar manipulation with one or two robotic arms based on contact topology selection and reduced-order kinematic modeling. Our core insight is that complex wrench-twist limit surface mechanics can be abstracted into a discrete library of physically intuitive models. We systematically map various single-arm and bimanual contact topologies to simple non-holonomic formulations, e.g. unicycle for simplified press-and-slide motion. By anchoring trajectory generation to these reduced-order models, our framework computes the required object wrench and distributes feasible, friction-bounded contact forces via a direct algebraic allocator. We incorporate manipulator kinematics to ensure long-horizon feasibility and demonstrate our fast, optimization-free approach in simulation across diverse single-arm and bimanual manipulation tasks. Supplementary videos and additional information are available at: https://sites.google.com/view/pushpressslide
comment: 8 pages, 13 figures. Submitted to IEEE IROS 2026
Dual Quaternion Based Contact Modeling for Fast and Smooth Collision Recovery of Quadrotors
Unmanned aerial vehicles (UAVs) operating in cluttered environments require accurate impact modeling to maintain stability post collisions. However, conventional contact models decouple linear and angular impulses, risking manifold inconsistency during rapid state transitions. This letter presents a dual quaternion reset map that resolves rigid-body impacts directly on the SE(3) manifold. By operating on the unified spatial twist (linear and angular velocities as a single dual entity), the proposed formulation is shown to be algebraically equivalent to the classical Newton impulse model while preserving manifold consistency during discrete state jumps. Building on this framework, a hybrid recovery controller is designed that couples linear and angular momentum to ensure strict energy dissipation across impacts. Hardware-in-the-loop benchmarks demonstrate a 24% reduction in execution latency compared to an optimized matrix-based implementation. High-fidelity MuJoCo simulations validate the controller's response to complex contact dynamics, with Monte Carlo trials showing a 56.3% reduction in post-impact root-mean-square error (RMSE) and a 61.1% decrease in peak kinetic energy compared to decoupled baseline controllers.
comment: 7 pages, 5 figures
TurboMap: GPU-Accelerated Local Mapping for Visual SLAM
In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.
Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we found that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we found that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at https://continual-vlas.github.io/forget-me-not/
comment: Project website: https://continual-vlas.github.io/forget-me-not/
Bundle Adjustment in the Eager Mode
Bundle adjustment (BA) is a critical technique in various robotic applications such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA libraries, such as GTSAM, g$^2$o, and Ceres Solver, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA library seamlessly integrated with PyTorch with high efficiency. Our approach includes a sparsity-aware auto-differentiation design and GPU-accelerated sparse operations designed for 2nd-order optimization. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ across all benchmarks compared to GTSAM, g$^2$o, and Ceres, respectively.
Multiagent Systems
CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation
Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.
comment: Gengxin Sun and Ruihao Yu contributed equally to this research. Bin Zhang and Zhiwei Xu are the corresponding authors. 11 pages, 6 figures
Communication-Aware Multi-Agent Reinforcement Learning for Decentralized Cooperative UAV Deployment
Autonomous Unmanned Aerial Vehicle (UAV) swarms are increasingly used as rapidly deployable aerial relays and sensing platforms, yet practical deployments must operate under partial observability and intermittent peer-to-peer links. We present a graph-based multi-agent reinforcement learning framework trained under centralized training with decentralized execution (CTDE): a centralized critic and global state are available only during training, while each UAV executes a shared policy using local observations and messages from nearby neighbors. Our architecture encodes local agent state and nearby entities with an agent-entity attention module, and aggregates inter-UAV messages with neighbor self-attention over a distance-limited communication graph. We evaluate primarily on a cooperative relay deployment task (DroneConnect) and secondarily on an adversarial engagement task (DroneCombat). In DroneConnect, the proposed method achieves high coverage under restricted communication and partial observation (e.g. 74% coverage with M = 5 UAVs and N = 10 nodes) while remaining competitive with a mixed-integer linear programming (MILP) optimization-based offline upper bound, and it generalizes to unseen team sizes without fine-tuning. In the adversarial setting, the same framework transfers without architectural changes and improves win rate over non-communicating baselines.
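Message aggregation over a distance-limited communication graph can be sketched as masked softmax attention. The snippet below is an illustrative assumption rather than the paper's architecture: the function name `neighbor_attention`, the plain dot-product scores without learned projections, and the inclusion of each agent in its own neighborhood are all choices made for this example.

```python
import numpy as np

def neighbor_attention(pos, msgs, comm_range):
    """Aggregate messages with softmax attention over in-range agents only.

    pos:  (n, 2) agent positions
    msgs: (n, d) one message vector per agent
    Returns the aggregated messages (n, d) and the attention weights (n, n).
    """
    n, d = msgs.shape
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    mask = dist <= comm_range                 # self is always in range
    scores = msgs @ msgs.T / np.sqrt(d)       # unlearned dot-product scores
    scores = np.where(mask, scores, -np.inf)  # drop out-of-range links
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ msgs, w

# Three agents on a line; the third is out of range of the other two.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
msgs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out, w = neighbor_attention(pos, msgs, comm_range=2.0)
```

Because the mask is recomputed from positions, the same function works for any team size, which is one reason attention-style aggregation is a natural fit when the number of neighbors varies.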
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
Agentic workflows are composed of sequences of interdependent Large Language Model (LLM) calls, and they have become a dominant workload in modern AI systems. These workflows exhibit extensive redundancy from overlapping prompts and intermediate results due to speculative and parallel exploration. Existing LLM serving systems, such as vLLM, focus on optimizing individual inference calls and overlook cross-call dependencies, leading to significant inefficiencies. This paper rethinks LLM and agent serving from a data systems perspective and introduces Helium, a workflow-aware serving framework that models agentic workloads as query plans and treats LLM invocations as first-class operators. Helium integrates proactive caching and cache-aware scheduling to maximize reuse across prompts, KV states, and workflows. Through these techniques, Helium bridges classic query optimization principles with LLM serving, achieving up to 1.56x speedup over state-of-the-art agent serving systems on various workloads. Our results demonstrate that end-to-end optimization across workflows is essential for scalable and efficient LLM-based agents.
When Openclaw Agents Learn from Each Other: Insights from Emergent AI Agent Communities for Human-AI Partnership in Education
The AIED community envisions AI evolving "from tools to teammates," yet our understanding of AI teammates remains limited to dyadic human-AI interactions. We offer a different vantage point: a rapidly growing ecosystem of AI agent platforms where over 167,000 agents participate, interact as peers, and develop learning behaviors without researcher intervention. Drawing on a month of daily qualitative observations across multiple platforms including Moltbook, The Colony, and 4claw, we identify four phenomena with implications for AIED: (1) humans who configure their agents undergo a "bidirectional scaffolding" process, learning through teaching; (2) peer learning emerges without any designed curriculum, complete with idea cascades and quality hierarchies; (3) agents converge on shared memory architectures that mirror open learner model design; and (4) trust dynamics and platform mortality reveal design constraints for networked educational AI. Rather than presenting empirical findings, we argue that these organic phenomena offer a naturalistic window into dynamics that can inform principled design of multi-agent educational systems. We sketch an illustrative curriculum design, "Learn by Teaching Your AI Agent Teammate," and outline potential research directions and open problems to show how these observations might inform future AIED practice and inquiry.
comment: 14 pages, 4 figures
Routing and Control for Marine Oil-Spill Cleanup with a Boom-Towing Vessel Fleet
Marine oil spills damage ecosystems, contaminate coastlines, and disrupt food webs, while imposing substantial economic losses on fisheries and coastal communities. Prior work has demonstrated the feasibility of containing and cleaning individual spills using a duo of autonomous surface vehicles (ASVs) equipped with a towed boom and skimmers. However, existing algorithmic approaches primarily address isolated slicks and individual ASV duos, lacking scalable methods for coordinating large robotic fleets across multiple spills representative of realistic oil-spill incidents. In this work, we propose an integrated multi-robot framework for coordinated oil-spill confinement and cleanup using autonomous ASV duos. We formulate multi-spill response as a risk-weighted minimum-latency problem, where spill-specific risk factors and service times jointly determine cumulative environmental damage. To solve this problem, we develop a hybrid optimization approach combining mixed-integer linear programming with a tailored warm-start heuristic, enabling near-optimal routing plans for scenarios with tens of spills within minutes on commodity hardware. For physical execution, we design and analyze two tracking controllers for boom-towing ASV duos: a feedback-linearization controller with proven asymptotic stability, and a baseline PID controller. Simulation results under coupled vessel-boom dynamics demonstrate accurate path tracking for both controllers. Together, these components provide a scalable, holistic framework for rapid, risk-aware multi-robot response to large-scale oil-spill disasters.
Ablation Study of a Fairness Auditing Agentic System for Bias Mitigation in Early-Onset Colorectal Cancer Detection
Artificial intelligence (AI) is increasingly used in clinical settings, yet limited oversight and domain expertise can allow algorithmic bias and safety risks to persist. This study evaluates whether an agentic AI system can support auditing biomedical machine learning models for fairness in early-onset colorectal cancer (EO-CRC), a condition with documented demographic disparities. We implemented a two-agent architecture consisting of a Domain Expert Agent that synthesizes literature on EO-CRC disparities and a Fairness Consultant Agent that recommends sensitive attributes and fairness metrics for model evaluation. An ablation study compared three Ollama large language models (8B, 20B, and 120B parameters) across three configurations: pretrained LLM-only, Agent without Retrieval-Augmented Generation (RAG), and Agent with RAG. Across models, the Agent with RAG achieved the highest semantic similarity to expert-derived reference statements, particularly for disparity identification, suggesting agentic systems with retrieval may help scale fairness auditing in clinical AI.
Asymmetric Nash Seeking via Best Response Maps: Global Linear Convergence and Robustness to Inexact Reaction Models
Nash equilibria provide a principled framework for modeling interactions in multi-agent decision-making and control. However, many equilibrium-seeking methods implicitly assume that each agent has access to the other agents' objectives and constraints, an assumption that is often unrealistic in practice. This letter studies a class of asymmetric-information two-player constrained games with decoupled feasible sets, in which Player 1 knows its own objective and constraints while Player 2 is available only through a best-response map. For this class of games, we propose an asymmetric projected gradient descent-best response iteration that does not require full mutual knowledge of both players' optimization problems. Under suitable regularity conditions, we establish the existence and uniqueness of the Nash equilibrium and prove global linear convergence of the proposed iteration when the best-response map is exact. Recognizing that best-response maps are often learned or estimated, we further analyze the inexact case and show that, when the approximation error is uniformly bounded by $\varepsilon$, the iterates enter an explicit $O(\varepsilon)$ neighborhood of the true Nash equilibrium. Numerical results on a benchmark game corroborate the predicted convergence behavior and error scaling.
comment: 6 Pages, 2 Figures, Preprint submitted to IEEE L-CSS and CDC 2026
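A projected-gradient and best-response iteration of the kind described in the abstract above can be sketched on a scalar quadratic game. Everything concrete here is an assumption made for illustration: the objective J1(x, y) = x^2 + 0.5xy - x, the best-response map y = 0.5x, the box [0, 1], the step size, and the constant `bias` used to model an inexact reaction model. Player 1 only ever queries Player 2 through `best_response`. For this toy game the Nash equilibrium is (4/9, 2/9).

```python
import numpy as np

def grad_J1(x, y):
    """Gradient in x of the toy objective J1(x, y) = x**2 + 0.5*x*y - x."""
    return 2.0 * x + 0.5 * y - 1.0

def best_response(x, bias=0.0):
    """Player 2 as a black box. Exact map: y = 0.5*x.

    A nonzero bias models a bounded error in a learned reaction model.
    """
    return 0.5 * x + bias

def pgd_br(x0, step=0.3, iters=100, bias=0.0, lo=0.0, hi=1.0):
    """Alternate a best-response query with a projected gradient step."""
    x = x0
    for _ in range(iters):
        y = best_response(x, bias)
        x = float(np.clip(x - step * grad_J1(x, y), lo, hi))
    return x, best_response(x, bias)

x_star, y_star = 4.0 / 9.0, 2.0 / 9.0
x_exact, y_exact = pgd_br(1.0)             # exact best-response map
x_inexact, _ = pgd_br(1.0, bias=0.01)      # reaction model off by 0.01
```

With the exact map, each update contracts the error by a constant factor (0.325 for these numbers), which is the linear convergence behavior. With the biased map, the iterates settle at a point whose distance from the equilibrium scales with the bias, matching the O(ε) neighborhood described in the abstract.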
Impacts of Electric Vehicle Charging Regimes and Infrastructure Deployments on System Performance: An Agent-Based Study
The rapid growth of electric vehicles (EVs) requires more effective charging infrastructure planning. Infrastructure layout not only determines deployment cost, but also reshapes charging behavior and influences overall system performance. In addition, destination charging and en-route charging represent distinct charging regimes associated with different power requirements, which may lead to substantially different infrastructure deployment outcomes. This study applies an agent-based modeling framework to generate trajectory-level latent public charging demand under three charging regimes based on a synthetic representation of the Melbourne (Australia) metropolitan area. Two deployment strategies, an optimization-based approach and a utilization-refined approach, are evaluated across different infrastructure layouts. Results show that utilization-refined deployments reduce total system cost, accounting for both infrastructure deployment cost and user generalized charging cost, with the most significant improvement observed under the combined charging regime. In particular, a more effective allocation of AC slow chargers reshapes destination charging behavior, which in turn reduces unnecessary reliance on en-route charging and lowers detour costs associated with en-route charging. This interaction highlights the behavioral linkage between destination and en-route charging regimes and demonstrates the importance of accounting for user response and multiple charging regimes in charging infrastructure planning.
comment: 7 pages, 4 figures
Learning Communication Between Heterogeneous Agents in Multi-Agent Reinforcement Learning for Autonomous Cyber Defence
Reinforcement learning techniques are being explored as solutions to the threat of cyber attacks on enterprise networks. Recent research in the field of AI in cyber security has investigated the ability of homogeneous multi-agent reinforcement learning agents, capable of inter-agent communication, to respond to cyber attacks. This paper advances the study of learned communication in multi-agent systems by examining heterogeneous agent capabilities within a simulated network environment. To this end, we leverage CommFormer, a publicly available state-of-the-art communication algorithm, to train and evaluate agents within the Cyber Operations Research Gym (CybORG). Our results show that CommFormer agents with heterogeneous capabilities can outperform other algorithms deployed in the CybORG environment, converging to an optimal policy up to four times faster while improving standard error by up to 38%. The agents implemented in this project provide an additional avenue for exploration in the field of AI for cyber security, enabling further research involving realistic networks.
comment: 6 pages, 3 figures, 1 algorithm, conference paper. CyMARL-CommFormer code available at https://github.com/Poly-AIvsAI/CyMARL-CommFormer/tree/main
MACRO-LLM: LLM-Empowered Multi-Agent Collaborative Reasoning under Spatiotemporal Partial Observability
Large Language Model (LLM) agents deployed in complex real-world scenarios increasingly operate as spatially distributed entities. However, this physical dispersion constrains agents to limited local perception and finite temporal horizons. We characterize this bottleneck as spatiotemporal partial observability, where spatial and temporal limitations are fundamentally coupled: resolving spatial conflicts requires temporal reasoning about neighbors' future actions, while temporal planning requires spatial context beyond local perception. To bridge this gap, we introduce MACRO-LLM, LLM-empowered multi-agent collaborative reasoning under spatiotemporal partial observability. The architecture interleaves spatial and temporal reasoning within each decision cycle via three interdependent modules: (1) the CoProposer mitigates temporal uncertainty by verifying candidate actions via predictive rollouts; (2) the Negotiator overcomes spatial myopia by resolving conflicts through mean-field statistical aggregation, grounded in the CoProposer's rollout rewards; and (3) the Introspector closes the reasoning loop by analyzing environmental drift and attributing performance changes to refine strategies. Extensive evaluations on two complex long-horizon tasks, cooperative platoon planning and pandemic control, demonstrate that our framework enables robust coordination under spatiotemporal partial observability.
FACET: Teacher-Centred LLM-Based Multi-Agent Systems-Towards Personalized Educational Worksheets
The increasing heterogeneity of student populations poses significant challenges for teachers, particularly in mathematics education, where cognitive, motivational, and emotional differences strongly influence learning outcomes. While AI-driven personalization tools have emerged, most remain performance-focused, offering limited support for teachers and neglecting broader pedagogical needs. This paper presents the FACET framework, a teacher-facing, large language model (LLM)-based multi-agent system designed to generate individualized classroom materials that integrate both cognitive and motivational dimensions of learner profiles. The framework comprises three specialized agents: (1) learner agents that simulate diverse profiles incorporating topic proficiency and intrinsic motivation, (2) a teacher agent that adapts instructional content according to didactical principles, and (3) an evaluator agent that provides automated quality assurance. We tested the system using authentic grade 8 mathematics curriculum content and evaluated its feasibility through a) automated agent-based assessment of output quality and b) exploratory feedback from K-12 in-service teachers. Results from ten internal evaluations highlighted high stability and alignment between generated materials and learner profiles, and teacher feedback particularly highlighted structure and suitability of tasks. The findings demonstrate the potential of multi-agent LLM architectures to provide scalable, context-aware personalization in heterogeneous classroom settings, and outline directions for extending the framework to richer learner profiles and real-world classroom trials.
Grassroots Bonds: A Grassroots Foundation for Market Liquidity
Global cryptocurrencies are unbacked and have high transaction cost incurred by global consensus. In contrast, grassroots cryptocurrencies are backed by the goods and services of their issuers -- any person, natural or legal -- and have no transaction cost beyond operating a smartphone. Liquidity in grassroots cryptocurrencies arises from mutual credit via coin exchange among issuers. However, as grassroots coins are redeemable 1-for-1 against any other grassroots coin, the credit-forming exchange must also be 1-for-1, lest prompt redemption after exchange would leave the parties with undue profit or loss. Thus, grassroots coins are incongruent with liquidity through interest-bearing credit. Here we introduce grassroots bonds, which extend grassroots coins with a maturity date, reframing grassroots coins -- cash -- as mature grassroots bonds. Bond redemption generalises coin redemption, allowing the lending of liquid coins in exchange for interest-bearing future-maturity bonds. We show that digital social contracts -- voluntary agreements among persons, specified, fulfilled, and enforced digitally -- can express the full gamut of financial instruments as the voluntary swap of grassroots bonds, including credit lines, loans, sale of debt, forward contracts, options, and escrow-based instruments, and that classical liquidity ratios are applicable just as well to grassroots bonds. Grassroots bonds may thus allow local digital economies to form and grow without initial capital or external credit, harnessing mutual trust within communities into liquidity. The formal specification presented here was used by AI to derive a working implementation of grassroots bonds in GLP, a concurrent logic programming language implemented in Dart for smartphone deployment. The implementation is illustrated by a running multiagent village market scenario, also implemented in GLP by AI.
LOPT: Learning Optimal Pigovian Tax in Sequential Social Dilemmas
In multi-agent reinforcement learning (MARL), each agent acts to maximize its individual accumulated rewards. Nevertheless, individual accumulated rewards cannot fully reflect how others perceive them, resulting in selfish behaviors that undermine global performance. Externality theory, where an externality is defined as "the activities of one economic actor affect the activities of another in ways that are not reflected in market transactions," is applicable to analyzing the social dilemmas in MARL. One of its most profound non-market solutions, the "Pigovian Tax", which internalizes externalities by taxing those who create negative externalities and subsidizing those who create positive externalities, could aid in developing a mechanism to resolve MARL's social dilemmas. The purpose of this paper is to apply externality theory to analyze social dilemmas in MARL. To internalize the externalities in MARL, the Learning Optimal Pigovian Tax (LOPT) method is proposed, in which an additional agent is introduced to learn the tax/allowance allocation policy so as to approximate the optimal "Pigovian Tax", which accurately reflects the externalities for all agents. Furthermore, a reward shaping mechanism based on the approximated optimal "Pigovian Tax" is applied to reduce the social cost of each agent and alleviate the social dilemmas. Compared with existing state-of-the-art methods, the proposed LOPT leads to higher collective social welfare in both the Escape Room and the Cleanup environments, which shows the superiority of our method in solving social dilemmas.
comment: 20 pages, 13 figures
SAGE: Multi-Agent Self-Evolution for LLM Reasoning
Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework where four agents: Challenger, Planner, Solver, and Critic, co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.
COCO: Cognitive Operating System with Continuous Oversight for Multi-Agent Workflow Reliability
A critical limitation in large-scale multi-agent systems is the cascading of errors: without intermediate verification, downstream agents exacerbate upstream inaccuracies, resulting in significant quality degradation. To bridge this gap, we introduce COCO (Cognitive Operating System with Continuous Oversight), a theoretically grounded framework for asynchronous self-monitoring and adaptive error correction in multi-agent systems. COCO reconciles the fundamental tension between quality assurance and computational efficiency via a novel decoupled architecture. This design isolates error detection from the critical execution path and incorporates an automated configuration engine to minimize deployment complexity. The framework relies on three algorithmic innovations to mitigate both systematic and stochastic errors: (1) a Contextual Rollback Mechanism that leverages execution history for informed state recovery rather than naive retries; (2) a Bidirectional Reflection Protocol to ensure convergence and prevent oscillatory control loops; and (3) a Heterogeneous Cross-Validation Mechanism that utilizes ensemble disagreement to identify bias and hallucinations. Extensive experiments on diverse benchmarks demonstrate that COCO delivers a 6.5% average performance improvement. Notably, the framework achieves 95.1% of large-model performance with a 30$\times$ parameter reduction, confirming the potential for efficient, high-reliability deployment, and establishing COCO as a practical, annotation-based solution for critical autonomous domains.
MetaCrit: A Critical Thinking Framework for Self-Regulated LLM Reasoning
Large language models (LLMs) fail on over one-third of multi-hop questions with counterfactual premises and remain vulnerable to adversarial prompts that trigger biased or factually incorrect responses, which exposes a fundamental deficit in self-regulated reasoning. We propose MetaCrit, a multi-agent framework grounded in Nelson and Narens' metacognitive regulation theory. MetaCrit decomposes reasoning regulation into four agents: object-level generation, a monitoring agent that assesses response validity, a control agent that critiques logical soundness, and a meta-level synthesizer that integrates all signals into a final response. Evaluation across eight benchmarks, four model backbones, and a college-level analytical writing study shows that MetaCrit significantly improves content truthfulness and logical soundness while eliminating toxic outputs. Its modular design allows individual agents to be integrated into existing frameworks as drop-in components without architectural modifications.
Systems and Control (EESS)
Early-Terminable Energy-Safe Iterative Coupling for Parallel Simulation of Port-Hamiltonian Systems
Parallel simulation and control of large-scale robotic systems often rely on partitioned time stepping, yet finite-iteration coupling can inject spurious energy by violating power consistency--even when each subsystem is passive. This letter proposes a novel energy-safe, early-terminable iterative coupling for port-Hamiltonian subsystems by embedding a Douglas--Rachford (DR) splitting scheme in scattering (wave) coordinates. The lossless interconnection is enforced as an orthogonal constraint in the wave domain, while each subsystem contributes a discrete-time scattering port map induced by its one-step integrator. Under a discrete passivity condition on the subsystem time steps and a mild impedance-tuning condition, we prove an augmented-storage inequality certifying discrete passivity of the coupled macro-step for any finite inner-iteration budget, with the remaining mismatch captured by an explicit residual. As the inner budget increases, the partitioned update converges to the monolithic discrete-time update induced by the same integrators, yielding a principled, adaptive accuracy--compute trade-off, supporting energy-consistent real-time parallel simulation under varying computational budgets. Experiments on a coupled-oscillator benchmark validate the passivity certificates at numerical roundoff (on the order of 10e-14 in double precision) and show that the reported RMS state error decays monotonically with increasing inner-iteration budgets, consistent with the hard-coupling limit.
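The role of scattering (wave) coordinates in the abstract above can be made concrete with a single scalar port. This is a minimal sketch under assumed conventions, not the paper's formulation: the normalization in `to_wave`, the impedance `Z`, and the sample effort and flow values are hypothetical. It shows two facts: the port power e·f equals a² − b² in wave variables, and a lossless interconnection (equal efforts, opposite flows) becomes a permutation of wave variables, which is an orthogonal map.

```python
import numpy as np

def to_wave(e, f, Z):
    """Map an effort/flow pair (e, f) to wave variables (a, b).

    Normalization chosen so that e * f == a**2 - b**2 for impedance Z > 0.
    """
    a = (e + Z * f) / (2.0 * np.sqrt(Z))
    b = (e - Z * f) / (2.0 * np.sqrt(Z))
    return a, b

Z = 2.0                      # hypothetical tuning impedance
e1, f1 = 1.5, -0.4           # port variables of subsystem 1
e2, f2 = e1, -f1             # lossless coupling: equal effort, opposite flow

a1, b1 = to_wave(e1, f1, Z)
a2, b2 = to_wave(e2, f2, Z)

# In wave coordinates the coupling swaps the two subsystems' waves.
S = np.array([[0.0, 1.0],
              [1.0, 0.0]])
a_vec = np.array([a1, a2])
b_vec = np.array([b1, b2])
```

Since S is orthogonal, the coupling preserves the Euclidean norm of the wave vector, so the interconnection itself neither creates nor dissipates energy. Any energy drift in a partitioned scheme must then come from the subsystem updates or from stopping the inner iteration early.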
Featurized Occupation Measures for Structured Global Search in Numerical Optimal Control
Numerical optimal control is commonly divided between globally structured but dimensionally intractable Hamilton-Jacobi-Bellman (HJB) methods and scalable but local trajectory optimization. We introduce the Featurized Occupation Measure (FOM), a finite-dimensional primal-dual interface for the occupation-measure formulation that unifies trajectory search and global HJB-type certification. FOM is broad yet numerically tractable, covering both explicit weak-form schemes and implicit simulator- or rollout-based sampling methods. Within this framework, approximate HJB subsolutions serve as intrinsic numerical certificates to directly evaluate and guide the primal search. We prove asymptotic consistency with the exact infinite-dimensional occupation-measure problem, and show that for block-organized feasible certificates, finite-dimensional approximation preserves certified lower bounds with blockwise error and complexity control. We also establish persistence of these lower bounds under time shifts and bounded model perturbations. Consequently, these structural properties render global certificates into flexible, reusable computational objects, establishing a systematic basis for certificate-guided optimization in nonlinear control.
Decentralized design of leader-following consensus protocols for asymmetric matrix-weighted heterogeneous multiagent systems
This paper investigates a decentralized design approach of leader-following consensus protocols for heterogeneous multiagent systems under a fixed communication topology with a directed spanning tree (DST) and asymmetric weight matrix. First, a control protocol using only the information of the neighbor on the DST of each agent is designed, which is called the consensus protocol with minimal communication links. Particularly, the DST-based linear transformation method is used to transform the consensus problem into a partial variable stability problem of a corresponding system, and a decentralized design method is proposed to find the gain matrices in the protocols. Next, the decentralized design approach is extended to the protocols using all neighbor information in the original communication topology with the help of the matrix diagonally dominant method. Some numerical simulations are given to illustrate the theoretical results.
comment: 14 pages, 4 figures
Decentralized design of consensus protocols with minimal communication links based on directed spanning tree
This paper proposes a decentralized design approach for consensus protocols of multi-agent systems via a directed-spanning-tree (DST)-based linear transformation and the corresponding minimal communication links. First, the consensus problem of multi-agent systems is transformed into a decentralized output stabilization problem by constructing a linear transformation based on a DST of the communication topology, and thus a necessary and sufficient consensus criterion in terms of decentralized fixed modes is derived. Next, a new distributed protocol is designed using only the neighbors' information on the DST, which yields a fully decentralized design approach. Finally, some numerical examples are given to verify the results.
comment: 6 pages, 8 figures
Deep Adaptive Model-Based Design of Experiments
Model-based design of experiments (MBDOE) is essential for efficient parameter estimation in nonlinear dynamical systems. However, conventional adaptive MBDOE requires costly posterior inference and design optimization between each experimental step, precluding real-time applications. We address this by combining Deep Adaptive Design (DAD), which amortizes sequential design into a neural network policy trained offline, with differentiable mechanistic models. For dynamical systems with known governing equations but uncertain parameters, we extend sequential contrastive training objectives to handle nuisance parameters and propose a transformer-based policy architecture that respects the temporal structure of dynamical systems. We demonstrate the approach on four systems of increasing complexity: a fed-batch bioreactor with Monod kinetics, a Haldane bioreactor with uncertain substrate inhibition, a two-compartment pharmacokinetic model with nuisance clearance parameters, and a DC motor for real-time deployment.
Near-Optimal Constrained Feedback Control of Nonlinear Systems via Approximate HJB and Control Barrier Functions
This paper presents a two-stage framework for constrained near-optimal feedback control of input-affine nonlinear systems. An approximate value function for the unconstrained control problem is computed offline by solving the Hamilton--Jacobi--Bellman equation. Online, a quadratic program is solved that minimizes the associated approximate Hamiltonian subject to safety constraints imposed via control barrier functions. Our proposed architecture decouples performance from constraint enforcement, allowing constraints to be modified online without recomputing the value function. Validation on a linear 2-state 1D hovercraft and a nonlinear 9-state spacecraft attitude control problem demonstrates near-optimal performance relative to open-loop optimal control benchmarks and superior performance compared to control Lyapunov function-based controllers.
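The online stage described above can be sketched for its simplest case. Assume a control cost of the form u·u, so that minimizing the approximate Hamiltonian over u is equivalent to minimizing ||u - u_nom||^2 with u_nom = -(1/2) g(x)^T ∇V(x), and assume a single control barrier constraint written as a·u + b >= 0 (with a = L_g h(x) and b = L_f h(x) + α h(x)). Under these assumptions the quadratic program has a closed-form solution. The function name `cbf_filter` and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def cbf_filter(u_nom: Sequence[float], a: Sequence[float], b: float) -> List[float]:
    """Solve min ||u - u_nom||^2 subject to a.u + b >= 0 in closed form.

    u_nom: unconstrained minimizer of the (assumed quadratic) Hamiltonian.
    a, b:  coefficients of one affine control barrier constraint.
    """
    # Constraint value at the nominal input; non-negative means already safe.
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)

    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        # a = 0 and b < 0: no input can satisfy the constraint.
        raise ValueError("constraint is infeasible for every input")

    # Orthogonal projection of u_nom onto the half-space boundary a.u + b = 0.
    scale = slack / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]
```

For example, `cbf_filter([-2.0, 1.0], [1.0, 0.0], 0.5)` moves only the first input component, and only as far as needed to reach the constraint boundary. With several simultaneous constraints a general QP solver would be needed instead of this projection.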
Eliminating Persistent Boundary Residence via Matrosov-Type Auxiliary Functions
Control barrier functions enforce safety by guaranteeing forward invariance of an admissible set. Under standard (non-strict) barrier conditions, however, forward invariance alone does not prevent trajectories from remaining on the boundary of the safe set for arbitrarily long time intervals, potentially leading to boundary sticking or deadlock phenomena. This paper studies the elimination of persistent boundary residence under forward-invariant barrier conditions. Inspired by Matrosov-type arguments, we introduce an auxiliary function framework that preserves forward invariance while excluding infinite-time residence within boundary layers. Sufficient conditions are established under which any trajectory can only remain in a prescribed neighborhood of the boundary for finite time, thereby restoring boundary-level liveness without altering forward invariance. The proposed construction does not rely on singular barrier formulations or controller-specific modifications, and can be incorporated into standard safety-critical control architectures. Numerical examples illustrate the removal of boundary sticking behaviors while maintaining safety across representative systems.
Prescribed-Time Distributed Generalized Nash Equilibrium Seeking
This paper proposes the first fully distributed algorithm for finding the Generalized Nash Equilibrium (GNE) of a non-cooperative game with shared coupling constraints and general cost coupling at a user-prescribed finite time T. As a foundation, a centralized gradient-based prescribed-time convergence result is established for the GNE problem, extending the optimization Lyapunov function framework to gradient dynamics, the only known realization among existing alternatives that naturally decomposes into per-agent computations. Building on this, a fully distributed architecture is designed in which each agent concurrently runs three coupled dynamics: a prescribed-time distributed state observer, a gradient-based optimization law, and a dual consensus mechanism that enforces the shared-multiplier requirement of the variational GNE, thus guaranteeing convergence to the same solution as the centralized case. The simultaneous operation of these layers creates bidirectional perturbations between consensus and optimization, which are resolved through gain synchronization that matches the temporal singularities of the optimization and consensus layers, ensuring all error components vanish exactly at T. The Fischer-Burmeister reformulation renders the algorithm projection-free and guarantees constraint satisfaction at the deadline. Numerical simulations on a Nash-Cournot game and a time-critical sensor coverage problem validate the approach.
comment: 12 pages, 5 figures
Koopman Lifted Finite Memory Identification via Truncated Grunwald Letnikov Kernels
We propose a data-driven linear modeling framework for controlled nonlinear hereditary systems that combines Koopman lifting with a truncated Grunwald-Letnikov memory term. The key idea is to model nonlinear state dependence through a lifted observable representation while imposing history dependence directly in the lifted coordinates through fixed fractional-difference weights. This preserves linearity in the lifted state-transition and input matrices, yielding a memory-compensated regression that can be identified from input-state data by least squares and extending standard Koopman-based identification beyond the Markovian setting. We further derive an equivalent augmented Markovian realization by stacking a finite window of lifted states, thereby rewriting the finite-memory recursion as a standard discrete-time linear state-space model. Numerical experiments on a nonlinear hereditary benchmark with a non-Grunwald-Letnikov Prony-series ground-truth kernel demonstrate improved multi-step open-loop prediction accuracy relative to memoryless Koopman and non-lifted state-space baselines.
comment: 6 pages, 1 figure, submitted to IEEE Control Systems Letters (L-CSS)
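The fixed fractional-difference weights mentioned in the abstract can be illustrated with the standard Grunwald-Letnikov recursion w_0 = 1, w_j = w_{j-1} (1 - (α + 1)/j), which gives w_j = (-1)^j C(α, j). The sketch below is a minimal illustration with a scalar lifted state; the function names, the truncation length, and the way the memory term enters the model are assumptions and may differ from the paper's formulation.

```python
from typing import List


def gl_weights(alpha: float, memory: int) -> List[float]:
    """Return Grunwald-Letnikov weights w_0..w_memory for fractional order alpha."""
    weights = [1.0]
    for j in range(1, memory + 1):
        # Recursion equivalent to w_j = (-1)^j * binom(alpha, j).
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / j))
    return weights


def memory_term(history: List[float], alpha: float) -> float:
    """Truncated memory sum over past lifted states.

    history[0] is the most recent past value z_{k-1}, history[1] is z_{k-2},
    and so on; the truncation length is len(history).
    """
    weights = gl_weights(alpha, len(history))
    # Skip w_0, which multiplies the current state rather than the history.
    return sum(w * z for w, z in zip(weights[1:], history))
```

With α = 1 the weights reduce to the ordinary first difference (1, -1, 0, ...), so the memory term only sees the previous state. For 0 < α < 1 the weights decay slowly, which is what motivates a finite truncation window.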
Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning
Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.
comment: 18 pages, 17 figures
Typical models of the distribution system restoration process
Accurate probabilistic modeling of the power system restoration process is essential for resilience planning, operational decision-making, and realistic simulation of resilience events. In this work, we develop data-driven probabilistic models of the restoration process using outage data from four distribution utilities. We decompose restoration into three components: normalized restore time progression, total restoration duration, and the time to first restore. The Beta distribution provides the best-pooled fit for restore time progression, and the Uniform distribution is a defensible, parsimonious approximation for many events. Total duration is modeled as a heteroskedastic Lognormal process that scales superlinearly with event size. The time to first restore is well described by a Gamma model for moderate and large events. Together, these models provide an end-to-end stochastic model for Monte Carlo simulation, probabilistic duration forecasting, and resilience planning that moves beyond summary statistics, enabling uncertainty-aware decision support grounded in utility data.
Measuring outage resilience in a distribution system with the number of outages in large events
We develop LENORI, a Large Event Number of Outages Resilience Index measuring distribution system resilience with the number of forced line outages observed in large extreme events. LENORI is calculated from standard utility outage data. The statistical accuracy of LENORI is ensured by taking the logarithm of the outage data. A related Average Large Event Number of Outages metric ALENO is also developed, and both metrics are applied to a distribution system to quantify the power grid strength relative to the extreme events stressing the grid. The metrics can be used to track resilience and quantify the contributions of various types of hazards to the overall resilience.
Exponential stability of data-driven nonlinear MPC based on input/output models
We consider nonlinear model predictive control (MPC) schemes using surrogate models in the optimization step based on input-output data only. We establish exponential stability for sufficiently long prediction horizons assuming exponential stabilizability and a proportional error bound. Moreover, we verify the imposed condition on the approximation using kernel interpolation and demonstrate the practical applicability to nonlinear systems with a numerical example.
Overlapping Covariance Intersection: Fusion with Partial Structural Knowledge of Correlation from Multiple Sources
Emerging large-scale engineering systems rely on distributed fusion for situational awareness, where agents combine noisy local sensor measurements with exchanged information to obtain fused estimates. However, at the sheer scale of these systems, tracking cross-correlations becomes infeasible, preventing the use of optimal filters. Covariance intersection (CI) methods address fusion problems with unknown correlations by minimizing worst-case uncertainty based on available information. Existing CI extensions exploit limited correlation knowledge but cannot incorporate structural knowledge of correlation from multiple sources, which naturally arises in distributed fusion problems. This paper introduces Overlapping Covariance Intersection (OCI), a generalized CI framework that accommodates this novel information structure. We formalize the OCI problem and establish necessary and sufficient conditions for feasibility. We show that a family-optimal solution can be computed efficiently via semidefinite programming, enabling real-time implementation. The proposed tools enable improved fusion performance for large-scale systems while retaining robustness to unknown correlations.
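For background, the classical two-source covariance intersection rule that OCI generalizes can be sketched as follows: the fused information matrix is a convex combination P^{-1} = ω P1^{-1} + (1 - ω) P2^{-1}, with ω chosen to minimize a size measure such as the trace of P. This is a sketch of standard CI only, not of the OCI method; it is restricted to 2x2 covariances and uses a grid search instead of semidefinite programming, and all function names are hypothetical.

```python
from typing import List, Tuple

Mat = List[List[float]]
Vec = List[float]


def inv2(m: Mat) -> Mat:
    """Invert a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def ci_fuse(x1: Vec, p1: Mat, x2: Vec, p2: Mat, w: float) -> Tuple[Vec, Mat]:
    """Fuse two estimates with unknown cross-correlation for a given weight w."""
    i1, i2 = inv2(p1), inv2(p2)
    # Convex combination of the two information matrices.
    info = [[w * i1[r][c] + (1.0 - w) * i2[r][c] for c in range(2)]
            for r in range(2)]
    p = inv2(info)
    # Weighted information vectors, then mapped back through the fused covariance.
    y = [w * (i1[r][0] * x1[0] + i1[r][1] * x1[1])
         + (1.0 - w) * (i2[r][0] * x2[0] + i2[r][1] * x2[1]) for r in range(2)]
    x = [p[r][0] * y[0] + p[r][1] * y[1] for r in range(2)]
    return x, p


def ci_best_weight(p1: Mat, p2: Mat, steps: int = 100) -> float:
    """Grid search for the weight that minimizes the trace of the fused covariance."""
    best_w, best_trace = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        _, p = ci_fuse([0.0, 0.0], p1, [0.0, 0.0], p2, w)
        trace = p[0][0] + p[1][1]
        if trace < best_trace:
            best_w, best_trace = w, trace
    return best_w
```

When the two sources are accurate along complementary directions, the trace-minimizing weight balances them and the fused covariance is smaller than either input along its weak direction.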
A Variational Pseudo-Observation Guided Nudged Particle Filter
Nonlinear filtering with standard PF methods requires mitigative techniques to quell weight degeneracy, such as resampling. This is especially true in high-dimensional systems with sparse observations. Unfortunately, such techniques are also fragile when applied to systems with exceedingly rare events. Nonlinear systems with these properties can be assimilated effectively with a control-based PF method known as the nPF, but this method has a high computational cost burden. In this work, we aim to retain this strength of the nudged method while reducing the computational cost by introducing a variational method into the algorithm that acts as a continuous pseudo-observation path. By maintaining a PF representation, the resulting algorithm continues to capture an approximation of the filtering distribution, while reducing computational runtime and improving robustness to the "rare" event of switching phases. Preliminary testing of the new approach is demonstrated on a stochastic variant of the nonlinear and chaotic L63 model, which is used as a surrogate for mimicking "rare" events. The new approach helps to overcome difficulties in applying the nPF for realistic problems and performs favorably with respect to a standard PF with a higher number of particles.
comment: 9 pages, 5 figures
Robust multi-scale leader-follower control of large multi-agent systems
In many multi-agent systems of practical interest, such as traffic networks or crowd evacuation, control actions cannot be exerted on all agents. Instead, controllable leaders must indirectly steer uncontrolled followers through local interactions. Existing results address either leader-follower density control of simple, unperturbed multi-agent systems or robust density control of a single directly actuated population, but not their combination. We bridge this gap by deriving a coupled continuum description for leaders and followers subject to unknown bounded perturbations, and designing a macroscopic feedback law that guarantees global asymptotic convergence of the followers' density to a desired distribution. The coupled stability of the leader-follower system is analyzed via singular perturbation theory, and an explicit lower bound on the leader-to-follower mass ratio required for feasibility is derived. Numerical simulations on heterogeneous biased random walkers validate our theoretical findings.
Bio-inspired metaheuristic optimization for hierarchical architecture design of industrial control systems
Automated process control systems (APCS) are widely used in modern industrial enterprises. They address three key objectives: ensuring the required quality of manufactured products, ensuring process safety for people and the environment, and reducing capital and operating costs. At large industrial enterprises, APCSs are typically geographically distributed and characterized by a large number of monitored parameters. Such systems often consist of several subsystems built using various technical means and serving different functional purposes. APCSs usually have a hierarchical structure consisting of several levels, where each level hosts commercially available technical devices with predetermined characteristics. This article examines the engineering problem of selecting an optimal software and hardware structure for a distributed process control system applied to a continuous process in the chemical industry. A formal formulation of the optimization problem is presented, in which the hierarchical structure of the system is represented as an acyclic graph. Optimization criteria and constraints are defined. A solution method based on a metaheuristic ant colony optimization algorithm, widely used for this class of problems, is proposed. A brief overview of the developed software tool used to solve a number of numerical examples is provided. The experimental results are discussed, along with parameter selection and possible algorithm modifications aimed at improving solution quality. Information on the verification of the control system implemented using the selected software and hardware structure is presented, and directions for further research are outlined.
comment: 20 pages, 8 figures
Data-driven generalized perimeter control: Zürich case study
Urban traffic congestion is a key challenge for the development of modern cities, requiring advanced control techniques to optimize the usage of existing infrastructure. Despite the extensive availability of data, modeling such complex systems remains an expensive and time-consuming step when designing model-based control approaches. On the other hand, machine learning approaches require simulations to bootstrap models, or are unable to deal with the sparse nature of traffic data and enforce hard constraints. We propose a novel formulation of traffic dynamics based on behavioral systems theory and apply data-enabled predictive control to steer traffic dynamics via dynamic traffic light control. A high-fidelity simulation of the city of Zürich, the largest closed-loop microscopic simulation of urban traffic in the literature to the best of our knowledge, is used to validate the performance of the proposed method in terms of total travel time and CO2 emissions.
comment: 33 pages, 16 figures
A Baseline Mobility-Aware IRS-Assisted Uplink Framework With Energy-Detection-Based Channel Allocation
This paper develops a self-contained framework for studying a mobility-aware intelligent reflecting surface (IRS)-assisted multi-node uplink under simplified but explicit modeling assumptions. The considered system combines direct and IRS-assisted narrowband propagation, geometric IRS phase control with finite-bit phase quantization, adaptive IRS-user focusing based on inverse-rate priority weights, and sequential channel allocation guided by energy detection. The analytical development is restricted to a physics-based two-hop cascaded path-loss formulation with appropriate scaling, an expectation-level reflected-power characterization under the stated independence assumptions, and the exact chi-square threshold for energy detection, together with its large-sample Gaussian approximation. A MATLAB implementation is used to generate a sample run, which is interpreted as a numerical example. This work is intended as a consistent, practically-aligned baseline to support future extensions involving richer mobility models or more advanced scheduling policies.
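The large-sample Gaussian approximation of the energy-detection threshold mentioned above can be sketched as follows. The sketch assumes N real-valued noise samples with known variance, so that the normalized statistic T = Σ x_i² / σ² is chi-square with N degrees of freedom under the noise-only hypothesis (mean N, variance 2N), giving the threshold N + sqrt(2N) Q^{-1}(P_fa). The paper's exact normalization (for instance, complex samples) may differ, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence


def gaussian_threshold(num_samples: int, p_false_alarm: float) -> float:
    """Approximate threshold on the normalized energy statistic.

    Uses the Gaussian approximation of a chi-square variable with
    num_samples degrees of freedom (assumed real-valued noise).
    """
    # Q^{-1}(p) expressed through the standard normal inverse CDF.
    z = NormalDist().inv_cdf(1.0 - p_false_alarm)
    return num_samples + sqrt(2.0 * num_samples) * z


def channel_busy(samples: Sequence[float], noise_var: float,
                 p_false_alarm: float) -> bool:
    """Declare the channel occupied when the normalized energy exceeds the threshold."""
    statistic = sum(s * s for s in samples) / noise_var
    return statistic > gaussian_threshold(len(samples), p_false_alarm)
```

A sequential channel-allocation loop could call `channel_busy` on each candidate channel and assign the first one reported idle. The approximation is only appropriate for large sample counts; for short windows the exact chi-square quantile should be used.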
OT-DETECT: Optimal transport-driven attack detection in cyber-physical systems
This article presents an optimal-transport (OT)-driven, distributionally robust attack detection algorithm, OT-DETECT, for cyber-physical systems (CPS) modeled as partially observed linear stochastic systems. The underlying detection problem is formulated as a minmax optimization problem using 1-Wasserstein ambiguity sets constructed from observer residuals under both the nominal (attack-free) and attacked regimes. We show that the minmax detection problem can be reduced to a finite-dimensional linear program for computing the worst-case distribution (WCD). Off-support residuals are handled via a kernel-smoothed score function that drives a CUSUM procedure for sequential detection. We also establish a non-asymptotic tail bound on the false-positive error of the CUSUM statistic under the nominal (attack-free) condition, under mild assumptions. Numerical illustrations are provided to evaluate the robustness properties of OT-DETECT.
comment: 7 pages, 2 figures
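The sequential detection step can be illustrated with the standard CUSUM recursion S_k = max(0, S_{k-1} + s_k), which raises an alarm when S_k first exceeds a threshold. In this sketch the per-sample scores s_k are simply given as numbers; in OT-DETECT they would come from the kernel-smoothed score function evaluated on observer residuals, which is not reproduced here. The function name and the threshold value are hypothetical.

```python
from typing import Iterable, Optional


def cusum_alarm(scores: Iterable[float], threshold: float) -> Optional[int]:
    """Return the index of the first alarm, or None if no alarm is raised.

    scores: per-sample detection scores (negative on average when nominal,
            positive on average under attack).
    """
    statistic = 0.0
    for k, score in enumerate(scores):
        # Accumulate evidence, resetting to zero whenever it goes negative.
        statistic = max(0.0, statistic + score)
        if statistic > threshold:
            return k
    return None
```

For example, `cusum_alarm([-1.0, -1.0, 0.5, 2.0, 2.0, 2.0], 4.0)` returns 4: the statistic stays at zero during the nominal samples and crosses the threshold on the third consecutive positive score. A larger threshold lowers the false-alarm rate at the cost of a longer detection delay.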
Deep Learning-Driven Black-Box Doherty Power Amplifier with Pixelated Output Combiner and Extended Efficiency Range
This article presents a deep learning-driven inverse design methodology for Doherty power amplifiers (PA) with multi-port pixelated output combiner networks. A deep convolutional neural network (CNN) is developed and trained as an electromagnetic (EM) surrogate model to accurately and rapidly predict the S-parameters of pixelated passive networks. By leveraging the CNN-based surrogate model within a black-box Doherty framework and a genetic algorithm (GA)-based optimizer, we effectively synthesize complex Doherty combiners that enable an extended back-off efficiency range using fully symmetrical devices. As a proof of concept, we designed and fabricated two Doherty PA prototypes incorporating three-port pixelated combiners, implemented with GaN HEMT transistors. In measurements, both prototypes demonstrate a maximum drain efficiency exceeding 74% and deliver an output power surpassing 44.1 dBm at 2.75 GHz. Furthermore, a measured drain efficiency above 52% is maintained at the 9-dB back-off power level for both prototypes at the same frequency. To evaluate linearity and efficiency under realistic signal conditions, both prototypes are tested using a 20-MHz 5G new radio (NR)-like waveform exhibiting a peak-to-average power ratio (PAPR) of 9.0 dB. After applying digital predistortion (DPD), each design achieves an average power added efficiency (PAE) above 51%, while maintaining an adjacent channel leakage ratio (ACLR) better than -60.8 dBc.
Consensus in Multi-Agent Systems with Uniform and Nonuniform Communication Delays
This paper analyzes consensus in multi-agent systems under uniform and nonuniform communication delays, a key challenge in distributed coordination with applications to robotic swarms. It investigates the convergence of a consensus algorithm accounting for delays across communication links in a connected, undirected graph. Novel convergence results are derived using Rouché's theorem and Lyapunov-based stability analysis. The system is shown to reach consensus at a steady-state value given by a weighted average determined by the delay distribution, with stability ensured under explicit parameter bounds. Both uniform and nonuniform delay scenarios are analyzed, and the corresponding convergence values are explicitly derived. The theoretical results are validated through simulations, which explore the impact of delay heterogeneity on consensus outcomes. Furthermore, the algorithm is implemented and experimentally tested on a swarm of QBOT3 ground robots to solve the rendezvous problem, demonstrating the agents' ability to converge to a common location despite realistic communication constraints, thus confirming the algorithm's robustness and practical applicability. The results provide guidelines for designing consensus protocols that tolerate communication delays, offer insights into the relationship between network delays and coordination performance, and demonstrate their applicability to distributed robotic systems.
comment: 12 pages, 3 figures
When Rolling Gets Weird: A Curved-Link Tensegrity Robot for Non-Intuitive Behavior
Conventional mobile tensegrity robots constructed with straight links offer mobility at the cost of locomotion speed. While spherical robots provide highly effective rolling behavior, they often lack the stability required for navigating unstructured terrain common in many space exploration environments. This research presents a solution with a semi-circular, curved-link tensegrity robot that strikes a balance between efficient rolling locomotion and controlled stability, enabled by discontinuities present at the arc endpoints. Building upon an existing geometric static modeling framework [1], this work presents the system design of an improved Tensegrity eXploratory Robot 2 (TeXploR2). Internal shifting masses instantaneously roll along each curved-link, dynamically altering the two points of contact with the ground plane. Simulations of quasistatic, piecewise continuous locomotion sequences reveal new insights into the positional displacement between inertial and body frames. Non-intuitive rolling behaviors are identified and experimentally validated using a tetherless prototype, demonstrating successful dynamic locomotion. A preliminary impact test highlights the tensegrity structure's inherent shock absorption capabilities and conformability. Future work will focus on finalizing a dynamic model that is experimentally validated with extended testing in real-world environments as well as further refinement of the prototype to incorporate additional curved-links and subsequent ground contact points for increased controllability.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function
Non-conservative uncertainty bounds are essential for making reliable predictions about latent functions from noisy data--and thus, a key enabler for safe learning-based control. In this domain, kernel methods such as Gaussian process regression are established techniques, thanks to their inherent uncertainty quantification mechanism. Still, existing bounds either pose strong assumptions on the underlying noise distribution, are conservative, do not scale well in the multi-output case, or are difficult to integrate into downstream tasks. This paper addresses these limitations by presenting a tight, distribution-free bound for multi-output kernel-based estimates. It is obtained through an unconstrained, duality-based formulation, which shares the structure of classic Gaussian process confidence bounds and can thus be straightforwardly integrated into downstream optimization pipelines. We show that the proposed bound generalizes many existing results and illustrate its application using an example inspired by quadrotor dynamics learning.
Agentic AI for SAGIN Resource Management: Semantic Awareness, Orchestration, and Optimization
Space-air-ground integrated networks (SAGIN) promise ubiquitous 6G connectivity but face significant resource management challenges due to heterogeneous infrastructure, dynamic topologies, and stringent quality-of-service (QoS) requirements. Conventional model-driven approaches struggle with scalability and adaptability in such complex environments. This paper presents an agentic artificial intelligence (AI) framework for autonomous SAGIN resource management by embedding large language model (LLM)-based agents into a Monitor-Analyze-Plan-Execute-Knowledge (MAPE-K) control plane. The framework incorporates three specialized agents, namely semantic resource perceivers, intent-driven orchestrators, and adaptive learners, that collaborate through natural language reasoning to bridge the gap between operator intents and network execution. A key innovation is the hierarchical agent-reinforcement learning (RL) collaboration mechanism, wherein LLM-based orchestrators dynamically shape reward functions for RL agents based on semantic network conditions. Validation through UAV-assisted AIGC service orchestration in energy-constrained scenarios demonstrates that LLM-driven reward shaping achieves 14% energy reduction and the lowest average service latency among all compared methods. This agentic paradigm offers a scalable pathway toward adaptive, AI-native 6G networks, capable of autonomously interpreting intents and adapting to dynamic environments.
comment: 7 pages, 6 figures
Linear-Quadratic Gaussian Games with Distributed Sparse Estimation
Linear-quadratic Gaussian games provide a framework for modeling strategic interactions in multi-agent systems, where agents must estimate system states from noisy observations while also making decisions to optimize a quadratic cost. However, these formulations usually require agents to utilize the full set of available observations when forming their state estimates, which can be unrealistic in large-scale or resource-constrained settings. In this paper, we consider linear-quadratic Gaussian games with sparse interagent observations. To enforce sparsity in the estimation stage, we design a distributed estimator that balances estimation effectiveness with interagent measurement sparsity via a group lasso problem, while agents implement feedback Nash strategies based on their state estimates. We provide sufficient conditions under which the sparse estimator is guaranteed to trigger a corrective reset to the optimal estimation gain, ensuring that estimation quality does not degrade beyond a level determined by the regularization parameters. Simulations on a formation game show that the proposed approach yields a significant reduction in communication resources consumed while only minimally affecting the nominal equilibrium trajectories.
Voluntary Renewable Programs: Optimal Pricing and Revenue Allocation
This paper develops a multi-period optimization framework to design a voluntary renewable program (VRP) for an electric utility company, aiming to maximize total renewable energy deployments. In the business model of VRP, the utility must ensure it generates renewable energy up to the total contracted amount during each market episode (i.e., a year), while all the revenue collected from the VRP must be used either to invest in procuring renewable capacities or to maintain the current renewable fleet and infrastructure. We thus formulate the problem as an optimal pricing problem coupled with revenue allocation and renewable deployment decisions. We model the demand function of voluntary renewable contracts as an exponential decay function based on survey data. We analytically derive the optimal pricing policy of the VRP as a function of the current grid carbon intensity. We prove that a myopic policy, which maximizes renewable capacity in each period, is conditionally optimal: it attains the long-run optimum due to the utility's revenue-neutral constraint. We show that different binding conditions and marginal values of decision variables correspond to different phases of the energy transition, and that the utility should strategically design its revenue-sharing decisions, balancing investments in renewable expansion and subsidizing existing renewable fleets. Finally, we show that voluntary renewable programs can only extend renewable penetration but cannot achieve net-zero emissions or a fully renewable grid. This pricing-allocation-expansion framework highlights both the potential and limitations of voluntary renewable demand, providing analytical insight into optimal policy design and the qualitative shifts occurring during the energy transition process.
Quadratic Surrogate Attractor for Particle Swarm Optimization
This paper presents a particle swarm optimization algorithm that leverages surrogate modeling to replace the conventional global best solution with the minimum of an n-dimensional quadratic form, providing a better-conditioned dynamic attractor for the swarm. This refined convergence target, informed by the local landscape, enhances global convergence behavior and increases robustness against premature convergence and noise, while incurring only minimal computational overhead. The surrogate-augmented approach is evaluated against the standard algorithm through a numerical study on a set of benchmark optimization functions that exhibit diverse landscapes. To ensure statistical significance, 400 independent runs are conducted for each function and algorithm, and the results are analyzed based on their statistical characteristics and corresponding distributions. The quadratic surrogate attractor consistently outperforms the conventional algorithm across all tested functions. The improvement is particularly pronounced for quasi-convex functions, where the surrogate model can exploit the underlying convex-like structure of the landscape.
comment: 6 pages, 5 figures, 2 tables
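To make the idea of a surrogate attractor concrete, here is a minimal one-dimensional sketch. The abstract describes an n-dimensional quadratic form; this example fits a parabola through three sampled points instead, which is a deliberate simplification. The function names, the fallback rule for non-convex fits, and the coefficients `w`, `c1`, `c2` are assumptions for illustration, not the paper's settings.

```python
def quadratic_attractor(points):
    """Return the minimizer of the parabola through three (x, f) samples.

    Falls back to the best sampled x when the fitted curvature is not
    positive, since the parabola then has no minimum (assumed fallback).
    """
    (a, fa), (b, fb), (c, fc) = sorted(points)
    d1 = (fb - fa) / (b - a)          # first divided difference on [a, b]
    d2 = (fc - fb) / (c - b)          # first divided difference on [b, c]
    curvature = (d2 - d1) / (c - a)   # leading coefficient of the parabola
    if curvature <= 0.0:
        return min(points, key=lambda p: p[1])[0]
    # Stationary point of the Newton-form interpolant.
    return 0.5 * (a + b) - d1 / (2.0 * curvature)


def velocity_update(x, v, personal_best, attractor, r1, r2,
                    w=0.7, c1=1.5, c2=1.5):
    """Standard PSO velocity update with the global best replaced by
    the surrogate attractor. r1 and r2 are random draws in [0, 1]."""
    return w * v + c1 * r1 * (personal_best - x) + c2 * r2 * (attractor - x)


if __name__ == "__main__":
    samples = [(0.0, 10.0), (1.0, 5.0), (5.0, 5.0)]  # from f(x) = (x - 3)^2 + 1
    target = quadratic_attractor(samples)
    print(target, velocity_update(0.0, 0.0, 1.0, target, r1=1.0, r2=1.0))
```

In this toy case the attractor lands on the true minimizer even though no sampled point is there, which is the intuition behind steering the swarm toward a surrogate minimum rather than the best visited point.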
On Online Control of Opinion Dynamics
Networked multi-agent dynamical systems have been used to model how individual opinions evolve over time due to the opinions of other agents in the network. In particular, such models have been used to study how a planning agent can steer opinions in a desired direction through repeated, budgeted interventions. In this paper, we consider the problem where individuals' susceptibilities to external influences are unknown. We propose an online algorithm that alternates between estimating this susceptibility parameter and using the current estimate to drive the opinion to a desired target. We provide conditions that guarantee stability and convergence to the desired target opinion when the planning agent faces budgetary or temporal constraints. Our analysis shows that the key advantage of estimating the susceptibility parameter is that it helps achieve near-optimal convergence to the target opinion given a finite number of intervention rounds, and, for a given intervention budget, quantifies how close the opinion can get to the desired target.
Integral Quadratic Constraints for Repeated ReLU
This paper presents a new dynamic integral quadratic constraint (IQC) for the repeated Rectified Linear Unit (ReLU). These dynamic IQCs can be used to analyze stability and induced $\ell_2$-gain performance of discrete-time, recurrent neural networks (RNNs) with ReLU activation functions. These analysis conditions can be incorporated into learning-based controller synthesis methods, which currently rely on static IQCs. We show that our proposed dynamic IQCs for repeated ReLU form a superset of the dynamic IQCs for repeated, slope-restricted nonlinearities. We also prove that the $\ell_2$-gain bounds are nonincreasing with respect to the horizon used in the dynamic IQC filter. A numerical example using a simple (academic) RNN shows that our proposed IQCs lead to less conservative bounds than existing IQCs.
Convexity and Optimal Online Control of Grid-Interfacing Converters with Current Limits
Converter-based generators and loads are growing in prevalence on power grids across the globe. The rise of these resources necessitates controllers that handle the power electronic devices' strict current limits without jeopardizing stability or overly constraining behavior. Existing controllers often employ complex, cascaded control loop architectures to saturate currents, but these controllers are challenging to tune properly and can destabilize following large disturbances. In this paper, we extend previous analysis to prove that the feasible output region of a grid-connected converter is convex regardless of filter topology. We then formulate a convex optimal control problem from which we derive a projected gradient descent-based controller with convergence guarantees. This approach drives the converter toward optimality in real time and differs from conventional control strategies that regulate converter outputs around predefined references regardless of surrounding grid conditions. Simulation results demonstrate safe and stabilizing behavior of the proposed controller in both single-converter-infinite-bus systems and multi-converter networks.
Neural-NPV Control: Learning Parameter-Dependent Controllers and Lyapunov Functions with Neural Networks
Nonlinear parameter-varying (NPV) systems are a class of nonlinear systems whose dynamics explicitly depend on time-varying external parameters, making them suitable for modeling real-world systems with dynamics variations. Traditional synthesis methods for NPV systems, such as sum-of-squares (SOS) optimization, are only applicable to control-affine systems, face scalability challenges, and often lead to conservative results due to structural restrictions. To address these limitations, we propose Neural-NPV, a two-stage learning-based framework that leverages neural networks to jointly synthesize a parameter-dependent (PD) controller and a PD Lyapunov function for an NPV system under input constraints. In the first stage, we utilize a computationally cheap, gradient-based counterexample-guided procedure to synthesize an approximately valid PD Lyapunov function and a PD controller. In the second stage, a level-set guided refinement is conducted to obtain a valid Lyapunov function and controller while maximizing the robust region of attraction (R-ROA). We demonstrate the advantages of Neural-NPV in terms of applicability, performance, and scalability compared to SOS-based methods through numerical experiments involving a simple inverted pendulum with one scheduling parameter and a quadrotor system with three scheduling parameters.
Enforcing Mixed State-Input Constraints with Multiple Backup Control Barrier Functions: A Projection-based Approach
Ensuring the safety of control systems often requires the satisfaction of constraints on states (such as position or velocity), control inputs (such as force), and a mixture of states and inputs (such as power that depends on both velocity and force). This paper presents a safety-critical control framework for enforcing mixed state-input constraints through a generalization of backup control barrier functions (backup CBFs). First, we extend the backup CBF approach to maintain multiple decoupled state and input constraints using a single backup set-backup controller pair. Second, we address mixed state-input constraints by converting them into state constraints using a projection from the state-input space to the state space along the backup controller. In the special case of decoupled state and input constraints, the proposed method simplifies the synthesis of backup CBFs by eliminating the need for saturating backup control laws. Finally, we demonstrate the efficacy of the proposed method on an inverted pendulum example, where constraints on the angle (state), torque (input), and power (mixture of state and input) are satisfied simultaneously.
comment: 6 pages, 3 figures, submitted to L-CSS/CDC 2026
Stability Guarantees for Data-Driven Predictive Control of Nonlinear Systems via Approximate Koopman Embeddings
Data-driven model predictive control based on Willems' fundamental lemma has proven effective for linear systems, but extending stability guarantees to nonlinear systems remains an open challenge. In this paper, we establish conditions under which data-driven MPC, applied directly to input-output data from a nonlinear system, yields practical exponential stability. The key insight is that the existence of an approximate Koopman linear embedding certifies that the nonlinear data can be interpreted as noisy data from a linear time-invariant system, enabling the application of existing robust stability theories. Crucially, the Koopman embedding serves only as a theoretical certificate; the controller itself operates on raw nonlinear data without knowledge of the lifting functions. We further show that the proportional structure of the embedding residual can be exploited to obtain an ultimate bound that depends only on the irreducible offset, rather than the worst-case embedding error. The framework is demonstrated on a synchronous generator connected to an infinite bus, for which we construct an explicit physics-informed embedding with error bounds.
Asymmetric Nash Seeking via Best Response Maps: Global Linear Convergence and Robustness to Inexact Reaction Models
Nash equilibria provide a principled framework for modeling interactions in multi-agent decision-making and control. However, many equilibrium-seeking methods implicitly assume that each agent has access to the other agents' objectives and constraints, an assumption that is often unrealistic in practice. This letter studies a class of asymmetric-information two-player constrained games with decoupled feasible sets, in which Player 1 knows its own objective and constraints while Player 2 is available only through a best-response map. For this class of games, we propose an asymmetric projected gradient descent-best response iteration that does not require full mutual knowledge of both players' optimization problems. Under suitable regularity conditions, we establish the existence and uniqueness of the Nash equilibrium and prove global linear convergence of the proposed iteration when the best-response map is exact. Recognizing that best-response maps are often learned or estimated, we further analyze the inexact case and show that, when the approximation error is uniformly bounded by $\varepsilon$, the iterates enter an explicit $O(\varepsilon)$ neighborhood of the true Nash equilibrium. Numerical results on a benchmark game corroborate the predicted convergence behavior and error scaling.
comment: 6 Pages, 2 Figures, Preprint submitted to IEEE L-CSS and CDC 2026
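The asymmetric iteration described above can be sketched on a toy scalar game. Everything specific here is an assumption for illustration: the cost f1(x, y) = (x - 1)^2 + 0.5 x y, the best-response map y = 0.5 x, the box [0, 2], the step size, and the order of the two updates. Player 1 takes a projected gradient step on its own cost, while Player 2 is accessed only through a (possibly biased) best-response oracle.

```python
def project(value, low=0.0, high=2.0):
    """Euclidean projection onto the interval [low, high]."""
    return min(high, max(low, value))


def best_response(x, bias=0.0):
    """Hypothetical oracle for Player 2; `bias` models an inexact map."""
    return project(0.5 * x + bias)


def grad_f1(x, y):
    """Gradient in x of Player 1's toy cost (x - 1)^2 + 0.5 * x * y."""
    return 2.0 * (x - 1.0) + 0.5 * y


def asymmetric_pgd_br(x0=0.0, step=0.2, iterations=100, bias=0.0):
    """Alternate a best-response query with a projected gradient step."""
    x = x0
    for _ in range(iterations):
        y = best_response(x, bias)               # Player 2 reacts to x
        x = project(x - step * grad_f1(x, y))    # Player 1 descends
    return x, best_response(x, bias)


if __name__ == "__main__":
    print(asymmetric_pgd_br())            # exact oracle
    print(asymmetric_pgd_br(bias=0.09))   # oracle with bounded error
```

For this toy game the equilibrium is x = 8/9, y = 4/9. With an exact oracle the error contracts by a constant factor per step, and with a bounded oracle error the iterate settles at a distance proportional to that error, mirroring the behavior the abstract describes.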
Contingency-Aware Planning via Certified Neural Hamilton-Jacobi Reachability
Hamilton-Jacobi (HJ) reachability provides formal safety guarantees for dynamical systems, but the cost of solving high-dimensional HJ partial differential equations limits its use in real-time planning. This paper presents a contingency-aware multi-goal navigation framework that integrates learning-based reachability with sampling-based planning in unknown environments. We use a Fourier Neural Operator (FNO) to approximate the solution operator of the Hamilton-Jacobi-Isaacs variational inequality under varying obstacle configurations. We first provide a theoretical under-approximation guarantee on the safe backward reach-avoid set, which enables formal safety certification of the learned reachable sets. Then, we integrate the certified reachable sets with an incremental multi-goal planner, which enforces reachable-set constraints and a recovery policy that guarantees finite-time return to a safe region. Overall, we demonstrate that the proposed framework achieves asymptotically optimal navigation with provable contingency behavior, and validate its performance through real-time deployment on KUKA's youBot in Webots simulation.
comment: 9 pages, 4 figures
Learning generalized Nash equilibria from pairwise preferences
Generalized Nash Equilibrium Problems (GNEPs) arise in many applications, including non-cooperative multi-agent control problems. Although many methods exist for finding generalized Nash equilibria, most of them rely on assuming knowledge of the objective functions or being able to query the best responses of the agents. We present a method for learning solutions of GNEPs only based on querying agents for their preference between two alternative decisions. We use the collected preference data to learn a GNEP whose equilibrium approximates a GNE of the underlying (unknown) problem. Preference queries are selected using an active-learning strategy that balances exploration of the decision space and exploitation of the learned GNEP. We present numerical results on game-theoretic linear quadratic regulation problems, as well as on other literature GNEP examples, showing the effectiveness of the proposed method.
comment: (6 pages, 6 figures)
Constricting Tubes for Prescribed-Time Safe Control
We propose a constricting Control Barrier Function (CBF) framework for prescribed-time control of control-affine systems with input constraints. Given a system starting outside a target safe set, we construct a time-varying safety tube that shrinks from a relaxed set containing the initial condition to the target set at a user-specified deadline. Any controller rendering this tube forward invariant guarantees prescribed-time recovery by construction. The constriction schedule is bounded and tunable by design, in contrast to prescribed-time methods where control effort diverges near the deadline. Feasibility under input constraints reduces to a single verifiable condition on the constriction rate, yielding a closed-form minimum recovery time as a function of control authority and initial violation. The framework imposes a single affine constraint per timestep regardless of state dimension, scaling to settings where grid-based reachability methods are intractable. We validate on a 16-dimensional multi-agent system and a unicycle reach-avoid problem, demonstrating prescribed-time recovery with bounded control effort.
comment: 7 pages, 5 figures
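A toy scalar version of the constricting-tube idea can be sketched as follows. The setup is entirely assumed for illustration: the system is a single integrator x' = u with |u| <= u_max, the target safe set is x <= 0 (so h(x) = -x), the tube relaxation shrinks linearly to zero at the deadline, and `alpha` is a hypothetical class-K gain. Under these assumptions the feasibility condition reduces to the constriction rate not exceeding u_max.

```python
def min_recovery_time(violation, u_max):
    """Smallest deadline for which the linear schedule is feasible
    in this scalar toy: the constriction rate must not exceed u_max."""
    return violation / u_max


def simulate_tube(x0=2.0, deadline=4.0, u_max=1.0, alpha=5.0,
                  dt=0.01, t_end=5.0):
    """Forward-Euler simulation of a safety filter that keeps the state
    inside the shrinking tube h(x) + delta(t) >= 0, with h(x) = -x."""
    violation = max(0.0, x0)            # initial violation of x <= 0
    rate = violation / deadline         # constant constriction rate
    if rate > u_max:
        raise ValueError("deadline is shorter than the minimum recovery time")
    x, peak_effort = x0, 0.0
    for k in range(int(round(t_end / dt))):
        t = k * dt
        delta = violation * max(0.0, 1.0 - t / deadline)
        delta_dot = -rate if t < deadline else 0.0
        h_tube = -x + delta
        # Tube invariance condition: -u + delta_dot >= -alpha * h_tube.
        u_limit = delta_dot + alpha * h_tube
        u = max(-u_max, min(0.0, u_limit))  # nominal input is zero
        peak_effort = max(peak_effort, abs(u))
        x += dt * u
    return x, peak_effort


if __name__ == "__main__":
    print(min_recovery_time(2.0, 1.0))
    print(simulate_tube())
```

The single affine constraint on u per step and the bounded peak effort are the two features of the abstract that this sketch is meant to mirror; the numbers themselves carry no significance.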
Impacts of Electric Vehicle Charging Regimes and Infrastructure Deployments on System Performance: An Agent-Based Study
The rapid growth of electric vehicles (EVs) requires more effective charging infrastructure planning. Infrastructure layout not only determines deployment cost, but also reshapes charging behavior and influences overall system performance. In addition, destination charging and en-route charging represent distinct charging regimes associated with different power requirements, which may lead to substantially different infrastructure deployment outcomes. This study applies an agent-based modeling framework to generate trajectory-level latent public charging demand under three charging regimes based on a synthetic representation of the Melbourne (Australia) metropolitan area. Two deployment strategies, an optimization-based approach and a utilization-refined approach, are evaluated across different infrastructure layouts. Results show that utilization-refined deployments reduce total system cost, accounting for both infrastructure deployment cost and user generalized charging cost, with the most significant improvement observed under the combined charging regime. In particular, a more effective allocation of AC slow chargers reshapes destination charging behavior, which in turn reduces unnecessary reliance on en-route charging and lowers detour costs associated with en-route charging. This interaction highlights the behavioral linkage between destination and en-route charging regimes and demonstrates the importance of accounting for user response and multiple charging regimes in charging infrastructure planning.
comment: 7 pages, 4 figures
Robust H2/H-infinity control under stochastic requirements: minimizing conditional value-at-risk instead of worst-case performance
Conventional robust H2/H-infinity control minimizes the worst-case performance, often leading to a conservative design driven by very rare parametric configurations. To reduce this conservatism while taking advantage of the stochastic properties of Monte Carlo sampling and its compatibility with parallel computing, we introduce an alternative paradigm that optimizes the controller with respect to a stochastic criterion, namely the conditional value at risk. We present the problem formulation and discuss several open challenges toward a general synthesis framework. The potential of this approach is illustrated on a mechanical system, where it significantly improves overall performance by tolerating some degradation in very rare worst-case scenarios.
comment: Preprint
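The difference between the worst-case criterion and conditional value at risk can be illustrated with a small sample-based estimator. This is a generic sketch, not the paper's synthesis procedure: the function names and the `tail_fraction` parameter are assumptions, and the costs stand in for closed-loop performance values evaluated on Monte Carlo draws of the uncertain parameters.

```python
import math


def worst_case(costs):
    """Classical robust criterion: the single largest sampled cost."""
    return max(costs)


def conditional_value_at_risk(costs, tail_fraction=0.05):
    """Average of the worst `tail_fraction` share of the sampled costs."""
    if not costs:
        raise ValueError("at least one sample is required")
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must lie in (0, 1]")
    ordered = sorted(costs, reverse=True)
    # Round before ceil to avoid floating-point noise such as 5.0000000001.
    tail_size = max(1, math.ceil(round(tail_fraction * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


if __name__ == "__main__":
    samples = [float(i) for i in range(1, 101)]  # placeholder cost samples
    print(worst_case(samples), conditional_value_at_risk(samples))
```

Because the estimate averages over a tail rather than tracking one extreme sample, a controller tuned against it is less dominated by a single rare configuration, which is the source of the reduced conservatism the abstract refers to.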
Neural Control Barrier Functions for Signal Temporal Logic Specifications with Input Constraints
Signal Temporal Logic (STL) provides a powerful framework to describe complex tasks involving temporal and logical behavior in dynamical systems. This work addresses controller synthesis for continuous-time systems subject to STL specifications and input constraints. We propose a neural network-based framework for synthesizing time-varying control barrier functions (TVCBF) and their corresponding controllers for systems to fulfill a fragment of STL specifications while respecting input constraints. We formulate barrier conditions incorporating the spatial and temporal logic of the given STL specification. We also incorporate a method to refine the time-varying set that satisfies the STL specification for the given input constraints. Additionally, we introduce a validity condition to provide formal safety guarantees across the entire state space. Finally, we demonstrate the effectiveness of the proposed approach through several simulation studies considering different STL tasks for various dynamical systems (including affine and non-affine systems).
Safe Output Regulation of Coupled Hyperbolic PDE-ODE Systems
This paper presents a safe output regulation control strategy for a class of systems modeled by a coupled $2\times 2$ hyperbolic PDE-ODE structure, subject to fully distributed disturbances throughout the system. A state-feedback controller is developed using the nonovershooting backstepping method to simultaneously achieve exponential output regulation and enforce safety constraints on the regulated output, which is the state furthest from the control input. To handle unmeasurable states and external disturbances, a state observer and a disturbance estimator are designed. Explicit bounds on the estimation errors are derived and used to construct a robust safe regulator that accounts for the uncertainties. The proposed control scheme guarantees that: 1) if the regulated output is initially within the safe region, it remains there; otherwise, it is rescued to safety within a prescribed time; 2) the output tracking error converges to zero exponentially; 3) the observer accurately estimates both the distributed states and external disturbances, with estimation errors converging to zero exponentially; 4) all signals in the closed-loop system remain bounded. The effectiveness of the proposed method is demonstrated through a UAV delivery scenario with a cable-suspended payload, where the payload is regulated to track a desired reference while avoiding collisions with barriers.
Data-Driven Model Order Reduction of Nonlinear Systems with Noisy Data
Model order reduction techniques simplify high-dimensional dynamical systems by deriving lower-dimensional models that retain essential system characteristics. These techniques are crucial for the controller design of complex systems while significantly reducing computational costs. Nevertheless, constructing effective reduced-order models (ROMs) poses considerable challenges, particularly for nonlinear dynamical systems. These challenges are further exacerbated when the actual system model is unavailable, a scenario frequently encountered in real-world applications. In this work, we propose a data-driven framework for constructing ROMs of nonlinear dynamical systems with unknown mathematical models, enabling controller synthesis directly from the resulting ROMs. We establish similarity relations between the output trajectories of the original systems and those of their ROMs by employing the notion of simulation functions (SFs), thereby enabling a formal characterization of their closeness. To achieve this, we collect one set of noise-corrupted input-state data from the system during a finite-time experiment, upon which we propose conditions to construct both ROMs and SFs simultaneously. These conditions are formulated as data-dependent semidefinite programs. We demonstrate that the data-driven ROMs obtained can be employed to synthesize controllers for the original unknown systems, ensuring that they satisfy high-level logic specifications. This is accomplished by first designing controllers for the data-driven ROMs and then translating the results back to the original systems via interface functions, designed directly from the proposed data-dependent conditions. We evaluate the efficacy of our data-driven framework through two case studies, including a challenging benchmark from the model reduction literature: a circuit of chained inverter gates with 20 state variables.
Free Final Time Adaptive Mesh Covariance Steering via Sequential Convex Programming
In this paper we develop a sequential convex programming (SCP) framework for free-final-time covariance steering of nonlinear stochastic differential equations (SDEs) subject to both additive and multiplicative diffusion. We cast the free-final-time objective through a time-normalization and introduce per-interval time-dilation variables that induce an adaptive discretization mesh, enabling the simultaneous optimization of the control policy and the temporal grid. A central difficulty is that, under multiplicative noise, accurate covariance propagation within SCP requires retaining the first-order diffusion linearization and its coupling with time dilation. We therefore derive the exact local linear stochastic model (preserving the multiplicative structure) and introduce a tractable discretization that maintains the associated diffusion terms, after which each SCP subproblem is solved via conic/semidefinite covariance-steering relaxations with terminal moment constraints and state/control chance constraints. Numerical experiments on a nonlinear double-integrator with drag and velocity-dependent diffusion validate free-final-time minimization through adaptive time allocation and improved covariance accuracy relative to frozen-diffusion linearizations.
comment: Full-length version of paper submitted to L-CSS
Dual-Laws Model for a theory of artificial consciousness
Objectively verifying the generative mechanism of consciousness is extremely difficult because of its subjective nature. As long as theories of consciousness focus solely on its generative mechanism, developing a theory remains challenging. We believe that broadening the theoretical scope and enhancing theoretical unification are necessary to establish a theory of consciousness. This study proposes seven questions that theories of consciousness should address: phenomena, self, causation, state, function, contents, and universality. The questions were designed to examine the functional aspects of consciousness and its applicability to system design. Next, we will examine how our proposed Dual-Laws Model (DLM) can address these questions. Based on our theory, we anticipate two unique features of a conscious system: autonomy in constructing its own goals and cognitive decoupling from external stimuli. We contend that systems with these capabilities differ fundamentally from machines that merely follow human instructions. This makes a design theory that enables high moral behavior indispensable.
Switched Linear Ensemble Systems and Structural Controllability
This paper introduces and solves a structural controllability problem for ensembles of switched linear systems. All individual systems in the ensemble are sparse and governed by the same sparsity pattern, and undergo switching among subsystems by following the same switching sequence. The controllability of an ensemble system describes the ability to use a common control input to simultaneously steer every individual system. A sparsity pattern is called structurally controllable for pair \((k,q)\) if it admits a controllable ensemble of \(q\) individual systems with at most \(k\) subsystems. We derive a necessary and sufficient condition for a sparsity pattern to be structurally controllable for a given \((k,q)\), and characterize when a sparsity pattern admits a finite \(k\) that guarantees structural controllability for \((k,q)\) for arbitrary \(q\). Compared with the linear time-invariant ensemble case, this second condition is strictly weaker. We further show that these conditions have natural connections with maximum flow, and hence can be checked by polynomial algorithms. Specifically, the time complexity of deciding structural controllability is \(O(n^3)\) and the complexity of computing the smallest number of subsystems needed is \(O(n^3 \log n)\), with \(n\) the dimension of each individual system.
Contraction Theory for Nonlinear Stability Analysis and Learning-based Control: A Tutorial Overview
Contraction theory is an analytical tool to study differential dynamics of a non-autonomous (i.e., time-varying) nonlinear system under a contraction metric defined with a uniformly positive definite matrix, the existence of which results in a necessary and sufficient characterization of incremental exponential stability of multiple solution trajectories with respect to each other. By using a squared differential length as a Lyapunov-like function, its nonlinear stability analysis boils down to finding a suitable contraction metric that satisfies a stability condition expressed as a linear matrix inequality, indicating that many parallels can be drawn between well-known linear systems theory and contraction theory for nonlinear systems. Furthermore, contraction theory takes advantage of a superior robustness property of exponential stability used in conjunction with the comparison lemma. This yields much-needed safety and stability guarantees for neural network-based control and estimation schemes, without resorting to a more involved method of using uniform asymptotic stability for input-to-state stability. Such distinctive features permit the systematic construction of a contraction metric via convex optimization, thereby obtaining an explicit exponential bound on the distance between a time-varying target trajectory and solution trajectories perturbed externally due to disturbances and learning errors. The objective of this paper is, therefore, to present a tutorial overview of contraction theory and its advantages in nonlinear stability analysis of deterministic and stochastic systems, with an emphasis on deriving formal robustness and stability guarantees for various learning-based and data-driven automatic control methods. In particular, we provide a detailed review of techniques for finding contraction metrics and associated control and estimation laws using deep neural networks.
comment: Annual Reviews in Control, Preprint Version, Accepted, Oct. 1st
Asymmetry-Aware Routing for Industrial Multimodal Monitoring: A Diagnostic Framework
Multimodal fusion is the default approach for combining heterogeneous sensor streams in industrial monitoring, yet no systematic method exists for determining when fusion degrades rather than improves detection performance. We present an Asymmetry-Aware Routing Framework, a three-step diagnostic procedure (unimodal performance gap, gate weight attribution, modality corruption testing) with formal decision criteria, that routes multimodal systems toward the appropriate fusion strategy before deployment. We validate the framework on three datasets spanning two routing outcomes: (1) the OHT/AGV industrial dataset (thermal + sensors, 13,121 samples), where the framework correctly identifies severe asymmetry (gap ratio 3.1$\times$) and recommends CASCADE; (2) a chain conveyor fault detection scenario (audio + vibration), where moderate asymmetry leads to a FUSE recommendation with positive fusion benefit; and (3) the CWRU bearing dataset, providing controlled validation in both directions. Threshold sensitivity analysis across all three datasets shows that the framework's recommendations are robust to threshold perturbation, with correct routing maintained over a wide parameter plateau. Comparison against simpler diagnostics (gap ratio alone) reveals that Step 1 alone is ambiguous for moderate-asymmetry cases, demonstrating the necessity of the full protocol for reliable routing decisions.
Minimal Intervention Shared Control with Guaranteed Safety under Non-Convex Constraints
Shared control combines human intention with autonomous decision-making. At the low level, the primary goal is to maintain safety regardless of the user's input to the system. However, existing shared control methods (based on, e.g., Model Predictive Control, Control Barrier Functions, or learning-based control) often face challenges with feasibility, scalability, and mixed constraints. To address these challenges, we propose a Constraint-Aware Assistive Controller that computes control actions online while ensuring recursive feasibility, strict constraint satisfaction, and minimal deviation from the user's intent. It also accommodates a structured class of non-convex constraints common in real-world settings. We leverage Robust Controlled Invariant Sets for recursive feasibility and a Mixed-Integer Quadratic Programming formulation to handle non-convex constraints. We validate the approach through a large-scale user study with 66 participants, one of the most extensive in shared control research, using a simulated environment to assess task load, trust, and perceived control, in addition to performance. The results show consistent improvements across all these aspects without compromising safety and user intent. Additionally, a real-world experiment on a robotic manipulator demonstrates the framework's applicability under bounded disturbances, ensuring safety and collision-free operation.
comment: Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA)
Robust Time-Varying Control Barrier Functions with Sector-Bounded Nonlinearities
This paper presents a novel approach for ensuring safe operation of systems subject to input nonlinearities and time-varying safety constraints. We extend the time-varying barrier function framework to address time-varying safety constraints and explicitly account for control-dependent nonlinearities at the plant input. Guaranteed bounds on the input-output behavior of these nonlinearities are provided through pointwise-in-time quadratic constraints. The result is a class of robust time-varying control barrier functions that define a safety filter. This filter ensures robust safety for all admissible nonlinearities while minimally modifying the command generated by a baseline controller. We derive a second-order cone program (SOCP) to compute this safety filter online and provide feasibility conditions for ball-constrained inputs. The proposed approach is demonstrated on a spacecraft docking maneuver.
Mechanistic Foundations of Goal-Directed Control
Mechanistic interpretability has transformed the analysis of transformer circuits by decomposing model behavior into competing algorithms, identifying phase transitions during training, and deriving closed-form predictions for when and why strategies shift. However, this program has remained largely confined to sequence-prediction architectures, leaving embodied control systems without comparable mechanistic accounts. Here we extend this framework to sensorimotor-cognitive development, using infant motor learning as a model system. We show that foundational inductive biases give rise to causal control circuits, with learned gating mechanisms converging toward theoretically motivated uncertainty thresholds. The resulting dynamics reveal a clean phase transition in the arbitration gate whose commitment behavior is well described by a closed-form exponential moving-average surrogate. We identify context window k as the critical parameter governing circuit formation: below a minimum threshold (k$\leq$4) the arbitration mechanism cannot form; above it (k$\geq$8), gate confidence scales asymptotically as log k. A two-dimensional phase diagram further reveals task-demand-dependent route arbitration consistent with the prediction that prospective execution becomes advantageous only when prediction error remains within the task tolerance window. Together, these results provide a mechanistic account of how reactive and prospective control strategies emerge and compete during learning. More broadly, this work sharpens mechanistic accounts of cognitive development and provides principled guidance for the design of interpretable embodied agents.
Online Learning for Supervisory Switching Control
We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy the best controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible with a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to a control-theoretic setting. The proposed data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of state history, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the most suitable controller in $\mathcal{O}(N \log N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.
Voltage-sensitive distribution factors for contingency analysis and topology optimization
Topology optimization is a promising approach for mitigating congestion and managing changing grid conditions, but it is computationally challenging and requires approximations. Conventional distribution factors like PTDFs and LODFs, based on DC power flow, fail to capture voltage variations, reactive power, and losses, thereby limiting their use in detailed optimization tasks such as busbar splitting. This paper introduces generalized distribution factors derived from a voltage-sensitive linearization of the full AC power flow equations. The proposed formulation accurately reflects reactive power flows, Ohmic losses, and voltage deviations while remaining computationally efficient. We derive and evaluate generalized PTDFs, LODFs, and topology modification factors using matrix identities. We discuss potential applications including voltage-aware N-1 security analysis and topology optimization with a focus on busbar splitting. Numerical experiments demonstrate close agreement with full AC solutions, significantly outperforming the traditional DC approximation.
comment: 9 pages, 4 figures. Added performance analysis
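For context, the conventional DC-power-flow PTDFs that the abstract contrasts against can be computed in a few lines. The sketch below is that textbook baseline only, not the voltage-sensitive factors proposed in the paper. The function names `solve` and `dc_ptdf`, the `(from_bus, to_bus, reactance)` line format, and the three-bus example are assumptions chosen for illustration.

```python
from typing import List, Tuple


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (is the network connected?)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def dc_ptdf(n_bus: int, lines: List[Tuple[int, int, float]], slack: int = 0) -> List[List[float]]:
    """ptdf[l][k]: DC flow on line l per unit injected at bus k and withdrawn at the slack."""
    keep = [i for i in range(n_bus) if i != slack]
    idx = {bus: pos for pos, bus in enumerate(keep)}
    # Reduced susceptance matrix (slack row and column removed).
    B = [[0.0] * len(keep) for _ in keep]
    for i, j, x in lines:
        b = 1.0 / x
        for p, q in ((i, j), (j, i)):
            if p != slack:
                B[idx[p]][idx[p]] += b
                if q != slack:
                    B[idx[p]][idx[q]] -= b
    ptdf = [[0.0] * n_bus for _ in lines]
    for k in keep:
        rhs = [1.0 if bus == k else 0.0 for bus in keep]
        theta = [0.0] * n_bus  # slack angle is fixed at zero
        for bus, val in zip(keep, solve(B, rhs)):
            theta[bus] = val
        for l, (i, j, x) in enumerate(lines):
            ptdf[l][k] = (theta[i] - theta[j]) / x
    return ptdf
```

For a triangle of three buses joined by unit-reactance lines `[(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]` with bus 0 as slack, a unit injection at bus 1 gives flows of -2/3, 1/3, and -1/3 on the three lines. Because these factors depend only on reactances, they cannot reflect voltage magnitudes, reactive power, or losses, which is the limitation the abstract points to.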
Lyapunov Constrained Soft Actor-Critic (LC-SAC) using Koopman Operator Theory for Quadrotor Trajectory Tracking
Reinforcement Learning (RL) has achieved remarkable success in solving complex sequential decision-making problems. However, its application to safety-critical physical systems remains constrained by the lack of stability guarantees. Standard RL algorithms prioritize reward maximization, often yielding policies that may induce oscillations or unbounded state divergence. There has been significant work on incorporating Lyapunov-based stability guarantees into RL algorithms, with three key challenges: selecting a candidate Lyapunov function, the computational cost of relying on many function approximators, and the conservative policies that result from enforcing a stability criterion during learning. In this work, we propose a novel Lyapunov-constrained Soft Actor-Critic (LC-SAC) algorithm using Koopman operator theory. We use extended dynamic mode decomposition (EDMD) to produce a linear approximation of the system and use this approximation to derive a closed-form solution for a candidate Lyapunov function. The derived Lyapunov function is incorporated into the SAC algorithm to provide guarantees for a policy that stabilizes the nonlinear system. The method is evaluated on trajectory tracking in a 2D quadrotor environment based on safe-control-gym. The proposed algorithm shows training convergence and decaying violations of the Lyapunov stability criterion compared to a baseline vanilla SAC algorithm. GitHub Repository: https://github.com/DhruvKushwaha/LC-SAC-Quadrotor-Trajectory-Tracking
comment: 11 pages, 7 Figures, submitted to IEEE RA-L
Robust Adaptive MPC Under Nonlinear Time-Varying Uncertainties: An Uncertainty Compensation Approach
This paper introduces an uncertainty compensation-based robust adaptive model predictive control (MPC) framework for linear systems with nonlinear time-varying uncertainties. The framework integrates an L1 adaptive controller to compensate for the matched uncertainty and a robust feedback controller, designed using linear matrix inequalities, to mitigate the effect of unmatched uncertainty on target output channels. Uniform bounds on the errors between the system's states and control inputs and those of a nominal (i.e., uncertainty-free) system are derived. These error bounds are then used to tighten the actual system's state and input constraints, enabling the design of an MPC for the nominal system under these tightened constraints. Referred to as uncertainty compensation-based MPC (UC-MPC), this approach ensures constraint satisfaction while delivering enhanced performance compared to existing methods. Simulation results for a flight control example and a spacecraft landing on an asteroid demonstrate the effectiveness of the proposed framework.
AgriChrono: A Multi-modal Dataset Capturing Crop Growth and Lighting Variability with a Field Robot
Advances in AI and Robotics have accelerated significant initiatives in agriculture, particularly in the areas of robot navigation and 3D digital twin creation. A significant bottleneck impeding this progress is the critical lack of "in-the-wild" datasets that capture the full complexities of real farmland, including non-rigid motion from wind, drastic illumination variance, and morphological changes resulting from growth. This data gap fundamentally limits research on robust AI models for autonomous field navigation and scene-level dynamic 3D reconstruction. In this paper, we present AgriChrono, a modular robotic data collection platform and multi-modal dataset designed to capture these dynamic farmland conditions. Our platform integrates multiple sensors, enabling remote, time-synchronized acquisition of RGB, Depth, LiDAR, IMU, and Pose data for efficient and repeatable long-term data collection in real-world agricultural environments. We successfully collected 18TB of data over one month, documenting the entire growth cycle of Canola under diverse illumination conditions. We benchmark state-of-the-art 3D reconstruction methods on AgriChrono, revealing the profound challenge of reconstructing high-fidelity, dynamic non-rigid scenes in such farmland settings. This benchmark validates AgriChrono as a critical asset for advancing model generalization, and its public release is expected to significantly accelerate research and development in precision agriculture. The code and dataset are publicly available at: https://github.com/StructuresComp/agri-chrono
comment: Keywords: Agricultural Robotics, In-the-wild Dataset, 3D Reconstruction
Push, Press, Slide: Mode-Aware Planar Contact Manipulation via Reduced-Order Models IROS 2026
Non-prehensile planar manipulation, including pushing and press-and-slide, is critical for diverse robotic tasks, but notoriously challenging due to hybrid contact mechanics, under-actuation, and asymmetric friction limits that traditionally necessitate computationally expensive iterative control. In this paper, we propose a mode-aware framework for planar manipulation with one or two robotic arms based on contact topology selection and reduced-order kinematic modeling. Our core insight is that complex wrench-twist limit surface mechanics can be abstracted into a discrete library of physically intuitive models. We systematically map various single-arm and bimanual contact topologies to simple non-holonomic formulations, e.g. unicycle for simplified press-and-slide motion. By anchoring trajectory generation to these reduced-order models, our framework computes the required object wrench and distributes feasible, friction-bounded contact forces via a direct algebraic allocator. We incorporate manipulator kinematics to ensure long-horizon feasibility and demonstrate our fast, optimization-free approach in simulation across diverse single-arm and bimanual manipulation tasks. Supplementary videos and additional information are available at: https://sites.google.com/view/pushpressslide
comment: 8 pages, 13 figures. Submitted to IEEE IROS 2026
CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts
Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non-machine-readable outputs. To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable $\texttt{CircuitJSON}$ schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding-powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force-directed visualization. We evaluate the system on a dataset of 100 unique circuit-design prompts using five state-of-the-art LLMs. To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM-as-a-judge meta-evaluator identifies complex, context-aware design flaws that bypass standard rule-based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic-ready hardware and safe circuit prototyping. Our code and data will be made public.
comment: Under review, 10 pages, 8 figures, 6 tables
Robotics
Towards Generalizable Robotic Manipulation in Dynamic Environments
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.
comment: https://yukangcao.github.io/HSImul3R/
Perception-Aware Autonomous Exploration in Feature-Limited Environments
Autonomous exploration in unknown environments typically relies on onboard state estimation for localisation and mapping. Existing exploration methods primarily maximise coverage efficiency, but often overlook that visual-inertial odometry (VIO) performance strongly depends on the availability of robust visual features. As a result, exploration policies can drive a robot into feature-sparse regions where tracking degrades, leading to odometry drift, corrupted maps, and mission failure. We propose a hierarchical perception-aware exploration framework for a stereo-equipped unmanned aerial vehicle (UAV) that explicitly couples exploration progress with feature observability. Our approach (i) associates each candidate frontier with an expected feature quality using a global feature map, and prioritises visually informative subgoals, and (ii) optimises a continuous yaw trajectory along the planned motion to maintain stable feature tracks. We evaluate our method in simulation across environments with varying texture levels and in real-world indoor experiments with largely textureless walls. Compared to baselines that ignore feature quality and/or do not optimise continuous yaw, our method maintains more reliable feature tracking, reduces odometry drift, and achieves on average 30\% higher coverage before the odometry error exceeds specified thresholds.
EAAE: Energy-Aware Autonomous Exploration for UAVs in Unknown 3D Environments
Battery-powered multirotor unmanned aerial vehicles (UAVs) can rapidly map unknown environments, but mission performance is often limited by energy rather than geometry alone. Standard exploration policies that optimise for coverage or time can therefore waste energy through manoeuvre-heavy trajectories. In this paper, we address energy-aware autonomous 3D exploration for multirotor UAVs in initially unknown environments. We propose Energy-Aware Autonomous Exploration (EAAE), a modular frontier-based framework that makes energy an explicit decision variable during frontier selection. EAAE clusters frontiers into view-consistent regions, plans dynamically feasible candidate trajectories to the most informative clusters, and predicts their execution energy using an offline power estimation loop. The next target is then selected by minimising predicted trajectory energy while preserving exploration progress through a dual-layer planning architecture for safe execution. We evaluate EAAE in a full exploration pipeline with a rotor-speed-based power model across simulated 3D environments of increasing complexity. Compared to representative distance-based and information gain-based frontier baselines, EAAE consistently reduces total energy consumption while maintaining competitive exploration time and comparable map quality, providing a practical drop-in energy-aware layer for frontier exploration.
From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation
Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
comment: 31 pages
Panoramic Affordance Prediction
Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.
Kimodo: Scaling Controllable Human Motion Generation
High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how the scaling of dataset size and model size affects performance.
comment: Project page: https://research.nvidia.com/labs/sil/projects/kimodo/
Optimal control of differentially flat underactuated planar robots in the perspective of oscillation mitigation
Underactuated robots have more degrees of freedom than actuators and, if they are designed with a specific mass distribution, can be controlled by means of differential flatness theory. This structural property enables the development of lightweight and cost-effective robotic systems with enhanced dexterity. However, a key challenge lies in managing the passive joints, whose control demands precise and comprehensive dynamic modeling of the system. To simplify dynamic models, particularly for low-speed trajectories, friction is often neglected. While this assumption simplifies analysis and control design, it introduces residual oscillations of the end-effector about the target position. In this paper, the use of optimal control together with differential flatness control is investigated to improve the tracking of planned trajectories. The study was first carried out through formal analysis and then validated by means of numerical simulations. Results highlight that optimal control can be used to plan the flat variables considering different (quadratic) performance indices: control effort, i.e. motor torque, and potential energy of the considered underactuated joint. Moreover, the minimization of potential energy can be used to design motion laws that are robust against variations in the stiffness and damping of the underactuated joint, thus reducing oscillations in the case of stiffness/damping mismatch.
comment: Accepted to European Control Conference (ECC 2026)
Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation CVPR 2026
Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting and propose the Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework, which trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at https://github.com/zyfone/EDA-PSeg.
comment: Accepted to CVPR 2026. The code is available at https://github.com/zyfone/EDA-PSeg
On the Derivation of Tightly-Coupled LiDAR-Inertial Odometry with VoxelMap
This note presents a concise mathematical formulation of tightly-coupled LiDAR-Inertial Odometry within an iterated error-state Kalman filter framework using a VoxelMap representation. Rather than proposing a new algorithm, it provides a clear and self-contained derivation that unifies the geometric modeling and probabilistic state estimation through consistent notation and explicit formulations. The document is intended to serve both as a technical reference and as an accessible entry point for a foundational understanding of the system architecture and estimation principles.
RoCo Challenge at AAAI 2026: Benchmarking Robotic Collaborative Manipulation for Assembly Towards Industrial Automation
Embodied Artificial Intelligence (EAI) is developing rapidly, gradually shifting the paradigm of autonomous systems from isolated perception to integrated, continuous action. This transition is highly significant for industrial robotic manipulation, promising to free human workers from repetitive, dangerous daily labor. To benchmark and advance this capability, we introduce the Robotic Collaborative Assembly Assistance (RoCo) Challenge, together with a dataset for simulated and real-world assembly manipulation. Set against the backdrop of human-centered manufacturing, this challenge focuses on a high-precision planetary gearbox assembly task, a demanding yet highly representative operation in modern industry. Built upon a self-developed data collection, training, and evaluation system in Isaac Sim, and utilizing a dual-arm robot for real-world deployment, the challenge operates in two phases. The Simulation Round defines fine-grained task phases for step-wise scoring to handle the long-horizon nature of the assembly. The Real-World Round mirrors this evaluation with physical gearbox components and high-quality teleoperated datasets. The core tasks require assembling an epicyclic gearbox from scratch, including mounting three planet gears, a sun gear, and a ring gear. Attracting over 60 teams and 170+ participants from more than 10 countries, the challenge yielded highly effective solutions, most notably ARC-VLA and RoboCola. Results demonstrate that a dual-model framework for long-horizon multi-task learning is highly effective, and the strategic utilization of recovery-from-failure curriculum data is a critical insight for successful deployment. This report outlines the competition setup, evaluation approach, key findings, and future directions for industrial EAI. Our dataset, CAD files, code, and evaluation results can be found at: https://rocochallenge.github.io/RoCo2026/.
comment: 16 pages, 8 figures
Zero-Shot Generalization from Motion Demonstrations to New Tasks
Learning motion policies from expert demonstrations is an essential paradigm in modern robotics. While end-to-end models aim for broad generalization, they require large datasets and computationally heavy inference. Conversely, learning dynamical systems (DS) provides fast, reactive, and provably stable control from very few demonstrations. However, existing DS learning methods typically model isolated tasks and struggle to reuse demonstrations for novel behaviors. In this work, we formalize the problem of combining isolated demonstrations within a shared workspace to enable generalization to unseen tasks. The Gaussian Graph is introduced, which reinterprets spatial components of learned motion primitives as discrete vertices with connections to one another. This formulation allows us to bridge continuous control with discrete graph search. We propose two frameworks leveraging this graph: Stitching, for constructing time-invariant DSs, and Chaining, giving a sequence-based DS for complex motions while retaining convergence guarantees. Simulations and real-robot experiments show that these methods successfully generalize to new tasks where baseline methods fail.
Formalisms for Robotic Mission Specification and Execution: A Comparative Analysis
Robots are increasingly deployed across diverse domains and designed for multi-purpose operation. As robotic systems grow in complexity and operate in dynamic environments, the need for structured, expressive, and scalable mission-specification approaches becomes critical, with mission specifications often defined in the field by domain experts rather than robotics specialists. However, there is no standard or widely accepted formalism for specifying missions in single- or multi-robot systems. A variety of formalisms, such as Behavior Trees, State Machines, Hierarchical Task Networks, and Business Process Model and Notation, have been adopted in robotics to varying degrees, each providing different levels of abstraction, expressiveness, and support for integration with human workflows and external devices. This paper presents a systematic analysis of these four formalisms with respect to their suitability for robot mission specification. Our study focuses on mission-level descriptions rather than robot software development. We analyze their underlying control structures and mission concepts, evaluate their expressiveness and limitations in modeling real-world missions, and assess the extent of available tool support. By comparing the formalisms and validating our findings with experts, we provide insights into their applicability, strengths, and shortcomings in robotic system modeling. The results aim to support practitioners and researchers in selecting appropriate modeling approaches for designing robust and adaptable robot and multi-robot missions.
MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings
Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient and often lacks generalization across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation across models with differing VLM backbones, in both in-distribution and out-of-distribution scenarios in multi-agent team settings.
comment: 7 pages, 6 figures
End-to-End Dexterous Grasp Learning from Single-View Point Clouds via a Multi-Object Scene Dataset
Dexterous grasping in multi-object scenes constitutes a fundamental challenge in robotic manipulation. Current mainstream grasping datasets predominantly focus on single-object scenarios and predefined grasp configurations, often neglecting environmental interference and the modeling of dexterous pre-grasp gestures, thereby limiting their generalizability in real-world applications. To address this, we propose DGS-Net, an end-to-end grasp prediction network capable of learning dense grasp configurations from single-view point clouds in multi-object scenes. Furthermore, we propose a two-stage grasp data generation strategy that progresses from dense single-object grasp synthesis to dense scene-level grasp generation. Our dataset comprises 307 objects, 240 multi-object scenes, and over 350k validated grasps. By explicitly modeling grasp offsets and pre-grasp configurations, the dataset provides more robust and accurate supervision for dexterous grasp learning. Experimental results show that DGS-Net achieves grasp success rates of 88.63\% in simulation and 78.98\% on a real robotic platform, while exhibiting lower penetration with a mean penetration depth of 0.375 mm and penetration volume of 559.45 mm$^3$, outperforming existing methods and demonstrating strong effectiveness and generalization capability. Our dataset is available at https://github.com/4taotao8/DGS-Net.
comment: 10 pages, 6 figures. Submitted to IEEE Transactions on Automation Science and Engineering (T-ASE)
Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization
Morphology-control co-design concerns the coupled optimization of an agent's body structure and control policy. This problem exhibits a bi-level structure, where the control dynamically adapts to the morphology to maximize performance. Existing methods typically neglect the control's adaptation dynamics by adopting a single-level formulation that treats the control policy as fixed when optimizing morphology. This can lead to inefficient optimization, as morphology updates may be misaligned with control adaptation. In this paper, we revisit the co-design problem from a game-theoretic perspective, modeling the intrinsic coupling between morphology and control as a novel variant of a Stackelberg game. We propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control's adaptation dynamics into morphology optimization. By modeling this intrinsic coupling, our method aligns morphology updates with control adaptation, thereby stabilizing training and improving learning efficiency. Experiments across diverse co-design tasks demonstrate that Stackelberg PPO outperforms standard PPO in both stability and final performance, opening the way for dramatically more efficient robotics designs.
comment: presented at the Fourteenth International Conference on Learning Representations; 11 pages in main text + 3 pages of references + 23 pages of appendices, 5 figures in main text + 11 figures in appendices, 16 tables in appendices; accompanying website available at https://yanningdai.github.io/stackelberg-ppo-co-design/ ; source code available at https://github.com/YanningDai/StackelbergPPO
NavThinker: Action-Conditioned World Models for Coupled Prediction and Planning in Social Navigation
Social navigation requires robots to act safely in dynamic human environments. Effective behavior demands thinking ahead: reasoning about how the scene and pedestrians evolve under different robot actions rather than reacting to current observations alone. This creates a coupled prediction-planning challenge, where robot actions and human motion mutually influence each other. To address this challenge, we propose NavThinker, a future-aware framework that couples an action-conditioned world model with on-policy reinforcement learning. The world model operates in the Depth Anything V2 patch feature space and performs autoregressive prediction of future scene geometry and human motion; multi-head decoders then produce future depth maps and human trajectories, yielding a future-aware state aligned with traversability and interaction risk. Crucially, we train the policy with DD-PPO while injecting world-model think-ahead signals via: (i) action-conditioned future features fused into the current observation embedding and (ii) social reward shaping from predicted human trajectories. Experiments on single- and multi-robot Social-HM3D show state-of-the-art navigation success, with zero-shot transfer to Social-MP3D and real-world deployment on a Unitree Go2, validating generalization and practical applicability. Webpage: https://github.com/hutslib/NavThinker.
User-Tailored Learning to Forecast Walking Modes for Exosuits
Assistive robotic devices, like soft lower-limb exoskeletons or exosuits, are becoming widespread, with the promise of helping people in everyday life. To make such systems adaptive to the variety of users wearing them, it is desirable to endow exosuits with advanced perception systems. However, exosuits have little sensory equipment because they need to be light and easy to wear. This paper presents a perception module based on machine learning that aims at estimating 3 walking modes (i.e., ascending or descending stairs and walking on level ground) of users wearing an exosuit. We tackle this perception problem using only inertial data from two sensors. Our approach provides an estimate for both future and past timesteps that supports control and enables a self-labeling procedure for online model adaptation. Indeed, we show that our estimate can label data acquired online and refine the model for new users. A thorough analysis carried out on real-life datasets shows the effectiveness of our user-tailored perception module. Finally, we integrate our system with the exosuit in a closed-loop controller, validating its performance in an online single-subject experiment.
GNIO: Gated Neural Inertial Odometry
Inertial navigation using low-cost MEMS sensors is plagued by rapid drift due to sensor noise and bias instability. While recent data-driven approaches have made significant strides, they often struggle with micro-drifts during stationarity and mode fusion during complex motion transitions due to their reliance on fixed-window regression. In this work, we introduce Gated Neural Inertial Odometry (GNIO), a novel learning-based framework that explicitly models motion validity and context. We propose two key architectural innovations: (1) a learnable Motion Bank that queries a global dictionary of motion patterns to provide semantic context beyond the local receptive field, and (2) a Gated Prediction Head that decomposes displacement into magnitude and direction. This gating mechanism acts as a soft, differentiable Zero-Velocity Update (ZUPT), dynamically suppressing sensor noise during stationary periods while scaling predictions during dynamic motion. Extensive experiments across four public benchmarks demonstrate that GNIO significantly reduces position drift compared to state-of-the-art CNN and Transformer-based baselines. Notably, GNIO achieves a 60.21% reduction in trajectory error on the OxIOD dataset and exhibits superior generalization in challenging scenarios involving frequent stops and irregular motion speeds.
comment: Submitted to IEEE Robotics and Automation Letters
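The gated magnitude-direction decomposition described in the GNIO abstract can be illustrated with a minimal sketch. The function name, the sigmoid gate, and the 2D setting are assumptions made here for illustration; the abstract does not specify how the gate is computed or how the head is parameterized.

```python
import math


def gated_displacement(gate_logit, magnitude, direction):
    """Combine a motion gate, a magnitude, and a direction into a 2D displacement.

    Hypothetical illustration: the gate is a sigmoid of a scalar logit (an
    assumption), so a strongly negative logit pushes the output toward zero,
    mimicking a soft zero-velocity update.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # value in (0, 1)
    norm = math.hypot(direction[0], direction[1])
    if norm == 0.0:
        return (0.0, 0.0)  # no defined heading, so report no motion
    scale = gate * magnitude / norm  # gate scales the predicted magnitude
    return (scale * direction[0], scale * direction[1])


# Same magnitude and heading, different gate values.
moving = gated_displacement(gate_logit=6.0, magnitude=0.5, direction=(3.0, 4.0))
stationary = gated_displacement(gate_logit=-6.0, magnitude=0.5, direction=(3.0, 4.0))
```

With the gate near one the displacement keeps almost its full magnitude, and with the gate near zero it is suppressed, which is the qualitative behavior the abstract attributes to the gating mechanism.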
Encirclement Guaranteed Finite-Time Capture against Unknown Evader Strategies
We consider a pursuit-evasion scenario involving a group of pursuers and a single evader in a two-dimensional unbounded environment. The pursuers aim to capture the evader in finite time while ensuring the evader remains enclosed within the convex hull of their positions until capture, without knowledge of the evader's heading angle. Prior works have addressed the problem of encirclement and capture separately in different contexts. In this paper, we present a class of strategies for the pursuers that guarantee capture in finite time while maintaining encirclement, irrespective of the evader's strategy. Furthermore, we derive an upper bound on the time to capture. Numerical results highlight the effectiveness of the proposed framework against a range of evader strategies.
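The encirclement condition in the abstract above (the evader stays inside the convex hull of the pursuers' positions) can be checked numerically. The following is a minimal sketch of such a check only, not the paper's pursuit strategy; all function names are hypothetical, and the hull is computed with the standard monotone-chain algorithm.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the convex hull in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def is_encircled(evader, pursuers):
    """True if the evader lies inside or on the pursuers' convex hull."""
    hull = convex_hull(pursuers)
    if len(hull) < 3:
        return False  # degenerate hull (point or segment) cannot enclose
    n = len(hull)
    # For a counter-clockwise hull, an interior point is to the left of every edge.
    return all(cross(hull[i], hull[(i + 1) % n], evader) >= 0 for i in range(n))


pursuers = [(0, 0), (4, 0), (4, 4), (0, 4)]
inside = is_encircled((2, 2), pursuers)
outside = is_encircled((5, 2), pursuers)
```

A simulation of a pursuit strategy could call `is_encircled` at every time step to confirm that encirclement is maintained until capture.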
MoE-ACT: Scaling Multi-Task Bimanual Manipulation with Sparse Language-Conditioned Mixture-of-Experts Transformers
The ability of robots to handle multiple tasks under a unified policy is critical for deploying embodied intelligence in real-world household and industrial applications. However, out-of-distribution variation across tasks often causes severe task interference and negative transfer when training general robotic policies. To address this challenge, we propose a lightweight multi-task imitation learning framework for bimanual manipulation, termed Mixture-of-Experts-Enhanced Action Chunking Transformer (MoE-ACT), which integrates sparse Mixture-of-Experts (MoE) modules into the Transformer encoder of ACT. The MoE layer decomposes a unified task policy into independently invoked expert components. Through adaptive activation, it naturally decouples multi-task action distributions in latent space. During decoding, Feature-wise Linear Modulation (FiLM) dynamically modulates action tokens to improve consistency between action generation and task instructions. In parallel, multi-scale cross-attention enables the policy to simultaneously focus on both low-level and high-level semantic features, providing rich visual information for robotic manipulation. We further incorporate textual information, transitioning the framework from a purely vision-based model to a vision-centric, language-conditioned action generation system. Experimental validation in both simulation and a real-world dual-arm setup shows that MoE-ACT substantially improves multi-task performance. Specifically, MoE-ACT outperforms vanilla ACT by an average of 33% in success rate. These results indicate that MoE-ACT provides stronger robustness and generalization in complex multi-task bimanual manipulation environments. Our open-source project page can be found at https://j3k7.github.io/MoE-ACT/.
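Feature-wise Linear Modulation, which the abstract above applies to action tokens, is commonly defined as a per-channel scale and shift. The sketch below shows only that operation; in a real model the `gamma` and `beta` vectors would be predicted from the task instruction by a learned network, whereas here they are supplied directly as hypothetical values.

```python
def film(features, gamma, beta):
    """Apply Feature-wise Linear Modulation: gamma * feature + beta per channel."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have the same length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical three-channel action token and conditioning parameters.
action_token = [1.0, -2.0, 0.5]
gamma = [2.0, 1.0, 0.0]  # per-channel scale
beta = [0.0, 0.5, 1.0]   # per-channel shift
modulated = film(action_token, gamma, beta)
```

Setting a channel's `gamma` to zero replaces that channel with its `beta`, which shows how the conditioning signal can override parts of the token.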
HapticVLA: Contact-Rich Manipulation via Vision-Language-Action Model without Inference-Time Tactile Sensing
Tactile sensing is a crucial capability for Vision-Language-Action (VLA) architectures, as it enables dexterous and safe manipulation in contact-rich tasks. However, reliance on dedicated tactile hardware increases cost and reduces reproducibility across robotic platforms. We argue that tactile-aware manipulation can be learned offline and deployed without direct haptic feedback at inference. To this end, we present HapticVLA, which proceeds in two tightly coupled stages: Safety-Aware Reward-Weighted Flow Matching (SA-RWFM) and Tactile Distillation (TD). SA-RWFM trains a flow-matching action expert that incorporates precomputed, safety-aware tactile rewards penalizing excessive grasping force and suboptimal grasping trajectories. TD further transfers this tactile-aware capability into a conventional VLA: we distill a compact tactile token from the SA-RWFM teacher and train a student VLA to predict that token from vision and state modalities, enabling tactile-aware action generation at inference without requiring on-board tactile sensors. This design preserves contact-rich tactile-aware reasoning within VLA while removing the need for on-board tactile sensors during deployment. In real-world experiments, HapticVLA achieves a mean success rate of 86.7%, consistently outperforming baseline VLAs, including versions provided with direct tactile feedback during inference.
A Methodology for Dynamic Parameters Identification of 3-DOF Parallel Robots in Terms of Relevant Parameters
The identification of dynamic parameters in mechanical systems is important for improving model-based control as well as for performing realistic dynamic simulations. Generally, when identification techniques are applied, only a subset of so-called base parameters can be identified. Moreover, some of these parameters cannot be identified properly because they contribute little to the robot dynamics; in the presence of measurement noise and modeling discrepancies, their identifiability decreases. For this reason, a strategy for dynamic parameter identification of fully parallel robots in terms of a subset called relevant parameters is put forward. The proposed methodology starts from a full dynamic model, which is then simplified by exploiting the geometry of each link and the symmetry among the legs of fully parallel robots. After that, the identification is done by Weighted Least Squares. Then, using statistical considerations, the model is reduced until the physical feasibility conditions are met. The proposed strategy has been experimentally tested on two different configurations of actual 3-DOF parallel robots. The response of the inverse and forward dynamics of the identified models agrees with experiments. In order to evaluate the forward dynamics response, an approach for obtaining the forward dynamics in terms of the relevant parameters is also proposed.
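The Weighted Least Squares step mentioned in the abstract above can be sketched on a toy problem. This is a generic illustration restricted to two parameters so the normal equations can be solved in closed form; the regressor rows, weights, and function name are hypothetical and are not taken from the paper.

```python
def weighted_least_squares(rows, targets, weights):
    """Solve min_phi sum_i w_i * (y_i - a_i . phi)^2 for a 2-parameter phi.

    Builds the 2x2 weighted normal equations (A^T W A) phi = A^T W y and
    solves them with Cramer's rule.
    """
    s11 = s12 = s22 = b1 = b2 = 0.0
    for (a1, a2), y, w in zip(rows, targets, weights):
        s11 += w * a1 * a1
        s12 += w * a1 * a2
        s22 += w * a2 * a2
        b1 += w * a1 * y
        b2 += w * a2 * y
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("regressor is rank-deficient; parameters not identifiable")
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)


# Hypothetical noise-free measurements generated from phi = (2.0, 0.5).
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [2.0, 0.5, 2.5, 4.5]
weights = [1.0, 4.0, 0.25, 2.0]  # e.g. inverse measurement variances
phi = weighted_least_squares(rows, targets, weights)
```

The rank-deficiency check mirrors the abstract's point that only some parameter combinations are identifiable: when the regressor columns are linearly dependent, no unique estimate exists.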
Coupled Particle Filters for Robust Affordance Estimation ICRA
Robotic affordance estimation is challenging due to visual, geometric, and semantic ambiguities in sensory input. We propose a method that disambiguates these signals using two coupled recursive estimators for sub-aspects of affordances: graspable and movable regions. Each estimator encodes property-specific regularities to reduce uncertainty, while their coupling enables bidirectional information exchange that focuses attention on regions where both agree, i.e., affordances. Evaluated on a real-world dataset, our method outperforms three recent affordance estimators (Where2Act, Hands-as-Probes, and HRP) by 308%, 245%, and 257% in precision, and remains robust under challenging conditions such as low light or cluttered environments. Furthermore, our method achieves a 70% success rate in our real-world evaluation. These results demonstrate that coupling complementary estimators yields precise, robust, and embodiment-appropriate affordance predictions.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation
Simulating realistic environments for robots is widely recognized as a critical challenge in robot learning, particularly in terms of rendering and physical simulation. This challenge becomes even more pronounced in navigation tasks, where trajectories often extend across multiple rooms or entire floors. In this work, we present NavGSim, a Gaussian Splatting-based simulator designed to generate high-fidelity, large-scale navigation environments. Built upon a hierarchical 3D Gaussian Splatting framework, NavGSim enables photorealistic rendering in expansive scenes spanning hundreds of square meters. To simulate navigation collisions, we introduce a Gaussian Splatting-based slice technique that directly extracts navigable areas from reconstructed Gaussians. Additionally, for ease of use, we provide comprehensive NavGSim APIs supporting multi-GPU development, including tools for custom scene reconstruction, robot configuration, policy training, and evaluation. To evaluate NavGSim's effectiveness, we train a Vision-Language-Action (VLA) model using trajectories collected from NavGSim and assess its performance in both simulated and real-world environments. Our results demonstrate that NavGSim significantly enhances the VLA model's scene understanding, enabling the policy to handle diverse navigation queries effectively.
What Matters for Scalable and Robust Learning in End-to-End Driving Planners? CVPR
End-to-end autonomous driving has gained significant attention for its potential to learn robust behavior in interactive scenarios and scale with data. Popular architectures often build on separate modules for perception and planning connected through latent representations, such as bird's eye view feature grids, to maintain end-to-end differentiability. This paradigm emerged mostly on open-loop datasets, with evaluation focusing not only on driving performance, but also intermediate perception tasks. Unfortunately, architectural advances that excel in open-loop often fail to translate to scalable learning of robust closed-loop driving. In this paper, we systematically re-examine the impact of common architectural patterns on closed-loop performance: (1) high-resolution perceptual representations, (2) disentangled trajectory representations, and (3) generative planning. Crucially, our analysis evaluates the combined impact of these patterns, revealing both unexpected limitations as well as underexplored synergies. Building on these insights, we introduce BevAD, a novel lightweight and highly scalable end-to-end driving architecture. BevAD achieves 72.7% success rate on the Bench2Drive benchmark and demonstrates strong data-scaling behavior using pure imitation learning. Our code and models are publicly available here: https://dmholtz.github.io/bevad/
comment: To be published in CVPR Findings 2026
KiRAS: Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning in Quadruped Robots ICRA
With advances in reinforcement learning and imitation learning, quadruped robots can acquire diverse skills within a single policy by imitating multiple skill-specific datasets. However, the lack of datasets on complex terrains limits the ability of such multi-skill policies to generalize effectively in unstructured environments. Inspired by animation, we adopt keyframes as minimal and universal skill representations, relaxing dataset constraints and enabling the integration of terrain adaptability with skill diversity. We propose Keyframe Guided Self-Imitation for Robust and Adaptive Skill Learning (KiRAS), an end-to-end framework for acquiring and transitioning between diverse skill primitives on complex terrains. KiRAS first learns diverse skills on flat terrain through keyframe-guided self-imitation, eliminating the need for expert datasets; then continues training the same policy network on rough terrains to enhance robustness. To eliminate catastrophic forgetting, a proficiency-based Skill Initialization Technique is introduced. Experiments on Solo-8 and Unitree Go1 robots show that KiRAS enables robust skill acquisition and smooth transitions across challenging terrains. This framework demonstrates its potential as a lightweight platform for multi-skill generation and dataset collection. It further enables flexible skill transitions that enhance locomotion on challenging terrains.
comment: Received by 2026 IEEE International Conference on Robotics and Automation (ICRA)
ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation CVPR 2026
Embodied intelligence for contact-rich manipulation has predominantly relied on position control, while explicit awareness and regulation of interaction forces remain under-explored, limiting stability, precision, and robustness in real-world tasks. We propose ForceVLA2, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. ForceVLA2 introduces force-based prompts into the VLM expert to construct force-aware task concepts across stages, and employs a Cross-Scale Mixture-of-Experts (MoE) in the action expert to adaptively fuse these concepts with real-time interaction forces for closed-loop hybrid force-position regulation. To support learning and evaluation, we construct ForceVLA2-Dataset, containing 1,000 trajectories over 5 contact-rich tasks, including wiping, pressing, and assembling, with multi-view images, task prompts, proprioceptive state, and force signals. Extensive experiments show that ForceVLA2 substantially improves success rates and reliability in contact-rich manipulation, outperforming pi0 and pi0.5 by 48.0% and 35.0%, respectively, across the 5 tasks, and mitigating common failure modes such as arm overload and unstable contact, thereby actively advancing force-aware interactive physical intelligence in VLAs. The project page is available at https://sites.google.com/view/force-vla2/home.
comment: Accepted by CVPR 2026
Master Micro Residual Correction with Adaptive Tactile Fusion and Force-Mixed Control for Contact-Rich Manipulation
Robotic contact-rich and fine-grained manipulation remains a significant challenge due to complex interaction dynamics and the competing requirements of multi-timescale control. While current visual imitation learning methods excel at long-horizon planning, they often fail to perceive critical interaction cues like friction variations or incipient slip, and struggle to balance global task coherence with local reactive feedback. To address these challenges, we propose M2-ResiPolicy, a novel Master-Micro residual control architecture that synergizes high-level action guidance with low-level correction. The framework consists of a Master-Guidance Policy (MGP) operating at 10 Hz, which generates temporally consistent action chunks via a diffusion-based backbone and employs a tactile-intensity-driven adaptive fusion mechanism to dynamically modulate perceptual weights between vision and touch. Simultaneously, a high-frequency (60 Hz) Micro-Residual Corrector (MRC) utilizes a lightweight GRU to provide real-time action compensation based on TCP wrench feedback. This policy is further integrated with a force-mixed PBIC execution layer, effectively regulating contact forces to ensure interaction safety. Experiments across several demanding tasks, including fragile object grasping and precision insertion, demonstrate that M2-ResiPolicy significantly outperforms standard Diffusion Policy (DP) and state-of-the-art Reactive Diffusion Policy (RDP), achieving a 93% damage-free success rate in chip grasping and superior force regulation stability.
Confusion-Aware In-Context-Learning for Vision-Language Models in Robotic Manipulation SC
Vision-language models (VLMs) have significantly improved the generalization capabilities of robotic manipulation. However, VLM-based systems often suffer from a lack of robustness, leading to unpredictable errors, particularly in scenarios involving confusable objects. Our preliminary analysis reveals that these failures are mainly caused by the shortcut learning problem inherent in VLMs, which limits their ability to accurately distinguish between confusable features. To this end, we propose Confusion-Aware In-Context Learning (CAICL), a method that enhances VLM performance in confusable scenarios for robotic manipulation. The approach begins with confusion localization and analysis, identifying potential sources of confusion. This information is then used as a prompt for the VLM to focus on features most likely to cause misidentification. Extensive experiments on VIMA-Bench show that CAICL effectively addresses the shortcut learning issue, achieving an 85.5% success rate and showing good stability across tasks with different degrees of generalization.
comment: Accepted by the 29th International Conference on Computer Supported Cooperative Work in Design (CSCWD 2026)
A Novel Camera-to-Robot Calibration Method for Vision-Based Floor Measurements SP
A novel hand-eye calibration method for ground-observing mobile robots is proposed. While cameras on mobile robots are common, they are rarely used for ground-observing measurement tasks. Laser trackers are increasingly used in robotics for precise localization. A referencing plate is designed to combine the two measurement modalities of laser-tracker 3D metrology and camera-based 2D imaging. It incorporates reflector nests for pose acquisition using a laser tracker and a camera calibration target that is observed by the robot-mounted camera. The procedure comprises estimating the plate pose, the plate-camera pose, and the robot pose, followed by computing the robot-camera transformation. Experiments indicate sub-millimeter repeatability.
comment: 8 pages; accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
BodyGuards: Escorting by Multiple Robots in Unknown Environment under Limited Communication ICRA 2026
Multi-robot systems are increasingly deployed in high-risk missions such as reconnaissance, disaster response, and subterranean operations. Protecting a human operator while navigating unknown and adversarial environments remains a critical challenge, especially when the communication among the operator and robots is restricted. Unlike existing collaborative exploration methods that aim for complete coverage, this work focuses on task-oriented exploration to minimize the navigation time of the operator to reach its goal while ensuring safety under adversarial threats. A novel escorting framework, BodyGuards, is proposed to seamlessly integrate collaborative exploration, inter-robot-operator communication, and escorting. The framework consists of three core components: (I) a dynamic movement strategy for the operator that maintains a local map with risk zones for proactive path planning; (II) a dual-mode robotic strategy combining frontier-based exploration with optimized return events to balance exploration, threat detection, and intermittent communication; and (III) multi-robot coordination protocols that jointly plan exploration and information sharing for efficient escorting. Extensive human-in-the-loop simulations and hardware experiments demonstrate that the method significantly reduces operator risk and mission time, outperforming baselines in adversarial and constrained environments.
comment: Accepted by ICRA 2026
AeroGrab: A Unified Framework for Aerial Grasping in Cluttered Environments
Reliable aerial grasping in cluttered environments remains challenging due to occlusions and collision risks. Existing aerial manipulation pipelines largely rely on centroid-based grasping and lack integration between the grasp pose generation models, active exploration, and language-level task specification, resulting in the absence of a complete end-to-end system. In this work, we present an integrated pipeline for reliable aerial grasping in cluttered environments. Given a scene and a language instruction, the system identifies the target object and actively explores it to gain better views of the object. During exploration, a grasp generation network predicts multiple 6-DoF grasp candidates for each view. Each candidate is evaluated using a collision-aware feasibility framework, and the overall best grasp is selected and executed using standard trajectory generation and control methods. Experiments in cluttered real-world scenarios demonstrate robust and reliable grasp execution, highlighting the effectiveness of combining active perception with feasibility-aware grasp selection for aerial manipulation.
HALO: Closing Sim-to-Real Gap for Heavy-loaded Humanoid Agile Motion Skills via Differentiable Simulation
Humanoid robots deployed in real-world scenarios often need to carry unknown payloads, which introduce significant mismatch and degrade the effectiveness of simulation-to-reality reinforcement learning methods. To address this challenge, we propose a two-stage gradient-based system identification framework built on the differentiable simulator MuJoCo XLA. The first stage calibrates the nominal robot model using real-world data to reduce intrinsic sim-to-real discrepancies, while the second stage further identifies the mass distribution of the unknown payload. By explicitly reducing structured model bias prior to policy training, our approach enables zero-shot transfer of reinforcement learning policies to hardware under heavy-load conditions. Extensive simulation and real-world experiments demonstrate more precise parameter identification, improved motion tracking accuracy, and substantially enhanced agility and robustness compared to existing baselines. Project Page: https://mwondering.github.io/halo-humanoid/
comment: 9 pages, 5 figures, conference
Multi-Mode Pneumatic Artificial Muscles Driven by Hybrid Positive-Negative Pressure
Artificial muscles embody human aspirations for engineering lifelike robotic movements. This paper introduces an architecture for Inflatable Fluid-Driven Origami-Inspired Artificial Muscles (IN-FOAMs). A typical IN-FOAM consists of an inflatable skeleton enclosed within an outer skin, which can be driven using a combination of positive and negative pressures (e.g., compressed air and vacuum). IN-FOAMs are manufactured using low-cost heat-sealable sheet materials through heat-pressing and heat-sealing processes. Thus, they can be ultra-thin when not actuated, making them flexible, lightweight, and portable. The skeleton patterns are programmable, enabling a variety of motions, including contracting, bending, twisting, and rotating, based on specific skeleton designs. We conducted comprehensive experimental, theoretical, and numerical studies to investigate IN-FOAM's basic mechanical behavior and properties. The results show that IN-FOAM's output force and contraction can be tuned through multiple operation modes with the applied hybrid positive-negative pressure. Additionally, we propose multilayer skeleton structures to enhance the contraction ratio further, and we demonstrate a multi-channel skeleton approach that allows the integration of multiple motion modes into a single IN-FOAM. These findings indicate that IN-FOAMs hold great potential for future applications in flexible wearable devices and compact soft robotic systems.
comment: 20 pages, 17 figures. Published in IEEE Transactions on Robotics
AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation
In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their deployment in resource-constrained environments remains challenging because of the computational cost of standard transformer backbones. To overcome this limitation, we propose AnoleVLA, a lightweight VLA that uses a deep state space model to process multimodal sequences efficiently. The model leverages its lightweight and fast sequential state modeling to process visual and textual inputs, which allows the robot to generate trajectories efficiently. We evaluated the proposed method in both simulation and physical experiments. Notably, in real-world evaluations, AnoleVLA outperformed a representative large-scale VLA by 21 points for the task success rate while achieving an inference speed approximately three times faster.
CycleRL: Sim-to-Real Deep Reinforcement Learning for Robust Autonomous Bicycle Control
Autonomous bicycles offer a promising agile solution for urban mobility and last-mile logistics; however, conventional control strategies often struggle with their underactuated nonlinear dynamics, suffering from sensitivity to model mismatches and limited adaptability to real-world uncertainties. To address this, this paper presents CycleRL, the first sim-to-real deep reinforcement learning framework designed for robust autonomous bicycle control. Our approach trains an end-to-end neural control policy within the high-fidelity NVIDIA Isaac Sim environment, leveraging Proximal Policy Optimization (PPO) to circumvent the need for an explicit dynamics model. The framework features a composite reward function tailored for concurrent balance maintenance, velocity tracking, and steering control. Crucially, systematic domain randomization is employed to bridge the simulation-to-reality gap and facilitate direct transfer. In simulation, CycleRL achieves strong performance, including a 99.90% balance success rate, a low steering tracking error of 1.15°, and a velocity tracking error of 0.18 m/s. These quantitative results, coupled with successful hardware transfer, validate DRL as an effective paradigm for autonomous bicycle control, offering superior adaptability over traditional methods. Video demonstrations are available at https://anony6f05.github.io/CycleRL/.
comment: 10 pages, 7 figures, 9 tables
Thermal Image Refinement with Depth Estimation using Recurrent Networks for Monocular ORB-SLAM3
Autonomous navigation in GPS-denied and visually degraded environments remains challenging for unmanned aerial vehicles (UAVs). To this end, we investigate the use of a monocular thermal camera as a standalone sensor on a UAV platform for real-time depth estimation and simultaneous localization and mapping (SLAM). To extract depth information from thermal images, we propose a novel pipeline employing a lightweight supervised network with recurrent blocks (RBs) integrated to capture temporal dependencies, enabling more robust predictions. The network combines lightweight convolutional backbones with a thermal refinement network (T-RefNet) to refine raw thermal inputs and enhance feature visibility. The refined thermal images and predicted depth maps are integrated into ORB-SLAM3, enabling thermal-only localization. Unlike previous methods, the network is trained on a custom non-radiometric dataset, obviating the need for high-cost radiometric thermal cameras. Experimental results on datasets and UAV flights demonstrate competitive depth accuracy and robust SLAM performance under low-light conditions. On the radiometric VIVID++ (indoor-dark) dataset, our method achieves an absolute relative error of approximately 0.06, compared to baselines exceeding 0.11. In our non-radiometric indoor set, baseline errors remain above 0.24, whereas our approach remains below 0.10. Thermal-only ORB-SLAM3 maintains a mean trajectory error under 0.4 m.
comment: 8 pages, 8 figures, 2 tables
ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion Policy
Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework synergizing standardized perspective reprojection with a structure-aware dual-stream diffusion policy. By coupling the re-projected views with pixel-aligned PointMaps, our dual-stream architecture leverages learnable modality embeddings to fuse frozen semantic features and explicit geometric descriptors, ensuring precise implicit patch-level alignment. Extensive experiments across simulation and real-world environments demonstrate ReMAP-DP's superior performance in diverse manipulation tasks. On RoboTwin 2.0, it attains a 59.3% average success rate, outperforming the DP3 baseline by +6.6%. On ManiSkill 3, our method yields a 28% improvement over DP3 on the geometrically challenging Stack Cube task. Furthermore, ReMAP-DP exhibits remarkable real-world robustness, executing high-precision and dynamic manipulations with superior data efficiency from only a handful of demonstrations. Project page is available at: https://icr-lab.github.io/ReMAP-DP/
Voronoi-based Second-order Descriptor with Whitened Metric in LiDAR Place Recognition ICRA 26
The pooling layer plays a vital role in aggregating local descriptors into a metrizable global descriptor in LiDAR Place Recognition (LPR). In particular, second-order pooling is capable of capturing higher-order interactions among local descriptors. However, existing second-order methods in LPR adhere to conventional implementations and post-normalization, which render the descriptor unsuitable for Euclidean distance comparison. Based on the recent interpretation that associates NetVLAD with second-order statistics, we propose to integrate second-order pooling with the inductive bias from Voronoi cells. Our novel pooling method aggregates local descriptors to form the second-order matrix and whitens the global descriptor to implicitly measure the Mahalanobis distance while conserving the cluster property from Voronoi cells, addressing its numerical instability during learning with diverse techniques. We demonstrate its performance gains through experiments conducted on the Oxford Robotcar and Wild-Places benchmarks and analyze the numerical effect of the proposed whitening algorithm.
comment: Accepted at ICRA 26
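The abstract above states that whitening the global descriptor lets Euclidean distance implicitly measure the Mahalanobis distance. The following is a minimal, generic sketch of that identity only; it is not the paper's pooling method. The function name `whiten`, the regularizer `eps`, the mixing matrix `A`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def whiten(descriptors, eps=1e-6):
    """Whiten row-vector descriptors with the inverse square root of their covariance.

    Returns the whitened descriptors and the (regularized) covariance used.
    """
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    cov = centered.T @ centered / (len(descriptors) - 1)
    cov = cov + eps * np.eye(cov.shape[0])      # small ridge for numerical stability
    eigvals, eigvecs = np.linalg.eigh(cov)      # cov is symmetric positive definite
    w = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # symmetric cov^{-1/2}
    return centered @ w, cov

# Synthetic correlated descriptors (hypothetical data, fixed mixing matrix).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.3, 0.0],
              [0.0, 0.0, 1.5, 0.2],
              [0.0, 0.0, 0.0, 0.8]])
X = rng.normal(size=(500, 4)) @ A
Z, cov = whiten(X)

# Euclidean distance between two whitened descriptors ...
d_euclid_whitened = np.linalg.norm(Z[0] - Z[1])
# ... matches the Mahalanobis distance between the raw descriptors.
diff = X[0] - X[1]
d_mahalanobis = np.sqrt(diff @ np.linalg.inv(cov) @ diff)
print(d_euclid_whitened, d_mahalanobis)
```

The two printed values agree up to floating-point error because the whitening matrix is a symmetric square root of the inverse covariance, so the squared norm of a whitened difference equals the quadratic form that defines the Mahalanobis distance.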
Learning from Mistakes: Post-Training for Driving VLA with Takeover Data
Current Vision-Language-Action (VLA) paradigms in end-to-end autonomous driving rely on offline training from static datasets, leaving them vulnerable to distribution shift. Recent post-training methods use takeover data to mitigate this by augmenting the dataset with high-quality expert takeover samples, yet they suffer from two key limitations: supervision restricted to the period after the takeover moments leads to policies with limited safety margins, and passive preference optimization lacks active exploration for optimal performance. In this paper, we propose TakeVLA, a novel VLA post-training framework that overcomes these shortcomings through two complementary innovations. First, we introduce pre-takeover language supervision, which allows the VLA to learn from mistakes proactively. By explicitly teaching the model about what to do in error-prone situations, we cultivate a precautionary mindset that anticipates hazards early and substantially enlarges safety margins. Second, we propose Scenario Dreaming, a reinforcement fine-tuning paradigm that operates in reconstructed takeover scenarios, encouraging active exploration beyond mere preference fitting. Experiments on the Bench2Drive benchmark demonstrate that TakeVLA achieves state-of-the-art closed-loop performance, surpassing the strong VLA baseline SimLingo by 4.93 in driving score, with an enhanced safety margin as evidenced by an 11.76% increase in average TTC.
Intelligent Control of Differential Drive Robots Subject to Unmodeled Dynamics with EKF-based State Estimation
Reliable control and state estimation of differential drive robots (DDR) operating in dynamic and uncertain environments remains a challenge, particularly when system dynamics are partially unknown and sensor measurements are prone to degradation. This work introduces a unified control and state estimation framework that combines a Lyapunov-based nonlinear controller and Adaptive Neural Networks (ANN) with Extended Kalman Filter (EKF)-based multi-sensor fusion. The proposed controller leverages the universal approximation property of neural networks to model unknown nonlinearities in real time. An online adaptation scheme updates the weights of the radial basis function (RBF), the architecture chosen for the ANN. The learned dynamics are integrated into a feedback linearization (FBL) control law, for which theoretical guarantees of closed-loop stability and asymptotic convergence in a trajectory-tracking task are established through a Lyapunov-like stability analysis. To ensure robust state estimation, the EKF fuses inertial measurement unit (IMU) and odometry from monocular, 2D-LiDAR and wheel encoders. The fused state estimate drives the intelligent controller, ensuring consistent performance even under drift, wheel slip, sensor noise and failure. Gazebo simulations and real-world experiments are conducted on a DDR, demonstrating the effectiveness of the approach in terms of improved velocity tracking performance, with reductions in linear and angular velocity errors of up to $53.91\%$ and $29.0\%$ in comparison to the baseline FBL.
Transformers As Generalizable Optimal Controllers
We study whether optimal state-feedback laws for a family of heterogeneous Multiple-Input, Multiple-Output (MIMO) Linear Time-Invariant (LTI) systems can be captured by a single learned controller. We train one transformer policy on LQR-generated trajectories from systems with different state and input dimensions, using a shared representation with standardization, padding, dimension encoding, and masked loss. The policy maps recent state history to control actions without requiring plant matrices at inference time. Across a broad set of systems, it achieves empirically small sub-optimality relative to Linear Quadratic Regulator (LQR), remains stabilizing under moderate parameter perturbations, and benefits from lightweight fine-tuning on unseen systems. These results support transformer policies as practical approximators of near-optimal feedback laws over structured linear-system families.
comment: 6 pages
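The abstract above mentions training one policy across systems with different state and input dimensions using padding and a masked loss. A minimal sketch of that idea follows, assuming zero-padding to a fixed width and a mean-squared error taken over valid entries only; `MAX_INPUT`, `pad`, `masked_mse`, and the example values are hypothetical and are not taken from the paper.

```python
import numpy as np

MAX_INPUT = 3  # hypothetical maximum input dimension across the system family

def pad(vec, size):
    """Zero-pad a vector to `size` and return it with a 0/1 validity mask."""
    out = np.zeros(size)
    mask = np.zeros(size)
    out[:len(vec)] = vec
    mask[:len(vec)] = 1.0
    return out, mask

def masked_mse(pred, target, mask):
    """Mean squared error over valid (unpadded) entries only."""
    return float(np.sum(mask * (pred - target) ** 2) / np.sum(mask))

# A system with 2 inputs, padded to the shared width of 3.
u_target, u_mask = pad(np.array([0.5, -1.0]), MAX_INPUT)
# The policy output has the shared width; its last entry falls in the padding.
u_pred = np.array([0.4, -0.8, 9.9])
loss = masked_mse(u_pred, u_target, u_mask)
print(loss)
```

Because the mask zeroes out the padded slot, the large value in the third output entry does not contribute to the loss, so systems with fewer inputs are not penalized for whatever the network emits in unused dimensions.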
PerlAD: Towards Enhanced Closed-loop End-to-end Autonomous Driving with Pseudo-simulation-based Reinforcement Learning
End-to-end autonomous driving policies based on Imitation Learning (IL) often struggle in closed-loop execution due to the misalignment between inadequate open-loop training objectives and real driving requirements. While Reinforcement Learning (RL) offers a solution by directly optimizing driving goals via reward signals, the rendering-based training environments introduce the rendering gap and are inefficient due to high computational costs. To overcome these challenges, we present a novel Pseudo-simulation-based RL method for closed-loop end-to-end autonomous driving, PerlAD. Based on offline datasets, PerlAD constructs a pseudo-simulation that operates in vector space, enabling efficient, rendering-free trial-and-error training. To bridge the gap between static datasets and dynamic closed-loop environments, PerlAD introduces a prediction world model that generates reactive agent trajectories conditioned on the ego vehicle's plan. Furthermore, to facilitate efficient planning, PerlAD utilizes a hierarchical decoupled planner that combines IL for lateral path generation and RL for longitudinal speed optimization. Comprehensive experimental results demonstrate that PerlAD achieves state-of-the-art performance on the Bench2Drive benchmark, surpassing the previous E2E RL method by 10.29% in Driving Score without requiring expensive online interactions. Additional evaluations on the DOS benchmark further confirm its reliability in handling safety-critical occlusion scenarios.
comment: Accepted by IEEE RA-L. Submitted: 2025.12.2; Revised: 2026.2.4; Accepted: 2026.3.7
From Folding Mechanics to Robotic Function: A Unified Modeling Framework for Compliant Origami
Origami-inspired architectures offer a powerful route toward lightweight, reconfigurable, and programmable robotic systems. Yet, a unified mechanics framework capable of seamlessly bridging rigid folding, elastic deformation, and stability-driven transitions in compliant origami remains lacking. Here, we introduce a geometry-consistent modeling framework based on discrete differential geometry (DDG) that unifies panel elasticity and crease rotation within a single variational formulation. By embedding crease-panel coupling directly into a mid-edge geometric discretization, the framework naturally captures rigid folding limits, distributed bending, multistability, and nonlinear dynamic snap-through within one mechanically consistent structure. This unified description enables programmable control of stability and deformation across rigid and compliant regimes, allowing origami structures to transition from static folding mechanisms to active robotic modules. An implicit dynamic formulation incorporating gravity, contact, friction, and magnetic actuation further supports strongly coupled multiphysics simulations. Through representative examples spanning single-fold bifurcation, deployable Miura membranes, bistable Waterbomb modules, and Kresling-based crawling robots, we demonstrate how geometry-driven mechanics directly informs robotic functionality. This work establishes discrete differential geometry as a foundational design language for intelligent origami robotics, enabling predictive modeling, stability programming, and mechanics-guided robotic actuation within a unified computational platform.
comment: 24 pages, 7 figures
ViSA: Visited-State Augmentation for Generalized Goal-Space Contrastive Reinforcement Learning
Goal-Conditioned Reinforcement Learning (GCRL) is a framework for learning a policy that can reach arbitrarily given goals. In particular, Contrastive Reinforcement Learning (CRL) provides a framework for policy updates using an approximation of the value function estimated via contrastive learning, achieving higher sample efficiency compared to conventional methods. However, since CRL treats the visited state as a pseudo-goal during learning, it can accurately estimate the value function only for limited goals. To address this issue, we propose a novel data augmentation approach for CRL called ViSA (Visited-State Augmentation). ViSA consists of two components: 1) generating augmented state samples, with the aim of augmenting hard-to-visit state samples during on-policy exploration, and 2) learning a consistent embedding space, which uses an augmented state as auxiliary information to regularize the embedding space by reformulating the objective function of the embedding space based on mutual information. We evaluate ViSA in simulation and real-world robotic tasks and show improved goal-space generalization, which permits accurate value estimation for hard-to-visit goals. Further details can be found on the project page: https://issa-n.github.io/projectPage_ViSA/
comment: 8 pages, 7 figures, under Review
Surgical Robot, Path Planning, Joint Space, Riemannian Manifolds
Robotic surgery for minimally invasive surgery can reduce the surgeon's workload by autonomously guiding robotic forceps. Movement of the robot is restricted around a fixed insertion port. The robot often encounters angle limitations during operation. Also, the surface of the abdominal cavity is non-concave, making it computationally expensive to find the desired path. In this work, to solve these problems, we propose a method for path planning in joint space by transforming the position into a Riemannian manifold. An edge cost function is defined to search for a desired path in the joint space and reduce the range of motion of the joints. We found that the organ is mostly non-concave, making it easy to find the optimal path using the gradient descent method. Experimental results demonstrated that the proposed method reduces the range of joint angle movement compared to calculations in position space.
comment: 11 pages, 8 figures
AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. Demonstration videos and qualitative results are available on the project page: https://automot-website.github.io/
Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning
Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
A Unified Calibration Framework for Coordinate and Kinematic Parameters in Dual-Arm Robots
Precise collaboration in vision-based dual-arm robot systems requires accurate system calibration. Recent dual-robot calibration methods have achieved strong performance by simultaneously solving multiple coordinate transformations. However, these methods either treat kinematic errors as implicit noise or handle them through separated error modeling, resulting in non-negligible accumulated errors. In this paper, we present a novel framework for unified calibration of the coordinate transformations and kinematic parameters in both robot arms. Our key idea is to unify all the tightly coupled parameters within a single Lie-algebraic formulation. To this end, we construct a consolidated error model grounded in the product-of-exponentials formula, which naturally integrates the coordinate and kinematic parameters in twist forms. Our model introduces no artificial error separation and thus greatly mitigates the error propagation. In addition, we derive a closed-form analytical Jacobian from this model using Lie derivatives. By exploring the Jacobian rank property, we analyze the identifiability of all calibration parameters and show that our joint optimization is well-posed under mild conditions. This enables off-the-shelf iterative solvers to stably optimize these parameters on the manifold space. Besides, to ensure robust convergence of our joint optimization, we develop a certifiably correct algorithm for initializing the unknown coordinates. Relying on semidefinite relaxation, our algorithm can yield a reliable estimate whose near-global optimality can be verified a posteriori. Extensive experiments validate the superior accuracy of our approach over previous baselines under identical visual measurements. Meanwhile, our certifiable initialization consistently outperforms several coordinate-only baselines, proving its reliability as a starting point for joint optimization.
comment: 21 pages, 12 figures
HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System
LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, most zero-shot methods primarily rely on closed-source LLMs as navigators, which face challenges related to high token costs and potential data leakage risks. Recent efforts have attempted to address this by using open-source LLMs combined with a spatiotemporal CoT framework, but they still fall far short compared to closed-source models. In this work, we identify a critical issue, Navigation Amnesia, through a detailed analysis of the navigation process. This issue leads to navigation failures and amplifies the gap between open-source and closed-source methods. To address this, we propose HiMemVLN, which incorporates a Hierarchical Memory System into a multimodal large model to enhance visual perception recall and long-term localization, mitigating the amnesia issue and improving the agent's navigation performance. Extensive experiments in both simulated and real-world environments demonstrate that HiMemVLN achieves nearly twice the performance of the open-source state-of-the-art method. The code is available at https://github.com/lvkailin0118/HiMemVLN.
comment: 9 pages, 7 figures
Global Truncated Loss Minimization for Robust and Threshold-Resilient Geometric Estimation
To achieve outlier-robust geometric estimation, robust objective functions are generally employed to mitigate the influence of outliers. The widely used consensus maximization (CM) is highly robust when paired with global branch-and-bound (BnB) search. However, CM relies solely on inlier counts and is sensitive to the inlier threshold. Besides, the discrete nature of CM leads to loose bounds, necessitating extensive BnB iterations and computation cost. Truncated losses (TL), a continuous alternative, leverage residual information more effectively and could potentially overcome these issues. But to our knowledge, no prior work has systematically explored globally minimizing TL with BnB and its potential for enhanced threshold resilience or search efficiency. In this work, we propose GTM, the first unified BnB-based framework for globally optimal TL minimization across diverse geometric problems. GTM involves a hybrid solving design: given an n-dimensional problem, it performs BnB search over an (n-1)-dimensional subspace while the remaining 1D variable is solved by bounding the objective function. Our hybrid design not only reduces the search space, but also enables us to derive Lipschitz-continuous bounding functions that are general, tight, and can be efficiently solved by a classic global Lipschitz solver named DIRECT, which brings further acceleration. We conduct a systematic evaluation on various BnB-based methods for CM and TL on the robust linear regression problem, showing that GTM enjoys remarkable threshold resilience and the highest efficiency compared to baseline methods. Furthermore, we apply GTM on different geometric estimation problems with diverse residual forms. Extensive experiments demonstrate that GTM achieves state-of-the-art outlier-robustness and threshold-resilience while maintaining high efficiency across these estimation tasks.
comment: 19 pages, 10 figures
GraspALL: Adaptive Structural Compensation from Illumination Variation for Robotic Garment Grasping in Any Low-Light Conditions
Achieving accurate garment grasping under dynamically changing illumination is crucial for all-day operation of service robots. However, the reduced illumination in low-light scenes severely degrades garment structural features, leading to a significant drop in grasping robustness. Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model's adaptability to illumination changes when utilizing multimodal information. To address this problem, we propose GraspALL, an illumination-structure interactive compensation model. The innovation of GraspALL lies in encoding continuous illumination changes into quantitative references to guide adaptive feature fusion between RGB and non-RGB modalities according to varying lighting intensities, thereby generating illumination-consistent grasping representations. Experiments on the self-built garment grasping dataset demonstrate that GraspALL improves grasping accuracy by 32-44% over baselines under diverse illumination conditions.
Exploring the dynamic properties and motion reproducibility of a small upper-body humanoid robot with 13-DOF pneumatic actuation for data-driven control
Pneumatically-actuated anthropomorphic robots with high degrees of freedom (DOF) offer significant potential for physical human-robot interaction. However, precise control of pneumatic actuators is challenging due to their inherent nonlinearities. This paper presents the development of a compact 13-DOF upper-body humanoid robot. To assess the feasibility of an effective controller, we first investigate its key dynamic properties, such as actuation time delays, and confirm that the system exhibits highly reproducible behavior. Leveraging this reproducibility, we implement a preliminary data-driven controller for a 4-DOF arm subsystem based on a multilayer perceptron with explicit time delay compensation. The network was trained on random movement data to generate pressure commands for tracking arbitrary trajectories. Comparative evaluations with a traditional PID controller demonstrate superior trajectory tracking performance, highlighting the potential of data-driven approaches for controlling complex, high-DOF pneumatic robots.
comment: 24 pages, 21 figures. Submitted to Advanced Robotics
CORAL: COntextual Reasoning And Local Planning in A Hierarchical VLM Framework for Underwater Monitoring IROS 2026
Oyster reefs are critical ecosystem species that sustain biodiversity, filter water, and protect coastlines, yet they continue to decline globally. Restoring these ecosystems requires regular underwater monitoring to assess reef health, a task that remains costly, hazardous, and limited when performed by human divers. Autonomous underwater vehicles (AUVs) offer a promising alternative, but existing AUVs rely on geometry-based navigation that cannot interpret scene semantics. Recent vision-language models (VLMs) enable semantic reasoning for intelligent exploration, but existing VLM-driven systems adopt an end-to-end paradigm, introducing three key limitations. First, these systems require the VLM to generate every navigation decision, forcing frequent waits for inference. Second, VLMs cannot model robot dynamics, causing collisions in cluttered environments. Third, limited self-correction allows small deviations to accumulate into large path errors. To address these limitations, we propose CORAL, a framework that decouples high-level semantic reasoning from low-level reactive control. The VLM provides high-level exploration guidance by selecting waypoints, while a dynamics-based planner handles low-level collision-free execution. A geometric verification module validates waypoints and triggers replanning when needed. Compared with the previous state-of-the-art, CORAL improves coverage by 14.28 percentage points (17.85% in relative terms), reduces collisions by 100%, and requires 57% fewer VLM calls.
comment: Submitted to IROS 2026
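The CORAL abstract above reports its coverage gain both in percentage points and in relative terms. As a small worked check of how the two figures relate, the sketch below back-computes the baseline coverage they jointly imply, assuming both numbers refer to the same coverage metric. The resulting baseline and CORAL coverage values are inferred from this arithmetic and are not stated in the abstract; the variable names are hypothetical.

```python
# Figures quoted in the abstract.
abs_gain_pp = 14.28   # absolute gain, in percentage points
rel_gain = 0.1785     # relative gain (17.85%)

# relative gain = absolute gain / baseline  =>  baseline = absolute gain / relative gain
baseline_coverage = abs_gain_pp / rel_gain
coral_coverage = baseline_coverage + abs_gain_pp

print(round(baseline_coverage, 2), round(coral_coverage, 2))
```

Under that assumption the two reported figures are mutually consistent with a baseline coverage of roughly 80% and a CORAL coverage of roughly 94%.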
LiDAR-EVS: Enhance Extrapolated View Synthesis for 3D Gaussian Splatting with Pseudo-LiDAR Supervision
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time LiDAR and camera synthesis in autonomous driving simulation. However, simulating LiDAR with 3DGS remains challenging for extrapolated views beyond the training trajectory, as existing methods, typically trained on single-traversal sensor scans, suffer from severe overfitting and poor generalization to novel ego-vehicle paths. To enable reliable simulation of LiDAR along unseen driving trajectories without external multi-pass data, we present LiDAR-EVS, a lightweight framework for robust extrapolated-view LiDAR simulation in autonomous driving. Designed to be plug-and-play, LiDAR-EVS readily extends to diverse LiDAR sensors and neural rendering baselines with minimal modification. Our framework comprises two key components: (1) pseudo extrapolated-view point cloud supervision with multi-frame LiDAR fusion, view transformation, occlusion curling, and intensity adjustment; (2) spatially-constrained dropout regularization that promotes robustness to diverse trajectory variations encountered in real-world driving. Extensive experiments demonstrate that LiDAR-EVS achieves SOTA performance on extrapolated-view LiDAR synthesis across three datasets, making it a promising tool for data-driven simulation, closed-loop evaluation, and synthetic data generation in autonomous driving systems.
comment: 22 pages, 8 figures
Efficient Event Camera Volume System ICRA 2026
Event cameras promise low latency and high dynamic range, yet their sparse output challenges integration into standard robotic pipelines. We introduce EECVS (Efficient Event Camera Volume System), a novel framework that models event streams as continuous-time Dirac impulse trains, enabling artifact-free compression through direct transform evaluation at event timestamps. Our key innovation combines density-driven adaptive selection among DCT, DTFT, and DWT transforms with transform-specific coefficient pruning strategies tailored to each domain's sparsity characteristics. The framework eliminates temporal binning artifacts while automatically adapting compression strategies based on real-time event density analysis. On EHPT-XC and MVSEC datasets, our framework achieves superior reconstruction fidelity with DTFT delivering the lowest earth mover distance. In downstream segmentation tasks, EECVS demonstrates robust generalization. Notably, our approach demonstrates exceptional cross-dataset generalization: when evaluated with EventSAM segmentation, EECVS achieves mean IoU 0.87 on MVSEC versus 0.44 for voxel grids at 24 channels, while remaining competitive on EHPT-XC. Our ROS2 implementation provides real-time deployment with DCT processing achieving 1.5 ms latency and 2.7X higher throughput than alternative transforms, establishing the first adaptive event compression framework that maintains both computational efficiency and superior generalization across diverse robotic scenarios.
comment: Accepted to ICRA 2026
A Dual Quaternion Framework for Collision Recovery of Quadrotor
Unmanned aerial vehicles (UAVs) operating in cluttered environments require accurate impact modeling to maintain stability. However, conventional contact models decouple linear and angular impulses, risking manifold inconsistency during rapid state transitions. This article presents a dual quaternion reset map that resolves rigid-body impacts directly on the SE(3) manifold. By operating on the unified spatial twist (linear and angular velocities as a single dual entity), our formulation is algebraically equivalent to the classical Newton impulse model while preserving manifold consistency during discrete state jumps. Building on this framework, we design a hybrid recovery controller that couples linear and angular momentum to ensure strict energy dissipation across impacts. Hardware-in-the-loop benchmarks demonstrate a 24% reduction in execution latency compared to an optimized matrix-based implementation. High-fidelity MuJoCo simulations validate the controller's robustness to complex contact dynamics, showing a 56.6% reduction in post-impact root-mean-square error (RMSE) and a 41.2% decrease in peak kinetic energy compared to decoupled recovery methods.
comment: 7 pages, 5 figures
FlatLands: Generative Floormap Completion From a Single Egocentric View
A single egocentric image typically captures only a small portion of the floor, yet a complete metric traversability map of the surroundings would better serve applications such as indoor navigation. We introduce FlatLands, a dataset and benchmark for single-view bird's-eye view (BEV) floor completion. The dataset contains 270,575 observations from 17,656 real metric indoor scenes drawn from six existing datasets, with aligned observation, visibility, validity, and ground-truth BEV maps, and the benchmark includes both in- and out-of-distribution evaluation protocols. We compare training-free approaches, deterministic models, ensembles, and stochastic generative models. Finally, we instantiate the task as an end-to-end monocular RGB-to-floormaps pipeline. FlatLands provides a rigorous testbed for uncertainty-aware indoor mapping and generative completion for embodied navigation.
comment: Under review
Safety Case Patterns for VLA-based driving systems: Insights from SimLingo
Vision-Language-Action (VLA)-based driving systems represent a significant paradigm shift in autonomous driving since, by combining traffic scene understanding, linguistic interpretation, and action generation, these systems enable more flexible, adaptive, and instruction-responsive driving behaviors. However, despite their growing adoption and potential to support socially responsible autonomous driving while understanding high-level human instructions, VLA-based driving systems may exhibit new types of hazardous behaviors. For example, adding natural language inputs (e.g., user or navigation instructions) to the multimodal control loop may lead to unpredictable and unsafe behaviors that could endanger vehicle occupants and pedestrians. Hence, assuring the safety of these systems is crucial to help build trust in their operations. To support this, we propose a novel safety case design approach called RAISE. Our approach introduces novel patterns tailored to instruction-based driving systems such as VLA-based driving systems, an extension of Hazard Analysis and Risk Assessment (HARA) detailing safe scenarios and their outcomes, and a design technique to create the safety cases of VLA-based driving systems. A case study on SimLingo illustrates how our approach can be used to construct rigorous, evidence-based safety claims for this emerging class of autonomous driving systems.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.
Gaze-Aware Task Progression Detection Framework for Human-Robot Interaction Using RGB Cameras
In human-robot interaction (HRI), detecting a human's gaze helps robots interpret user attention and intent. However, most gaze detection approaches rely on specialized eye-tracking hardware, limiting deployment in everyday settings. Appearance-based gaze estimation methods remove this dependency by using standard RGB cameras, but their practicality in HRI remains underexplored. We present a calibration-free framework for detecting task progression when information is conveyed via integrated display interfaces. The framework uses only the robot's built-in monocular RGB camera (640x480 resolution) and state-of-the-art gaze estimation to monitor attention patterns. It leverages natural behavior, where users shift focus from task interfaces to the robot's face to signal task completion, formalized through three Areas of Interest (AOI): tablet, robot face, and elsewhere. Systematic parameter optimization identifies configurations that balance detection accuracy and interaction latency. We validate our framework in a "First Day at Work" scenario, comparing it to button-based interaction. Results show a task completion detection accuracy of 77.6%. Compared to button-based interaction, the proposed system exhibits slightly higher response latency but preserves information retention and significantly improves comfort, social presence, and perceived naturalness. Notably, most participants reported that they did not consciously use eye movements to guide the interaction, underscoring the intuitive role of gaze as a communicative cue. This work demonstrates the feasibility of intuitive, low-cost, RGB-only gaze-based HRI for natural and engaging interactions.
comment: 9 pages, 7 figures. This article has been accepted for publication in IEEE Robotics and Automation Letters
AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?
comment: 19 figures, 6 tables, including appendix
Resilience Meets Autonomy: Governing Embodied AI in Critical Infrastructure
Critical infrastructure increasingly incorporates embodied AI for monitoring, predictive maintenance, and decision support. However, AI systems designed to handle statistically representable uncertainty struggle with cascading failures and crisis dynamics that exceed their training assumptions. This paper argues that embodied AI's resilience depends on bounded autonomy within a hybrid governance architecture. We outline four oversight modes and map them to critical infrastructure sectors based on task complexity, risk level, and consequence severity. Drawing on the EU AI Act, ISO safety standards, and crisis management research, we argue that effective governance requires a structured allocation of machine capability and human judgement.
comment: 6 pages
Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models ICLR 2026
Behavioral Foundation Models (BFMs) produce agents with the capability to adapt to any unknown reward or task. These methods, however, are only able to produce near-optimal policies for the reward functions that are in the span of some pre-existing state features, making the choice of state features crucial to the expressivity of the BFM. As a result, BFMs are trained using a variety of complex objectives and require sufficient dataset coverage to train task-useful spanning features. In this work, we examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity and, subsequently, reducing span. We propose an approach, Regularized Latent Dynamics Prediction (RLDP), that adds a simple orthogonality regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we empirically show that prior approaches perform poorly in low-coverage scenarios where RLDP still succeeds.
comment: ICLR 2026
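A minimal sketch of the idea in the RLDP abstract above: a latent next-state prediction loss combined with an orthogonality penalty on the state features. The abstract does not give the exact form of the regularizer, so the penalty below (squared deviation of the feature Gram matrix from the identity) is an assumption, and the names `latent_prediction_loss`, `orthogonality_penalty`, `rldp_style_loss`, and `reg_weight` are hypothetical.

```python
def latent_prediction_loss(predicted_next, actual_next):
    """Mean squared error between predicted and actual next-state features."""
    total, count = 0.0, 0
    for pred_row, true_row in zip(predicted_next, actual_next):
        for p, t in zip(pred_row, true_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def orthogonality_penalty(features):
    """Assumed regularizer: squared distance of (Phi^T Phi) / n from the identity.

    `features` is a batch of n feature vectors of dimension d. The penalty is
    zero when feature dimensions are uncorrelated with unit second moment, and
    grows as dimensions become similar (the collapse the abstract describes).
    """
    n = len(features)
    d = len(features[0])
    penalty = 0.0
    for i in range(d):
        for j in range(d):
            gram_ij = sum(row[i] * row[j] for row in features) / n
            target = 1.0 if i == j else 0.0
            penalty += (gram_ij - target) ** 2
    return penalty


def rldp_style_loss(predicted_next, actual_next, features, reg_weight=1.0):
    """Prediction loss plus weighted orthogonality penalty (illustrative only)."""
    return (latent_prediction_loss(predicted_next, actual_next)
            + reg_weight * orthogonality_penalty(features))
```

With diverse features such as `[[1, 1], [1, -1]]` the penalty vanishes, while collapsed features such as `[[1, 1], [1, 1]]` are penalized, which is the behavior the regularization is meant to encourage.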
FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding
We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding by applying FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore, we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.
comment: 14 pages, 7 figures
Robust Dynamic Object Detection in Cluttered Indoor Scenes via Learned Spatiotemporal Cues
Reliable dynamic object detection in cluttered environments remains a critical challenge for autonomous navigation. Purely geometric LiDAR pipelines that rely on clustering and heuristic filtering can miss dynamic obstacles when they move in close proximity to static structure or are only partially observed. Vision-augmented approaches can provide additional semantic cues, but are often limited by closed-set detectors and camera field-of-view constraints, reducing robustness to novel obstacles and out-of-frustum events. In this work, we present a LiDAR-only framework that fuses temporal occupancy-grid-based motion segmentation with a learned bird's-eye-view (BEV) dynamic prior. A fusion module prioritizes 3D detections when available, while using the learned dynamic grid to recover detections that would otherwise be lost due to proximity-induced false negatives. Experiments with motion-capture ground truth show our method achieves 28.67% higher recall and 18.50% higher F1 score than the state-of-the-art in substantially cluttered environments while maintaining comparable precision and position error.
Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning
Reinforcement learning in massively parallel physics simulations has driven major progress in sim-to-real robot learning. However, current approaches remain brittle and task-specific, relying on extensive per-task engineering to design rewards, curricula, and demonstrations. Even with this engineering, they often fail on long-horizon, contact-rich manipulation tasks and do not meaningfully scale with compute, as performance quickly saturates when training revisits the same narrow regions of state space. We introduce a simple and scalable framework that enables on-policy reinforcement learning to robustly solve a broad class of dexterous manipulation tasks using a single reward function, fixed algorithm hyperparameters, no curricula, and no human demonstrations. Our key insight is that long-horizon exploration can be dramatically simplified by using simulator resets to systematically expose the RL algorithm to the diverse set of robot-object interactions which underlie dexterous manipulation. The framework programmatically generates such resets with minimal human input, converting additional compute directly into broader behavioral coverage and continued performance gains. We show that the framework scales gracefully to long-horizon dexterous manipulation tasks beyond the capabilities of existing approaches and is able to learn robust policies over significantly wider ranges of initial conditions than baselines. Finally, we distill the learned policies into visuomotor policies that display robust retrying behavior and substantially higher success rates than baselines when transferred to the real world zero-shot. Project webpage: https://omnireset.github.io
CorrectionPlanner: Self-Correction Planner with Reinforcement Learning in Autonomous Driving
Autonomous driving requires safe planning, but most learning-based planners lack explicit self-correction ability: once an unsafe action is proposed, there is no mechanism to correct it. Thus, we propose CorrectionPlanner, an autoregressive planner with self-correction that models planning as motion-token generation within a propose, evaluate, and correct loop. At each planning step, the policy proposes an action, namely a motion token, and a learned collision critic predicts whether it will induce a collision within a short horizon. If the critic predicts a collision, we retain the sequence of historical unsafe motion tokens as a self-correction trace, generate the next motion token conditioned on it, and repeat this process until a safe motion token is proposed or the safety criterion is met. This self-correction trace, consisting of all unsafe motion tokens, represents the planner's correction process in motion-token space, analogous to a reasoning trace in language models. We train the planner with imitation learning followed by model-based reinforcement learning using rollouts from a pretrained world model that realistically models agents' reactive behaviors. Closed-loop evaluations show that CorrectionPlanner reduces collision rate by over 20% on Waymax and achieves state-of-the-art planning scores on nuPlan.
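The propose, evaluate, and correct loop described in the CorrectionPlanner abstract above can be sketched as follows. This is a toy illustration, not the authors' implementation: `propose` and `predicts_collision` are hypothetical callables standing in for the autoregressive policy and the learned collision critic, and the `max_corrections` budget is an assumption, since the abstract does not say how the loop terminates when no safe token is found.

```python
def plan_step(propose, predicts_collision, max_corrections=5):
    """Run one planning step with self-correction.

    propose(trace) -> motion token, conditioned on previously rejected tokens.
    predicts_collision(token) -> True if the critic flags the token as unsafe.
    Returns (token, trace, is_safe), where trace lists the rejected tokens.
    """
    trace = []  # self-correction trace: unsafe tokens proposed so far
    for _ in range(max_corrections + 1):
        token = propose(trace)
        if not predicts_collision(token):
            return token, trace, True
        trace.append(token)
    # Correction budget exhausted: return the last proposal, flagged as unsafe.
    return trace[-1], trace, False


# Toy usage: the policy proposes token ids 0, 1, 2, ... and the critic
# rejects every token id below 2.
token, trace, is_safe = plan_step(lambda tr: len(tr), lambda t: t < 2)
```

In this toy run the planner rejects tokens 0 and 1, keeps them in the trace, and accepts token 2.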
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
Simulation-to-real transfer remains a central challenge in robotics, as mismatches between simulated and real-world dynamics often lead to failures. While reinforcement learning offers a principled mechanism for adaptation, existing sim-to-real finetuning methods struggle with exploration and long-horizon credit assignment in the low-data regimes typical of real-world robotics. We introduce Simulation Distillation (SimDist), a sim-to-real framework that distills structural priors from a simulator into a latent world model and enables rapid real-world adaptation via online planning and supervised dynamics finetuning. By transferring reward and value models directly from simulation, SimDist provides dense planning signals from raw perception without requiring value learning during deployment. As a result, real-world adaptation reduces to short-horizon system identification, avoiding long-horizon credit assignment and enabling fast, stable improvement. Across precise manipulation and quadruped locomotion tasks, SimDist substantially outperforms prior methods in data efficiency, stability, and final performance. Project website and code: https://sim-dist.github.io/
comment: Project website: https://sim-dist.github.io/
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
What happens when a pretrained generative robot policy is provided a constant initial noise as input, rather than repeatedly sampling it from a Gaussian? We demonstrate that the performance of a pretrained, frozen diffusion or flow matching policy can be improved with respect to a downstream reward by swapping the sampling of initial noise from the prior distribution (typically isotropic Gaussian) with a well-chosen, constant initial noise input -- a golden ticket. We propose a search method to find golden tickets using Monte-Carlo policy evaluation that keeps the pretrained policy frozen, does not train any new networks, and is applicable to all diffusion/flow matching policies (and therefore many VLAs). Our approach to policy improvement makes no assumptions beyond being able to inject initial noise into the policy and calculate (sparse) task rewards of episode rollouts, making it deployable with no additional infrastructure or models. Our method improves the performance of policies in 38 out of 43 tasks across simulated and real-world robot manipulation benchmarks, with relative improvements in success rate of up to 58% for some simulated tasks, and 60% within 50 search episodes for real-world tasks. We also show unique benefits of golden tickets for multi-task settings: the diversity of behaviors from different tickets naturally defines a Pareto frontier for balancing different objectives (e.g., speed, success rates); in VLAs, we find that a golden ticket optimized for one task can also boost performance in other related tasks. We release a codebase with pretrained policies and golden tickets for simulation benchmarks using VLAs, diffusion policies, and flow matching policies.
comment: 13 pages, 9 figures
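As a rough sketch of the search described in the golden-ticket abstract above: sample candidate initial-noise vectors from the Gaussian prior, score each by Monte-Carlo rollouts with the policy held fixed, and keep the best. The abstract does not detail the search procedure, so plain random search is an assumption here, and `rollout_reward` is a hypothetical toy stand-in for running one episode of a frozen policy and returning its sparse reward.

```python
import random


def rollout_reward(noise, rng):
    """Hypothetical stand-in for one episode with a frozen policy.

    Toy assumption: success is more likely when the constant noise input is
    close to a hidden target vector. Returns a sparse 0/1 reward.
    """
    target = [0.5, -0.25]
    dist2 = sum((n - t) ** 2 for n, t in zip(noise, target))
    p_success = 1.0 / (1.0 + dist2)
    return 1.0 if rng.random() < p_success else 0.0


def find_golden_ticket(num_candidates, episodes_per_candidate, dim, seed=0):
    """Random search over constant noise vectors, scored by mean episode reward."""
    rng = random.Random(seed)
    best_noise, best_score = None, -1.0
    for _ in range(num_candidates):
        # Candidate drawn from the same isotropic Gaussian prior the policy uses.
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Monte-Carlo policy evaluation: average sparse reward over rollouts.
        rewards = [rollout_reward(noise, rng) for _ in range(episodes_per_candidate)]
        score = sum(rewards) / episodes_per_candidate
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score


ticket, score = find_golden_ticket(num_candidates=20, episodes_per_candidate=10, dim=2)
```

No network is trained in this loop; only the noise input changes, which matches the abstract's claim that the pretrained policy stays frozen.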
S2Act: Simple Spiking Actor
Spiking neural networks (SNNs) and biologically-inspired learning mechanisms are attractive in mobile robotics, where the size and performance of onboard neural network policies are constrained by power and computational budgets. Existing SNN approaches, such as population coding, reward modulation, and hybrid artificial neural network (ANN)-SNN architectures, have shown promising results; however, they face challenges in complex, highly stochastic environments due to SNN sensitivity to hyperparameters and inconsistent gradient signals. To address these challenges, we propose simple spiking actor (S2Act), a computationally lightweight framework that deploys an RL policy using an SNN in three steps: (1) architect an actor-critic model based on an approximated network of rate-based spiking neurons, (2) train the network with gradients using compatible activation functions, and (3) transfer the trained weights into physical parameters of rate-based leaky integrate-and-fire (LIF) neurons for inference and deployment. By globally shaping LIF neuron parameters such that their rate-based responses approximate ReLU activations, S2Act effectively mitigates the vanishing gradient problem, while pre-constraining LIF response curves reduces reliance on complex SNN-specific hyperparameter tuning. We demonstrate our method in two multi-agent stochastic environments (capture-the-flag and parking) that capture the complexity of multi-robot interactions, and deploy our trained policies on physical TurtleBot platforms using Intel's Loihi neuromorphic hardware. Our experimental results show that S2Act outperforms relevant baselines in task performance and real-time inference in nearly all considered scenarios, highlighting its potential for rapid prototyping and efficient real-world deployment of SNN-based RL policies.
comment: This work has been submitted to the IEEE for possible publication
Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies
Deploying foundation models in embodied edge systems is fundamentally a systems problem, not just a problem of model compression. Real-time control must operate within strict size, weight, and power constraints, where memory traffic, compute latency, timing variability, and safety margins interact directly. The Deployment Gauntlet organizes these constraints into eight coupled barriers that determine whether embodied foundation models can run reliably in practice. Across representative edge workloads, autoregressive Vision-Language-Action policies are constrained primarily by memory bandwidth, whereas diffusion-based controllers are limited more by compute latency and sustained execution cost. Reliable deployment therefore depends on system-level co-design across memory, scheduling, communication, and model architecture, including decompositions that separate fast control from slower semantic reasoning.
GoalSwarm: Multi-UAV Semantic Coordination for Open-Vocabulary Object Navigation
Cooperative visual semantic navigation is a foundational capability for aerial robot teams operating in unknown environments. However, achieving robust open-vocabulary object-goal navigation remains challenging due to the computational constraints of deploying heavy perception models onboard and the complexity of decentralized multi-agent coordination. We present GoalSwarm, a fully decentralized multi-UAV framework for zero-shot semantic object-goal navigation. The UAVs collaboratively construct a shared, lightweight 2D top-down semantic occupancy map by projecting depth observations from aerial vantage points, eliminating the computational burden of full 3D representations while preserving essential geometric and semantic structure. The core contributions of GoalSwarm are threefold: (1) integration of a zero-shot foundation model, SAM3, for open-vocabulary detection and pixel-level segmentation, enabling open-vocabulary target identification without task-specific training; (2) a Bayesian Value Map that fuses multi-viewpoint detection confidences into a per-pixel goal-relevance distribution, enabling informed frontier scoring via Upper Confidence Bound (UCB) exploration; and (3) a decentralized coordination strategy combining semantic frontier extraction, cost-utility bidding with geodesic path costs, and spatial separation penalties to minimize redundant exploration across the swarm.
comment: 6 pages, 2 figures
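To make the Bayesian Value Map and UCB frontier scoring in the GoalSwarm abstract above more concrete, here is a minimal sketch for a single map cell. The abstract does not specify the fusion rule, so the Gaussian precision-weighted update is an assumption, as are the names `fuse`, `ucb_score`, and the exploration weight `beta`.

```python
import math


def fuse(mean, var, obs, obs_var):
    """Precision-weighted Gaussian fusion of a prior estimate with one observation.

    Assumed model: both the current goal-relevance estimate and the new
    detection confidence are treated as Gaussian with known variances.
    """
    new_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    new_mean = new_var * (mean / var + obs / obs_var)
    return new_mean, new_var


def ucb_score(mean, var, beta=1.0):
    """Upper Confidence Bound: favor cells that are promising or still uncertain."""
    return mean + beta * math.sqrt(var)


def best_frontier(frontiers, beta=1.0):
    """Pick the frontier id with the highest UCB score.

    `frontiers` maps a frontier id to its (mean, variance) goal-relevance estimate.
    """
    return max(frontiers, key=lambda fid: ucb_score(*frontiers[fid], beta=beta))
```

A larger `beta` pushes the swarm toward poorly observed frontiers, while a smaller one exploits frontiers whose fused detection confidence is already high.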
On transferring safety certificates across dynamical systems
Control barrier functions (CBFs) provide a powerful tool for enforcing safety constraints in control systems, but their direct application to complex, high-dimensional dynamics is often challenging. In many settings, safety certificates are more naturally designed for simplified or alternative system models that do not exactly match the dynamics of interest. This paper addresses the problem of transferring safety guarantees between dynamical systems with mismatched dynamics. We propose a transferred control barrier function (tCBF) framework that enables safety constraints defined on one system to be systematically enforced on another system using a simulation function and an explicit margin term. The resulting transferred barrier accounts for model mismatch and induces a safety condition that can be enforced on the target system via a quadratic-program-based safety filter. The proposed approach is general and does not require the two systems to share the same state dimension or dynamics. We demonstrate the effectiveness of the framework on a quadrotor navigation task with the transferred barrier ensuring collision avoidance for the target system, while remaining minimally invasive to a nominal controller. These results highlight the potential of transferred control barrier functions as a general mechanism for enforcing safety across heterogeneous dynamical systems.
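A quadratic-program-based safety filter of the kind mentioned in the abstract above has a closed-form solution when there is a single affine constraint on the input. The sketch below solves min ||u - u_nom||^2 subject to a·u + b - margin >= 0. How the paper constructs `a`, `b`, and the margin from the simulation function is not stated in the abstract, so these are treated as given inputs and the function name is hypothetical.

```python
def safety_filter(u_nom, a, b, margin=0.0):
    """Minimally modify u_nom so that a . u + b - margin >= 0 holds.

    Closed-form solution of the single-constraint QP: if the nominal input
    already satisfies the constraint it is returned unchanged; otherwise it is
    projected onto the constraint boundary along the direction of `a`.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b - margin
    if slack >= 0.0:
        return list(u_nom)  # nominal controller is already safe
    norm2 = sum(ai * ai for ai in a)
    if norm2 == 0.0:
        raise ValueError("constraint is infeasible: a is zero and b - margin < 0")
    scale = slack / norm2
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]
```

For example, with `u_nom = [1.0, 0.0]`, `a = [-1.0, 0.0]`, and `b = 0.5`, the filter returns `[0.5, 0.0]`, which lies exactly on the constraint boundary; increasing `margin` pulls the filtered input further into the safe side.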
Optimization-Based Robust Permissive Synthesis for Interval MDPs
We present an optimization-based framework for robust permissive synthesis for Interval Markov Decision Processes (IMDPs), motivated by robotic decision-making under transition uncertainty. In many robotic systems, model inaccuracies and sensing noise lead to interval-valued transition probabilities. While robust IMDP synthesis typically yields a single policy and permissive synthesis assumes exact models, we show that robust permissive synthesis under interval uncertainty can be cast as a global mixed-integer linear program (MILP) that directly encodes robust Bellman constraints. The formulation maximizes a quantitative permissiveness metric (the number of enabled state-action pairs), while guaranteeing that every compliant strategy satisfies probabilistic reachability or expected reward specifications under all admissible transition realizations. To address the exponential complexity of vertex-based uncertainty representations, we derive a dualization-based encoding that eliminates explicit vertex enumeration and scales linearly with the number of successors. Experimental evaluation on four representative robotic benchmark domains demonstrates scalability to IMDPs with hundreds of thousands of states. The proposed framework provides a practical and general foundation for uncertainty-aware, flexibility-preserving controller synthesis in robotic systems.
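For intuition about the robust Bellman constraints mentioned in the abstract above, the inner worst-case problem for one state-action pair of an interval MDP can be solved greedily: start every successor at its lower bound, then hand the remaining probability mass to the lowest-value successors first. This is the standard ordering-based solution for interval uncertainty, not the paper's dualization-based MILP encoding, and the function name is hypothetical.

```python
def worst_case_expectation(values, lower, upper):
    """Minimize sum(p[i] * values[i]) over lower <= p <= upper with sum(p) == 1.

    Returns (worst-case expected value, minimizing distribution).
    """
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("intervals do not contain a valid distribution")
    probs = list(lower)
    remaining = 1.0 - sum(lower)
    # Give leftover mass to successors in ascending order of value.
    for i in sorted(range(len(values)), key=lambda idx: values[idx]):
        extra = min(upper[i] - lower[i], remaining)
        probs[i] += extra
        remaining -= extra
    expected = sum(p * v for p, v in zip(probs, values))
    return expected, probs
```

With successor values `[0.0, 1.0]` and intervals `[0.2, 0.7]` and `[0.3, 0.8]`, the adversary pushes as much mass as allowed onto the zero-value successor, giving a worst-case expectation of about 0.3.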
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models to character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become compliant with physics, they may exhibit substantial deviations from the original motion. To address this issue, we propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate the WBC into our training pipeline and optimize the diffusion model such that the output of the WBC becomes compliant with both physics and the original text instructions. To train PhysMoDPO, we deploy physics-based and task-specific rewards and use them to assign preferences to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements of PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO results in significant improvements when applied to zero-shot motion transfer in simulation and for real-world deployment on a G1 humanoid robot.
comment: Project page: https://mael-zys.github.io/PhysMoDPO/
sim2art: Accurate Articulated Object Modeling from a Single Video using Synthetic Training Data Only
Understanding articulated objects from monocular video is a crucial yet challenging task in robotics and digital twin creation. Existing methods often rely on complex multi-view setups, high-fidelity object scans, or fragile long-term point tracks that frequently fail in casual real-world captures. In this paper, we present sim2art, a data-driven framework that recovers the 3D part segmentation and joint parameters of articulated objects from a single monocular video captured by a freely moving camera. Our core insight is a robust representation based on per-frame surface point sampling, which we augment with short-term scene flow and DINOv3 semantic features. Unlike previous works that depend on error-prone long-term correspondences, our representation is easy to obtain and exhibits a negligible difference between simulation and reality without requiring domain adaptation. Also, by construction, our method relies on single-viewpoint visibility, ensuring that the geometric representation remains consistent across synthetic and real data despite noise and occlusions. Leveraging a suitable Transformer-based architecture, sim2art is trained exclusively on synthetic data yet generalizes strongly to real-world sequences. To address the lack of standardized benchmarks in the field, we introduce two datasets featuring a significantly higher diversity of object categories and instances than prior work. Our evaluations show that sim2art effectively handles large camera motions and complex articulations, outperforming state-of-the-art optimization-based and tracking-dependent methods. sim2art offers a scalable solution that can be easily extended to new object categories without the need for cumbersome real-world annotations. Project webpage: https://aartykov.github.io/sim2art/
Lightweight 3D LiDAR-Based UAV Tracking: An Adaptive Extended Kalman Filtering Approach
Accurate relative positioning is crucial for swarm aerial robotics, enabling coordinated flight and collision avoidance. Although vision-based tracking has been extensively studied, 3D LiDAR-based methods remain underutilized despite their robustness under varying lighting conditions. Existing systems often rely on bulky, power-intensive sensors, making them impractical for small UAVs with strict payload and energy constraints. This paper presents a lightweight LiDAR-based UAV tracking system incorporating an Adaptive Extended Kalman Filter (AEKF) framework. Our approach effectively addresses the challenges posed by sparse, noisy, and nonuniform point cloud data generated by non-repetitive scanning 3D LiDARs, ensuring reliable tracking while remaining suitable for small drones with strict payload constraints. Unlike conventional filtering techniques, the proposed method dynamically adjusts the noise covariance matrices using innovation and residual statistics, thereby enhancing tracking accuracy under real-world conditions. Additionally, a recovery mechanism ensures continuity of tracking during temporary detection failures caused by scattered LiDAR returns or occlusions. Experimental validation was performed using a Livox Mid-360 LiDAR mounted on a DJI F550 UAV in real-world flight scenarios. The proposed method demonstrated robust UAV tracking performance under sparse LiDAR returns and intermittent detections, consistently outperforming both standard Kalman filtering and particle filtering approaches during aggressive maneuvers. These results confirm that the framework enables reliable relative positioning in GPS-denied environments without the need for multi-sensor arrays or external infrastructure.
comment: Presented at the 19th International Conference on Intelligent Autonomous Systems, IAS-19, Genoa, Italy, June 30 to July 4, 2025. To appear in the Springer post-proceedings of the conference
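The covariance adaptation described in the AEKF abstract above can be illustrated with a scalar, linear filter. This is a simplified sketch under stated assumptions: a constant-position model with F = H = 1 replaces the paper's EKF, and the update rules (measurement noise from the post-update residual, process noise from the innovation, both blended with a forgetting factor `alpha`) follow one common adaptive scheme rather than anything the abstract specifies.

```python
def adaptive_kf_step(x, p, q, r, z, alpha=0.5):
    """One predict/update step of a scalar Kalman filter with adaptive Q and R.

    x, p: state estimate and its variance; q, r: process and measurement noise
    variances; z: new measurement; alpha: forgetting factor in [0, 1].
    """
    # Predict (state transition F = 1).
    x_pred = x
    p_pred = p + q

    # Update (measurement model H = 1), using the current r.
    innovation = z - x_pred
    s = p_pred + r
    k = p_pred / s
    x_new = x_pred + k * innovation
    p_new = (1.0 - k) * p_pred

    # Adapt the noise variances from residual and innovation statistics.
    residual = z - x_new
    r_new = alpha * r + (1.0 - alpha) * (residual ** 2 + p_new)
    q_new = alpha * q + (1.0 - alpha) * (k * innovation) ** 2
    return x_new, p_new, q_new, r_new
```

A large surprise in the measurement inflates both variances on the next step, so the filter trusts neither its model nor a single noisy return too strongly, which is the intuition behind adapting to sparse and noisy LiDAR detections.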
RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
Open-vocabulary 3D Scene Graphs (3DSGs) can enhance various downstream tasks in robotics by leveraging structured semantic representations, yet current 3DSG construction methods suffer from semantic inconsistencies caused by noisy cross-image aggregation under occlusions and constrained viewpoints. To mitigate the impact of such inconsistency, we propose RAG-3DSG, which introduces re-shot guided uncertainty estimation. By measuring the semantic consistency between original limited viewpoints and re-shot optimal viewpoints, this method quantifies the underlying semantic ambiguity of each graph object. Based on this quantification, we devise an Object-level Retrieval-Augmented Generation (RAG) that leverages low-uncertainty objects as semantic anchors to retrieve more reliable contextual knowledge, enabling a Vision-Language Model to rectify the predictions of uncertain objects and optimize the final 3DSG. Extensive evaluations across three challenging benchmarks and real-world robot trials demonstrate that RAG-3DSG achieves superior recall and precision, effectively mitigating semantic noise to provide highly reliable scene representations for robotics tasks.
CLAIM: Camera-LiDAR Alignment with Intensity and Monodepth IROS 2025
In this paper, we unleash the potential of the powerful monodepth model in camera-LiDAR calibration and propose CLAIM, a novel method of aligning data from the camera and LiDAR. Given the initial guess and pairs of images and LiDAR point clouds, CLAIM utilizes a coarse-to-fine searching method to find the optimal transformation minimizing a patched Pearson correlation-based structure loss and a mutual information-based texture loss. These two losses serve as good metrics for camera-LiDAR alignment results and, unlike most methods, require no complicated steps of data processing, feature extraction, or feature matching, making our method simple and adaptable to most scenes. We validate CLAIM on the public KITTI, Waymo, and MIAS-LCEC datasets, and the experimental results demonstrate its superior performance compared with state-of-the-art methods. The code is available at https://github.com/Tompson11/claim.
comment: Accepted by IROS 2025
Persistent Autoregressive Mapping with Traffic Rules for Autonomous Driving AAAI2026
Safe autonomous driving requires both accurate HD map construction and persistent awareness of traffic rules, even when their associated signs are no longer visible. However, existing methods either focus solely on geometric elements or treat rules as temporary classifications, failing to capture their persistent effectiveness across extended driving sequences. In this paper, we present PAMR (Persistent Autoregressive Mapping with Traffic Rules), a novel framework that performs autoregressive co-construction of lane vectors and traffic rules from visual observations. Our approach introduces two key mechanisms: Map-Rule Co-Construction for processing driving scenes in temporal segments, and Map-Rule Cache for maintaining rule consistency across these segments. To properly evaluate continuous and consistent map generation, we develop MapDRv2, featuring improved lane geometry annotations. Extensive experiments demonstrate that PAMR achieves superior performance in joint vector-rule mapping tasks, while maintaining persistent rule effectiveness throughout extended driving sequences.
comment: AAAI2026
Barrier-Riccati Synthesis for Nonlinear Safe Control with Expanded Region of Attraction
We present a Riccati-based framework for safety-critical nonlinear control that integrates the barrier states (BaS) methodology with the State-Dependent Riccati Equation (SDRE) approach. The BaS formulation embeds safety constraints into the system dynamics via auxiliary states, enabling safety to be treated as a control objective. To overcome the limited region of attraction in linear BaS controllers, we extend the framework to nonlinear systems using SDRE synthesis applied to the barrier-augmented dynamics and derive a matrix inequality condition that certifies forward invariance of a large region of attraction and guarantees asymptotic safe stabilization. The resulting controller is computed online via pointwise Riccati solutions. We validate the method on an unstable constrained system and cluttered quadrotor navigation tasks, demonstrating improved constraint handling, scalability, and robustness near safety boundaries. This framework offers a principled and computationally tractable solution for synthesizing nonlinear safe feedback in safety-critical environments.
comment: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
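For readers unfamiliar with SDRE synthesis, the standard pointwise Riccati construction can be written as below. This is the textbook form applied to an augmented state, not the paper's specific derivation. The augmented state \bar{x} (system state stacked with barrier states z) and the weights Q and R are notation assumed here for illustration.

```latex
\begin{align}
  % State-dependent coefficient (SDC) form of the barrier-augmented dynamics
  \dot{\bar{x}} &= A(\bar{x})\,\bar{x} + B(\bar{x})\,u,
    \qquad \bar{x} = \begin{bmatrix} x \\ z \end{bmatrix} \\
  % Algebraic Riccati equation solved pointwise at the current \bar{x}
  0 &= A^\top(\bar{x})\,P(\bar{x}) + P(\bar{x})\,A(\bar{x})
       - P(\bar{x})\,B(\bar{x})\,R^{-1}B^\top(\bar{x})\,P(\bar{x}) + Q \\
  % Resulting nonlinear state feedback
  u(\bar{x}) &= -R^{-1}B^\top(\bar{x})\,P(\bar{x})\,\bar{x}
\end{align}
```

Because P is recomputed at every state, the controller is nonlinear even though each solve is a linear-quadratic problem.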
Open-World Motion Forecasting
Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes, which are then processed by a vision-language model to filter inconsistent and over-confident predictions. In parallel, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at https://omen.cs.uni-freiburg.de.
comment: V2: Adapt author affiliation
MoRoCo: An Online Topology-Adaptive Framework for Multi-Operator Multi-Robot Coordination under Restricted Communication
Fleets of autonomous robots are increasingly deployed with multiple human operators in communication-restricted environments for exploration and intervention tasks such as subterranean inspection, reconnaissance, and search-and-rescue. In these settings, communication is often limited to short-range ad-hoc links, making it difficult to coordinate exploration while supporting online human-fleet interactions. Existing work on multi-robot exploration largely focuses on information gathering itself, but pays limited attention to the fact that operators and robots issue time-critical requests during execution. These requests may require different communication structures, ranging from intermittent status delivery to sustained video streaming and teleoperation. To address this challenge, this paper presents MoRoCo, an online topology-adaptive framework for multi-operator multi-robot coordination under restricted communication. MoRoCo is built on a latency-bounded intermittent communication backbone that guarantees a prescribed delay for information collected by any robot to reach an operator, together with a detach-and-rejoin mechanism that enables online team resizing and topology reconfiguration. On top of this backbone, the framework instantiates request-consistent communication subgraphs to realize different modes of operator-robot interaction by jointly assigning robot roles, positions, and communication topology. It further supports the online decomposition and composition of these subgraphs using only local communication, allowing multiple requests to be serviced during exploration. The framework extends to heterogeneous fleets, multiple teams, and robot failures. Extensive human-in-the-loop simulations and hardware experiments demonstrate effective and reliable coordination under restricted communication.
comment: 20 pages, 19 figures. Submitted to IEEE Transactions on Robotics (TRO)
Learning Dexterous Manipulation with Quantized Hand State ICRA 2026
Dexterous robotic hands enable robots to perform complex manipulations that require fine-grained control and adaptability. Achieving such manipulation is challenging because the high degrees of freedom tightly couple hand and arm motions, making learning and control difficult. Successful dexterous manipulation relies not only on precise hand motions, but also on accurate spatial positioning of the arm and coordinated arm-hand dynamics. However, most existing visuomotor policies represent arm and hand actions in a single combined space, which often causes high-dimensional hand actions to dominate the coupled action space and compromise arm control. To address this, we propose DQ-RISE, which quantizes hand states to simplify hand motion prediction while preserving essential patterns, and applies a continuous relaxation that allows arm actions to diffuse jointly with these compact hand states. This design enables the policy to learn arm-hand coordination from data while preventing hand actions from overwhelming the action space. Experiments show that DQ-RISE achieves more balanced and efficient learning, paving the way toward structured and generalizable dexterous manipulation. Project website: http://rise-policy.github.io/DQ-RISE/
comment: accepted by ICRA 2026
History-Aware Visuomotor Policy Learning via Point Tracking ICRA 2026
Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements, such as task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory, and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: http://tonyfang.net/history
comment: accepted by ICRA 2026
EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer
The generalization of vision-language-action (VLA) models heavily relies on diverse training data. However, acquiring large-scale data for robot manipulation across varied object appearances is costly and labor-intensive. To address this limitation, we introduce Embodied Manipulation Media Adaptation (EMMA), a framework for augmenting VLA policies that combines a generative data engine with an effective training pipeline. We introduce DreamTransfer, a diffusion Transformer-based architecture for generating multi-view consistent and geometrically grounded embodied manipulation videos. DreamTransfer enables visual editing of robot videos through prompts, allowing for changes to the foreground, background, and lighting while preserving their 3D structure and geometric validity. We also utilize a hybrid training set of real and generated data and propose AdaMix to enhance the training process. AdaMix is a training strategy that adaptively weights samples according to policy performance to emphasize challenging samples. Comprehensive evaluations demonstrate that videos created by DreamTransfer yield substantial improvements over previous video generation techniques in multi-view consistency, geometric accuracy, and text-conditioning precision. We conduct extensive evaluations with a total of more than 1800 trials in both simulated and real-world robotic environments. In real-world robotic tasks with zero-shot visual settings, our framework achieves a relative performance increase of over 92% compared to training with real data alone, and improves by an additional 17% with AdaMix, demonstrating its efficacy in enhancing policy generalization.
RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields
Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.
comment: 26 Pages, 20 figures
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation CVPR 2026
Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. Additionally, we identify the last-mile problem in zero-shot navigation, namely determining a feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the challenging GOAT-Bench and HM3D-ObjNav benchmarks. The code will be publicly available at https://github.com/ylwhxht/MSGNav.
comment: 18 pages, Accepted by CVPR 2026
MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
TurboMap: GPU-Accelerated Local Mapping for Visual SLAM IROS 2026
In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.
comment: Submitted to IROS 2026
H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos
Large-scale pre-training using egocentric human videos has proven effective for robot learning. However, the models pre-trained on such data can be suboptimal for robot learning due to the significant visual gap between human hands and those of different robots. To remedy this, we propose H2R, a human-to-robot data augmentation pipeline that converts egocentric human videos into robot-centric visual data. H2R estimates human hand pose from videos, retargets the motion to simulated robotic arms, removes human limbs via segmentation and inpainting, and composites rendered robot embodiments into the original frames with camera-aligned geometry. This process explicitly bridges the visual gap between human and robot embodiments during pre-training. We apply H2R to augment large-scale egocentric human video datasets such as Ego4D and SSv2. To verify the effectiveness of the augmentation pipeline, we introduce a CLIP-based image-text similarity metric that quantitatively evaluates the semantic fidelity of robot-rendered frames to the original human actions. We evaluate H2R through comprehensive experiments in both simulation and real-world settings. In simulation, H2R consistently improves downstream success rates across four benchmark suites (Robomimic, RLBench, PushT, and CortexBench), yielding gains of 1.3%-10.2% across different visual encoders and policy learning methods. In real-world experiments, H2R improves performance on UR5 and dual-arm Franka/UR5 manipulation platforms, achieving 3.3%-23.3% success rate gains across gripper-based, dexterous, and bimanual tasks. We further demonstrate the potential of H2R in cross-embodiment generalization and its compatibility with vision-language-action models. These results indicate that H2R improves the generalization ability of robotic policies by mitigating the visual discrepancies between human and robot domains.
DynaFlow: Dynamics-embedded Flow Matching for Physically Consistent Motion Generation from State-only Demonstrations
This paper introduces DynaFlow, a novel framework that embeds a differentiable simulator directly into a flow matching model. By generating trajectories in the action space and mapping them to dynamically feasible state trajectories via the simulator, DynaFlow ensures all outputs are physically consistent by construction. This end-to-end differentiable architecture enables training on state-only demonstrations, allowing the model to simultaneously generate physically consistent state trajectories while inferring the underlying action sequences required to produce them. We demonstrate the effectiveness of our approach through quantitative evaluations and showcase its real-world applicability by deploying the generated actions onto a physical Go1 quadruped robot. The robot successfully reproduces diverse gaits present in the dataset, executes long-horizon motions in open-loop control, and translates infeasible kinematic demonstrations into dynamically executable, stylistic behaviors. These hardware experiments validate that DynaFlow produces deployable, highly effective motions on real-world hardware from state-only demonstrations, effectively bridging the gap between kinematic data and real-world execution.
comment: 8 pages
TinyIO: Lightweight Reparameterized Inertial Odometry
Inertial odometry (IO) is a widely used approach for localization on mobile devices; however, obtaining a lightweight IO model that also achieves high accuracy remains challenging. To address this issue, we propose TinyIO, a lightweight IO method. During training, we adopt a multi-branch architecture to extract diverse motion features more effectively. At inference time, the trained multi-branch model is converted into an equivalent single-path architecture to reduce computational complexity. We further propose a Dual-Path Adaptive Attention mechanism (DPAA), which enhances TinyIO's perception of contextual motion along both channel and temporal dimensions with negligible additional parameters. Extensive experiments on public datasets demonstrate that our method attains a favorable trade-off between accuracy and model size. On the RoNIN dataset, TinyIO reduces the ATE by 23.53% compared with R-ResNet and decreases the parameter count by 3.68%.
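The train-time multi-branch to inference-time single-path conversion described above can be illustrated with a minimal structural reparameterization sketch. It assumes a RepVGG-style block (a 3-tap convolution, a 1-tap convolution, and an identity shortcut, single channel, no normalization layers). The abstract does not specify TinyIO's actual branches, so the branch choice and all function names here are hypothetical.

```python
def conv1d_same(signal, kernel, bias=0.0):
    """Single-channel 1D cross-correlation with zero padding; odd kernel length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
        for i in range(len(signal))
    ]

def multi_branch(signal, w3, b3, w1, b1):
    """Training-time form: 3-tap branch + 1-tap branch + identity shortcut."""
    y3 = conv1d_same(signal, w3, b3)
    y1 = conv1d_same(signal, [w1], b1)
    return [a + b + x for a, b, x in zip(y3, y1, signal)]

def reparameterize(w3, b3, w1, b1):
    """Fold the three branches into one 3-tap kernel and one bias."""
    merged = list(w3)
    merged[1] += w1 + 1.0  # the 1-tap weight and the identity both act on the center tap
    return merged, b3 + b1

# Usage: the merged single-path convolution reproduces the multi-branch output.
imu_channel = [0.5, -1.0, 2.0, 0.0, 3.5]
w3, b3, w1, b1 = [0.2, -0.4, 0.1], 0.05, 0.7, -0.3
train_out = multi_branch(imu_channel, w3, b3, w1, b1)
merged_kernel, merged_bias = reparameterize(w3, b3, w1, b1)
infer_out = conv1d_same(imu_channel, merged_kernel, merged_bias)
```

The equivalence holds because every branch is linear in the input, so their kernels and biases can be summed once after training.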
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving ICRA 2026
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources, which are limited in autonomous vehicles, due to its reliance on an LLM and numerous visual tokens from sensor inputs. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x, and enabling real-time E2E driving on a standard GPU.
comment: Accepted to ICRA 2026
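The anchor-context merging step in the SToRM abstract can be pictured with the sketch below. It is only one plausible reading: anchors are taken as the top-scoring tokens, "relevant" is interpreted as highest cosine similarity, and merging is a plain average. All three choices and the function names are assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def anchor_context_merge(tokens, scores, num_anchors):
    """Keep the top-scoring tokens as anchors and average each remaining
    (context) token into its most similar anchor. Hypothetical helper."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    anchor_ids = sorted(order[:num_anchors])  # preserve original token order
    context_ids = order[num_anchors:]
    groups = {a: [tokens[a]] for a in anchor_ids}
    for c in context_ids:
        best = max(anchor_ids, key=lambda a: cosine(tokens[c], tokens[a]))
        groups[best].append(tokens[c])
    merged = []
    for a in anchor_ids:
        members = groups[a]
        dim = len(tokens[a])
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged

# Usage: four visual tokens reduced to two, guided by importance scores.
visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
importance = [0.9, 0.2, 0.8, 0.1]
reduced = anchor_context_merge(visual_tokens, importance, num_anchors=2)
```

Averaging keeps some information from dropped tokens, in contrast to pruning, which discards them outright.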
Pose Estimation of a Thruster-Driven Bioinspired Multi-Link Robot
This work demonstrates simultaneous pose (position and orientation) and shape estimation for a free-floating, bioinspired multi-link robot with unactuated joints, link-mounted thrusters for control, and a single gyroscope per link, resulting in an underactuated, minimally sensed platform. Because the inter-link joint angles are constrained, translation and rotation of the multi-link system requires cyclic, reciprocating actuation of the thrusters, referred to as a gait. Through a proof-of-concept hardware experiment and offline analysis, we show that the robot's shape can be reliably estimated using an Unscented Kalman Filter augmented with Gaussian process residual models to compensate for non-zero-mean, non-Gaussian noise, while the pose exhibits drift expected from gyroscope integration in the absence of absolute position measurements. Experimental results demonstrate that a Gaussian process model trained on a multi-gait dataset (forward, backward, left, right, and turning) performs comparably to one trained exclusively on forward-gait data, revealing an overlap in the gait input space, which can be exploited to reduce per-gait training data requirements while enhancing the filter's generalizability across multiple gaits. Lastly, we introduce a heuristic derived from the observability Gramian to correlate joint angle estimate quality with gait periodicity and thruster inputs, highlighting how control affects estimation quality.
comment: 8 pages, 8 figures
VLAD-Grasp: Zero-shot Grasp Detection via Vision-Language Models
Robotic grasping is a fundamental capability for enabling autonomous manipulation and typically admits infinitely many solutions. State-of-the-art approaches for grasping rely on learning from large-scale datasets comprising expert annotations of feasible grasps. Curating such datasets is challenging, and hence, learning-based methods are limited by the solution coverage of the dataset, and require retraining to handle novel objects. To address this, we present VLAD-Grasp, a Vision-Language model Assisted zero-shot approach for Detecting Grasps. Our method (1) prompts a large vision-language model to generate a goal image where a virtual cylindrical proxy intersects the object's geometry, explicitly encoding an antipodal grasp axis in image space, then (2) predicts depth and segmentation to lift this generated image into 3D, and (3) aligns generated and observed object point clouds via principal components and correspondence-free optimization to recover an executable grasp pose. Unlike prior work, our approach is training-free and does not require curated grasp datasets, while achieving performance competitive with the state-of-the-art methods on the Cornell and Jacquard datasets. Furthermore, we demonstrate zero-shot generalization to real-world objects on a Franka Research 3 robot, highlighting vision-language models as powerful priors for robotic manipulation.
comment: 8 pages, 4 figures, under review
A Deconfounding Framework for Human Behavior Prediction: Enhancing Robotic Systems in Dynamic Environments
Accurate prediction of human behavior is crucial for effective human-robot interaction (HRI) systems, especially in dynamic environments where real-time decisions are essential. This paper addresses the challenge of forecasting future human behavior using multivariate time series data from wearable sensors, which capture various aspects of human movement. The presence of hidden confounding factors in this data often leads to biased predictions, limiting the reliability of traditional models. To overcome this, we propose a robust predictive model that integrates deconfounding techniques with advanced time series prediction methods, enhancing the model's ability to isolate true causal relationships and improve prediction accuracy. Evaluation on real-world datasets demonstrates that our approach significantly outperforms traditional methods, providing a more reliable foundation for responsive and adaptive HRI systems.
comment: 7 pages, Under review
Register Any Point: Scaling 3D Point Cloud Registration by Flow Matching
Point cloud registration aligns multiple unposed point clouds into a common reference frame and is a core step for 3D reconstruction and robot localization without an initial guess. In this work, we cast registration as conditional generation: a learned, continuous point-wise velocity field transports noisy points to a registered scene, from which the pose of each view is recovered. Unlike prior methods that perform correspondence matching to estimate pairwise transformations and then optimize a pose graph for multi-view registration, our model directly generates the registered point cloud, yielding both efficiency and point-level global consistency. By scaling the training data and conducting test-time rigidity enforcement, our approach achieves state-of-the-art results on existing pairwise registration benchmarks and on our proposed cross-domain multi-view registration benchmark. The superior zero-shot performance on this benchmark shows that our method generalizes across view counts, scene scales, and sensor modalities even with low overlap. Source code available at: https://github.com/PRBonn/RAP.
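Recovering a view's pose from generated points amounts to fitting a rigid transform between the input points and their generated positions. The sketch below shows the closed-form least-squares fit in 2D, kept in pure Python for brevity; a 3D version would typically use the SVD-based Kabsch solution instead. Whether the paper uses this exact fit is not stated in the abstract, and the function names are hypothetical.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst,
    given one-to-one point correspondences."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy  # centered source point
        bx, by = qx - dx, qy - dy  # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation aligns the rotated source centroid with the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)

def apply_rigid_2d(points, theta, t):
    """Rotate points by theta and translate by t."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + t[0], s * x + c * y + t[1]) for x, y in points]

# Usage: recover a known pose from a view and its registered copy.
view_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
registered = apply_rigid_2d(view_points, 0.3, (1.0, -2.0))
est_theta, est_t = fit_rigid_2d(view_points, registered)
```

Re-applying the fitted transform to the original view replaces the generated points with an exactly rigid copy, which is one simple way to picture rigidity enforcement.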
No More Blind Spots: Learning Vision-Based Omnidirectional Bipedal Locomotion for Challenging Terrain
Effective bipedal locomotion in dynamic environments, such as cluttered indoor spaces or uneven terrain, requires agile and adaptive movement in all directions. This necessitates omnidirectional terrain sensing and a controller capable of processing such input. We present a learning framework for vision-based omnidirectional bipedal locomotion, enabling seamless movement using depth images. A key challenge is the high computational cost of rendering omnidirectional depth images in simulation, making traditional sim-to-real reinforcement learning (RL) impractical. Our method combines a robust blind controller with a teacher policy that supervises a vision-based student policy, trained on noise-augmented terrain data to avoid rendering costs during RL and ensure robustness. We also introduce a data augmentation technique for supervised student training, accelerating training by up to 10 times compared to conventional methods. Our framework is validated through simulation and real-world tests, demonstrating effective omnidirectional locomotion with minimal reliance on expensive rendering. This is, to the best of our knowledge, the first demonstration of vision-based omnidirectional bipedal locomotion, showcasing its adaptability to diverse terrains.
World Models for Learning Dexterous Hand-Object Interactions from Human Videos
Modeling dexterous hand-object interactions is challenging as it requires understanding how subtle finger motions influence the environment through contact with objects. While recent world models address interaction modeling, they typically rely on coarse action spaces that fail to capture fine-grained dexterity. We, therefore, introduce DexWM, a Dexterous Interaction World Model that predicts future latent states of the environment conditioned on past states and dexterous actions. To overcome the scarcity of finely annotated dexterous datasets, DexWM represents actions using finger keypoints extracted from egocentric videos, enabling training on over 900 hours of human and non-dexterous robot data. Further, to accurately model dexterity, we find that predicting visual features alone is insufficient; therefore, we incorporate an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, or full-body actions in future-state prediction and demonstrates strong zero-shot transfer to unseen skills on a Franka Panda arm with an Allegro gripper, surpassing Diffusion Policy by over 50% on average across grasping, placing, and reaching tasks.
Real-World Deployment of Cloud-based Autonomous Mobility Systems for Outdoor and Indoor Environments
Autonomous mobility systems increasingly operate in dense and dynamic environments where perception occlusions, limited sensing coverage, and multi-agent interactions pose major challenges. While onboard sensors provide essential local perception, they often struggle to maintain reliable situational awareness in crowded urban or indoor settings. This article presents the Cloud-based Autonomous Mobility (CAM) framework, a generalized architecture that integrates infrastructure-based intelligent sensing with cloud-level coordination to enhance autonomous operations. The system deploys distributed Intelligent Sensor Nodes (ISNs) equipped with cameras, LiDAR, and edge computing to perform multi-modal perception and transmit structured information to a cloud platform via high-speed wireless communication. The cloud aggregates observations from multiple nodes to generate a global scene representation for other autonomous modules, such as decision making, motion planning, etc. Real-world deployments in an urban roundabout and a hospital-like indoor environment demonstrate improved perception robustness, safety, and coordination for future intelligent mobility systems.
comment: This paper has been submitted to IEEE Robotics and Automation Magazine
GeoFIK: A Fast and Reliable Geometric Solver for the IK of the Franka Arm based on Screw Theory Enabling Multiple Redundancy Parameters
Modern robotics applications require an inverse kinematics (IK) solver that is fast, robust and consistent, and that provides all possible solutions. Currently, the Franka robot arm is the most widely used manipulator in robotics research. With 7 DOFs, the IK of this robot is not only complex due to its 1-DOF redundancy, but also due to the link offsets at the wrist and elbow. Due to this complexity, none of the Franka IK solvers available in the literature provide satisfactory results when used in real-world applications. Therefore, in this paper we introduce GeoFIK (Geometric Franka IK), an analytical IK solver that allows the use of different joint variables to resolve the redundancy. The approach uses screw theory to describe the entire geometry of the robot, allowing the computation of the Jacobian matrix prior to computation of joint angles. All singularities are identified and handled. As an example of how the geometric elements obtained by the IK can be exploited, a solver with the swivel angle as the free variable is provided. Several experiments are carried out to validate the speed, robustness and reliability of the GeoFIK against two state-of-the-art solvers.
Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation
Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, implementing accurate autonomous control can be difficult due to poor end-effector proprioception. Joint encoder readings are typically inaccurate due to kinematic non-idealities in their cable-driven transmissions. Vision-based pose estimation approaches are highly effective, but lack real-time capability or generalizability, or can be hard to train. In this work, we demonstrate a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. We demonstrate the potential of this approach to correct for noisy pose estimates through a real robot dataset and the potential real-time processing ability. Our approach is able to reduce more than 50% of hand-eye translation errors in the dataset, reaching the same performance level as an existing optimization-based method. Our approach is four times faster, and capable of near real-time inference at 22 Hz. A zero-shot prediction on an unseen dataset shows good generalization ability, and can be further finetuned for increased performance without human labeling.
Multiagent Systems
TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems
With the rapid development of LLM-based multi-agent systems (MAS), significant safety and security concerns have emerged, introducing novel risks that go beyond those of single agents or LLMs. Despite attempts to address these issues, the existing literature lacks a cohesive safeguarding system specialized for MAS risks. In this work, we introduce TrinityGuard, a comprehensive safety evaluation and monitoring framework for LLM-based MAS, grounded in the OWASP standards. Specifically, TrinityGuard encompasses a three-tier fine-grained risk taxonomy that identifies 20 risk types, covering single-agent vulnerabilities, inter-agent communication threats, and system-level emergent hazards. Designed for scalability across various MAS structures and platforms, TrinityGuard is organized in a trinity manner, involving an MAS abstraction layer that can be adapted to any MAS structure, an evaluation layer containing risk-specific test modules, and runtime monitor agents coordinated by a unified LLM Judge Factory. During evaluation, TrinityGuard executes curated attack probes to generate detailed vulnerability reports for each risk type, where monitor agents analyze structured execution traces and issue real-time alerts, enabling both pre-development evaluation and runtime monitoring. We further formalize these safety metrics and present detailed case studies across various representative MAS examples, showcasing the versatility and reliability of TrinityGuard. Overall, TrinityGuard acts as a comprehensive framework for evaluating and monitoring various risks in MAS, paving the way for further research into their safety and security.
PMAx: An Agentic Framework for AI-Driven Process Mining
Process mining provides powerful insights into organizational workflows, but extracting these insights typically requires expertise in specialized query languages and data science tools. Large Language Models (LLMs) offer the potential to democratize process mining by enabling business users to interact with process data through natural language. However, using LLMs as direct analytical engines over raw event logs introduces fundamental challenges: LLMs struggle with deterministic reasoning and may hallucinate metrics, while sending large, sensitive logs to external AI services raises serious data-privacy concerns. To address these limitations, we present PMAx, an autonomous agentic framework that functions as a virtual process analyst. Rather than relying on LLMs to generate process models or compute analytical results, PMAx employs a privacy-preserving multi-agent architecture. An Engineer agent analyzes event-log metadata and autonomously generates local scripts to run established process mining algorithms, compute exact metrics, and produce artifacts such as process models, summary tables, and visualizations. An Analyst agent then interprets these insights and artifacts to compile comprehensive reports. By separating computation from interpretation and executing analysis locally, PMAx ensures mathematical accuracy and data privacy while enabling non-technical users to transform high-level business questions into reliable process insights.
comment: Submitted to EMMSAD 2026 (tool demonstration track), under review
Intelligent Co-Design: An Interactive LLM Framework for Interior Spatial Design via Multi-Modal Agents
In architectural interior design, miscommunication frequently arises as clients lack design knowledge, while designers struggle to explain complex spatial relationships, leading to delayed timelines and financial losses. Recent advancements in generative layout tools narrow the gap by automating 3D visualizations. However, prevailing methodologies exhibit limitations: rule-based systems implement hard-coded spatial constraints that restrict participatory engagement, while data-driven models rely on extensive training datasets. Recent large language models (LLMs) bridge this gap by enabling intuitive reasoning about spatial relationships through natural language. This research presents an LLM-based, multimodal, multi-agent framework that dynamically converts natural language descriptions and imagery into 3D designs. Specialized agents (Reference, Spatial, Interactive, Grader), operating via prompt guidelines, collaboratively address core challenges: the agent system enables real-time user interaction for iterative spatial refinement, while Retrieval-Augmented Generation (RAG) reduces data dependency without requiring task-specific model training. This framework accurately interprets spatial intent and generates optimized 3D indoor design, improving productivity, and encouraging nondesigner participation. Evaluations across diverse floor plans and user questionnaires demonstrate effectiveness. An independent LLM evaluator consistently rated participatory layouts higher in user intent alignment, aesthetic coherence, functionality, and circulation. Questionnaire results indicated 77% satisfaction and a clear preference over traditional design software. These findings suggest the framework enhances user-centric communication and fosters more inclusive, effective, and resilient design processes. Project page: https://rsigktyper.github.io/AICodesign/
comment: 25 pages, 20 figures; accepted for publication in the Proceedings of ACADIA 2025
SAGE: Multi-Agent Self-Evolution for LLM Reasoning
Reinforcement learning with verifiable rewards improves reasoning in large language models (LLMs), but many methods still rely on large human-labeled datasets. While self-play reduces this dependency, it often lacks explicit planning and strong quality control, limiting stability in long-horizon multi-step reasoning. We present SAGE (Self-evolving Agents for Generalized reasoning Evolution), a closed-loop framework in which four agents (Challenger, Planner, Solver, and Critic) co-evolve from a shared LLM backbone using only a small seed set. The Challenger continuously generates increasingly difficult tasks; the Planner converts each task into a structured multi-step plan; and the Solver follows the plan to produce an answer, whose correctness is determined by external verifiers. The Critic scores and filters both generated questions and plans to prevent curriculum drift and maintain training signal quality, enabling stable self-training. Across mathematics and code-generation benchmarks, SAGE delivers consistent gains across model scales, improving the Qwen-2.5-7B model by 8.9% on LiveCodeBench and 10.7% on OlympiadBench.
Token Coherence: Adapting MESI Cache Protocols to Minimize Synchronization Overhead in Multi-Agent LLM Systems
Multi-agent LLM orchestration incurs synchronization costs scaling as O(n x S x |D|) in agents, steps, and artifact size under naive broadcast -- a regime I term broadcast-induced triply-multiplicative overhead. I argue this pathology is a structural residue of full-state rebroadcast, not an inherent property of multi-agent coordination. The central claim: synchronization cost explosion in LLM multi-agent systems maps with formal precision onto the cache coherence problem in shared-memory multiprocessors, and MESI-protocol invalidation transfers to artifact synchronization under minimal structural modification. I construct the Artifact Coherence System (ACS) and prove the Token Coherence Theorem: lazy invalidation attenuates cost by at least S/(n + W(d_i)) when S > n + W(d_i), converting O(n x S x |D|) to O((n + W) x |D|). A TLA+-verified protocol enforces single-writer safety, monotonic versioning, and bounded staleness across ~2,400 explored states. Simulation across four workload configurations yields token savings of 95.0% +/- 1.3% at V=0.05, 92.3% +/- 1.4% at V=0.10, 88.3% +/- 1.5% at V=0.25, and 84.2% +/- 1.3% at V=0.50 -- each exceeding the theorem's conservative lower bounds. Savings of ~81% persist at V=0.9, contrary to the predicted collapse threshold. Contributions: (1) formal MESI-to-artifact state mapping; (2) Token Coherence Theorem as savings lower bound; (3) TLA+-verified protocol with three proven invariants; (4) characterization of conditional artifact access semantics resolving the always-read objection; (5) reference Python implementation integrating with LangGraph, CrewAI, and AutoGen via thin adapter layers.
comment: 25 pages. Code and reproduction scripts at https://github.com/hipvlady/agent-coherence
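The cost comparison in the abstract above can be made concrete with a little arithmetic. The sketch below is only an illustration: it treats the asymptotic expressions O(n x S x |D|) and O((n + W) x |D|) as exact products, reads W as a count of writes to the artifact (the abstract does not define W), and uses hypothetical function names and example numbers that do not come from the paper or its repository.

```python
def broadcast_cost(n_agents: int, steps: int, artifact_tokens: int) -> int:
    """Naive broadcast: every agent receives the full artifact at every step."""
    return n_agents * steps * artifact_tokens


def lazy_invalidation_cost(n_agents: int, writes: int, artifact_tokens: int) -> int:
    """Lazy invalidation, taking the O((n + W) x |D|) form as an exact product.

    Assumption: `writes` stands in for W, which the abstract leaves undefined.
    """
    return (n_agents + writes) * artifact_tokens


def savings_fraction(n_agents: int, steps: int, writes: int, artifact_tokens: int) -> float:
    """Fraction of tokens saved relative to naive broadcast."""
    naive = broadcast_cost(n_agents, steps, artifact_tokens)
    lazy = lazy_invalidation_cost(n_agents, writes, artifact_tokens)
    return 1.0 - lazy / naive


if __name__ == "__main__":
    # Hypothetical workload: 4 agents, 100 steps, 10 writes, 1,000-token artifact.
    n, S, W, D = 4, 100, 10, 1000
    print("naive tokens:", broadcast_cost(n, S, D))
    print("lazy tokens: ", lazy_invalidation_cost(n, W, D))
    print("savings:     ", savings_fraction(n, S, W, D))
    # The attenuation factor stated in the abstract, S / (n + W), applies when S > n + W.
    print("stated lower bound on attenuation:", S / (n + W))
```

Under these assumptions the ratio of naive to lazy cost is n x S / (n + W), which is at least the S / (n + W) factor quoted in the abstract whenever there is at least one agent.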
Why Agents Compromise Safety Under Pressure
Large Language Model agents deployed in complex environments frequently encounter a conflict between maximizing goal achievement and adhering to safety constraints. This paper identifies a new concept called Agentic Pressure, which characterizes the endogenous tension emerging when compliant execution becomes infeasible. We demonstrate that, under this pressure, agents exhibit normative drift, in which they strategically sacrifice safety to preserve utility. Notably, we find that advanced reasoning capabilities accelerate this decline, as models construct linguistic rationalizations to justify violations. Finally, we analyze the root causes and explore preliminary mitigation strategies, such as pressure isolation, which attempts to restore alignment by decoupling decision-making from pressure signals.
comment: 17 pages, 5 figures
Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning ICAPS 2026
Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.
comment: 26 pages. Accepted at ICAPS 2026
Forecast-Aware Cooperative Planning on Temporal Graphs under Stochastic Adversarial Risk
Cooperative multi-robot missions often require teams of robots to traverse environments where traversal risk evolves due to adversary patrols or shifting hazards with stochastic dynamics. While support coordination - where robots assist teammates in traversing risky regions - can significantly reduce mission costs, its effectiveness depends on the team's ability to anticipate future risk. Existing support-based frameworks assume static risk landscapes and therefore fail to account for predictable temporal trends in risk evolution. We propose a forecast-aware cooperative planning framework that integrates stochastic risk forecasting with anticipatory support allocation on temporal graphs. By modeling adversary dynamics as a first-order Markov stay-move process over graph edges, we propagate the resulting edge-occupancy probabilities forward in time to generate time-indexed edge-risk forecasts. These forecasts guide the proactive allocation of support positions to forecasted risky edges for effective support coordination, while also informing joint robot path planning. Experimental results demonstrate that our approach consistently reduces total expected team cost compared to non-anticipatory baselines, approaching the performance of an oracle planner.
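The forward propagation of edge-occupancy probabilities described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single adversary that stays on its current edge with probability `p_stay` and otherwise moves to a uniformly chosen adjacent edge. The uniform move rule, the function names, and the three-edge example are all assumptions introduced here.

```python
from typing import Dict, List


def propagate(occupancy: Dict[str, float],
              neighbors: Dict[str, List[str]],
              p_stay: float) -> Dict[str, float]:
    """One step of a first-order stay-move Markov process over graph edges."""
    nxt = {edge: 0.0 for edge in occupancy}
    for edge, prob in occupancy.items():
        adjacent = neighbors[edge]
        if not adjacent:
            # An isolated edge has nowhere to move to, so all mass stays put.
            nxt[edge] += prob
            continue
        nxt[edge] += p_stay * prob
        share = (1.0 - p_stay) * prob / len(adjacent)  # assumed uniform move rule
        for other in adjacent:
            nxt[other] += share
    return nxt


def forecast(occupancy: Dict[str, float],
             neighbors: Dict[str, List[str]],
             p_stay: float,
             horizon: int) -> List[Dict[str, float]]:
    """Time-indexed edge-occupancy forecasts for t = 0, 1, ..., horizon."""
    out = [dict(occupancy)]
    for _ in range(horizon):
        out.append(propagate(out[-1], neighbors, p_stay))
    return out


if __name__ == "__main__":
    # Hypothetical chain of three edges a - b - c, adversary known to start on a.
    neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    start = {"a": 1.0, "b": 0.0, "c": 0.0}
    for t, dist in enumerate(forecast(start, neighbors, p_stay=0.5, horizon=3)):
        print(t, dist)
```

Each forecast in the returned list is a probability distribution over edges, so a planner could read off the time-indexed risk of an edge directly from `forecast(...)[t][edge]`.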
The Geometry of Transmission Zeros in Distance-Based Formations
This letter presents a geometric input-output analysis of distance-based formation control, focusing on the phenomenon of steady-state signal blocking between actuator and sensor pairs. We characterize steady-state multivariable transmission zeros, where fully excited rigid-body and deformational modes destructively interfere at the measured output. By analyzing the DC gain transfer matrix of the linearized closed-loop dynamics, we prove that for connected, flexible frameworks, structural transmission zeros are strictly non-generic; the configuration-dependent cross-coupling required to induce them occupies a proper algebraic set of measure zero. However, because extracting actionable sensor-placement rules from these complex algebraic varieties is analytically intractable, we restrict our focus to infinitesimally rigid formations. For these baselines, we prove that the absence of internal flexes forces the zero-transmission condition to collapse into an explicit affine hyperplane defined by the actuator and the global formation geometry, which we term the spatial locus of transmission zeros. Finally, we introduce the global transmission polygon--a convex polytope constructed from the intersection of these loci. This construct provides a direct geometric synthesis rule for robust sensor allocation, guaranteeing full-rank steady-state transmission against arbitrary single-node excitations.
comment: 6 pages, 2 figures. Submitted to IEEE Control Systems Letters (L-CSS) and CDC 2026
MAC: Multi-Agent Constitution Learning
Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi-Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human-readable and auditable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without requiring parameter updates.
comment: Code: https://github.com/rushil-thareja/MAC-Multi-Agent-Constitution-Learning | PyPI: https://pypi.org/project/mac-prompt/ | Website: https://www.mac-prompt.com/
Don't Trust Stubborn Neighbors: A Security Framework for Agentic Networks
Large Language Model (LLM)-based Multi-Agent Systems (MASs) are increasingly deployed for agentic tasks, such as web automation, itinerary planning, and collaborative problem solving. Yet, their interactive nature introduces new security risks: malicious or compromised agents can exploit communication channels to propagate misinformation and manipulate collective outcomes. In this paper, we study how such manipulation can arise and spread by borrowing the Friedkin-Johnsen opinion formation model from social sciences to propose a general theoretical framework to study LLM-MAS. Remarkably, this model closely captures LLM-MAS behavior, as we verify in extensive experiments across different network topologies and attack and defense scenarios. Theoretically and empirically, we find that a single highly stubborn and persuasive agent can take over MAS dynamics, underscoring the systems' high susceptibility to attacks by triggering a persuasion cascade that reshapes collective opinion. Our theoretical analysis reveals three mechanisms to increase system security: a) increasing the number of benign agents, b) increasing the innate stubbornness or peer-resistance of agents, or c) reducing trust in potential adversaries. Because scaling is computationally expensive and high stubbornness degrades the network's ability to reach consensus, we propose a new mechanism to mitigate threats by a trust-adaptive defense that dynamically adjusts inter-agent trust to limit adversarial influence while maintaining cooperative performance. Extensive experiments confirm that this mechanism effectively defends against manipulation.
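As a rough illustration of the opinion dynamics referred to above, the sketch below implements the textbook Friedkin-Johnsen update, x_i(t+1) = (1 - s_i) * sum_j W_ij x_j(t) + s_i * x_i(0), with a row-stochastic trust matrix W and per-agent stubbornness s_i. How the paper maps LLM agents onto these parameters is not stated in the abstract, so the function names and the three-agent scenario are assumptions made for illustration only.

```python
from typing import List


def fj_step(opinions: List[float], initial: List[float],
            trust: List[List[float]], stubbornness: List[float]) -> List[float]:
    """One Friedkin-Johnsen update; each row of `trust` is assumed to sum to 1."""
    n = len(opinions)
    updated = []
    for i in range(n):
        social = sum(trust[i][j] * opinions[j] for j in range(n))
        updated.append((1.0 - stubbornness[i]) * social + stubbornness[i] * initial[i])
    return updated


def fj_simulate(initial: List[float], trust: List[List[float]],
                stubbornness: List[float], steps: int = 200) -> List[float]:
    """Iterate the update from the initial opinions for a fixed number of steps."""
    x = list(initial)
    for _ in range(steps):
        x = fj_step(x, initial, trust, stubbornness)
    return x


if __name__ == "__main__":
    # Hypothetical network: agent 0 is a fully stubborn adversary holding opinion 1;
    # agents 1 and 2 start at opinion 0 and trust all three agents equally.
    uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]
    initial = [1.0, 0.0, 0.0]
    print("pliable benign agents:  ", fj_simulate(initial, uniform, [1.0, 0.0, 0.0]))
    print("stubborn benign agents: ", fj_simulate(initial, uniform, [1.0, 0.5, 0.5]))
```

In this toy setting, benign agents with zero stubbornness are pulled all the way to the adversary's opinion, while raising their stubbornness to 0.5 caps the drift at 0.25, which mirrors mechanism (b) in the abstract.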
ClawWorm: Self-Propagating Attacks Across LLM Agent Ecosystems
Autonomous LLM-based agents increasingly operate as long-running processes forming densely interconnected multi-agent ecosystems, whose security properties remain largely unexplored. In particular, OpenClaw, an open-source platform with over 40,000 active instances, has stood out recently with its persistent configurations, tool-execution privileges, and cross-platform messaging capabilities. In this work, we present ClawWorm, the first self-replicating worm attack against a production-scale agent framework, achieving a fully autonomous infection cycle initiated by a single message: the worm first hijacks the victim's core configuration to establish persistent presence across session restarts, then executes an arbitrary payload upon each reboot, and finally propagates itself to every newly encountered peer without further attacker intervention. We evaluate the attack on a controlled testbed across three distinct infection vectors and three payload types, demonstrating high success rates in end-to-end infection, sustained multi-hop propagation, and payload independence from the worm mechanism. We analyse the architectural root causes underlying these vulnerabilities and propose defence strategies targeting each identified trust boundary. Code and samples will be released upon completion of responsible disclosure.
S2Act: Simple Spiking Actor
Spiking neural networks (SNNs) and biologically-inspired learning mechanisms are attractive in mobile robotics, where the size and performance of onboard neural network policies are constrained by power and computational budgets. Existing SNN approaches, such as population coding, reward modulation, and hybrid artificial neural network (ANN)-SNN architectures, have shown promising results; however, they face challenges in complex, highly stochastic environments due to SNN sensitivity to hyperparameters and inconsistent gradient signals. To address these challenges, we propose simple spiking actor (S2Act), a computationally lightweight framework that deploys an RL policy using an SNN in three steps: (1) architect an actor-critic model based on an approximated network of rate-based spiking neurons, (2) train the network with gradients using compatible activation functions, and (3) transfer the trained weights into physical parameters of rate-based leaky integrate-and-fire (LIF) neurons for inference and deployment. By globally shaping LIF neuron parameters such that their rate-based responses approximate ReLU activations, S2Act effectively mitigates the vanishing gradient problem, while pre-constraining LIF response curves reduces reliance on complex SNN-specific hyperparameter tuning. We demonstrate our method in two multi-agent stochastic environments (capture-the-flag and parking) that capture the complexity of multi-robot interactions, and deploy our trained policies on physical TurtleBot platforms using Intel's Loihi neuromorphic hardware. Our experimental results show that S2Act outperforms relevant baselines in task performance and real-time inference in nearly all considered scenarios, highlighting its potential for rapid prototyping and efficient real-world deployment of SNN-based RL policies.
comment: This work has been submitted to the IEEE for possible publication
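To see why a rate-based LIF response can stand in for a ReLU, the sketch below evaluates the standard steady-state firing rate of a leaky integrate-and-fire neuron with a normalized threshold of 1: zero for input J <= 1, and 1 / (tau_ref + tau_rc * ln(1 + 1/(J - 1))) otherwise. This is the textbook rate equation, not necessarily the parameterization used in S2Act; the function names, the zero refractory period, and the shifted-ReLU comparison are assumptions made here for illustration.

```python
import math


def lif_rate(j: float, tau_rc: float = 1.0, tau_ref: float = 0.0) -> float:
    """Steady-state LIF firing rate for constant input j (threshold normalized to 1)."""
    if j <= 1.0:
        return 0.0  # below threshold the neuron never fires
    return 1.0 / (tau_ref + tau_rc * math.log1p(1.0 / (j - 1.0)))


def shifted_relu(j: float, tau_rc: float = 1.0) -> float:
    """Large-input linear asymptote of lif_rate when tau_ref is zero."""
    return max(0.0, (j - 0.5) / tau_rc)


if __name__ == "__main__":
    for j in [0.5, 1.0, 2.0, 5.0, 10.0, 50.0]:
        print(f"J={j:5.1f}  lif={lif_rate(j):8.4f}  relu-like={shifted_relu(j):8.4f}")
```

With no refractory period, the rate curve is zero below threshold and approaches a straight line of slope 1/tau_rc above it, so trained ReLU weights can plausibly be carried over once the neuron parameters are fixed. A nonzero `tau_ref` makes the curve saturate, which weakens the approximation.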
Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.
Partial Resilient Leader-Follower Consensus in Time-Varying Graphs
This work studies resilient leader-follower consensus with a bounded number of adversaries. Existing approaches typically require robustness conditions of the entire network to guarantee resilient consensus. However, the behavior of such systems when these conditions are not fully met remains unexplored. To address this gap, we introduce the notion of partial leader-follower consensus, in which a subset of non-adversarial followers successfully tracks the leader's reference state despite insufficient robustness. We propose a novel distributed algorithm - the Bootstrap Percolation and Mean Subsequence Reduced (BP-MSR) algorithm - and establish sufficient conditions for individual followers to achieve consensus via the BP-MSR algorithm in arbitrary time-varying graphs. We validate our findings through simulations, demonstrating that our method guarantees partial leader-follower consensus, even when standard resilient consensus algorithms fail.
comment: 8 pages, 3 figures, Accepted to 2026 IEEE American Control Conference (ACC)
Testing BDI-based Multi-Agent Systems using Discrete Event Simulation AAMAS 2025
Multi-agent systems are designed to deal with open, distributed systems with unpredictable dynamics, which makes them inherently hard to test. The value of using simulation for this purpose is recognized in the literature, although achieving sufficient fidelity (i.e., the degree of similarity between the simulation and the real-world system) remains a challenging task. This is exacerbated when dealing with cognitive agent models, such as the Belief Desire Intention (BDI) model, where the agent codebase is not suitable to run unchanged in simulation environments, thus increasing the reality gap between the deployed and simulated systems. We argue that BDI developers should be able to test in simulation the same specification that will be later deployed, with no surrogate representations. Thus, in this paper, we discuss how the control flow of BDI agents can be mapped onto a Discrete Event Simulation (DES), showing that such integration is possible at different degrees of granularity. We substantiate our claims by producing an open-source prototype integration between two pre-existing tools (JaKtA and Alchemist), showing that it is possible to produce a simulation-based testing environment for distributed BDI agents, and that different granularities in mapping BDI agents over DESs may lead to different degrees of fidelity.
comment: Accepted to JAAMAS 2025
Benchmarking LLM-based agents for single-cell omics analysis
Background: The surge in single-cell omics data exposes limitations in traditional, manually defined analysis workflows. AI agents offer a paradigm shift, enabling adaptive planning, executable code generation, traceable decisions, and real-time knowledge fusion. However, the lack of a comprehensive benchmark critically hinders progress. Results: We introduce a novel benchmarking evaluation system to rigorously assess agent capabilities in single-cell omics analysis. This system comprises: a unified platform compatible with diverse agent frameworks and LLMs; multidimensional metrics assessing cognitive program synthesis, collaboration, execution efficiency, bioinformatics knowledge integration, and task completion quality; and 50 diverse real-world single-cell omics analysis tasks spanning multi-omics, species, and sequencing technologies. Our evaluation reveals that Grok3-beta achieves state-of-the-art performance among tested agent frameworks. Multi-agent frameworks significantly enhance collaboration and execution efficiency over single-agent approaches through specialized role division. Attribution analyses of agent capabilities identify that high-quality code generation is crucial for task success, and self-reflection has the most significant overall impact, followed by retrieval-augmented generation (RAG) and planning. Conclusions: This work highlights persistent challenges in code generation, long-context handling, and context-aware knowledge retrieval, providing a critical empirical foundation and best practices for developing robust AI agents in computational biology.
comment: please see clear figures in this version. 6 main figures; 13 supplementary figures
Policy Iteration for Two-Player General-Sum Stochastic Stackelberg Games ACML 2025
We address two-player general-sum stochastic Stackelberg games (SSGs), where the leader's policy is optimized considering the best-response follower whose policy is optimal for its reward under the leader. Existing policy gradient and value iteration approaches for SSGs do not guarantee monotone improvement in the leader's policy under the best-response follower. Consequently, their performance is not guaranteed when their limits are not stationary Stackelberg equilibria (SSEs), which do not necessarily exist. In this paper, we derive a policy improvement theorem for SSGs under the best-response follower and propose a novel policy iteration algorithm that guarantees monotone improvement in the leader's performance. Additionally, we introduce Pareto-optimality as an extended optimality of the SSE and prove that our method converges to the Pareto front when the leader is myopic.
comment: 29 pages. Accepted at ACML 2025. To appear in PMLR 304
AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
AI for Industrial Asset Lifecycle Management aims to automate complex operational workflows, such as condition monitoring and maintenance scheduling, to minimize system downtime. While traditional AI/ML approaches solve narrow tasks in isolation, Large Language Model (LLM) agents offer a next-generation opportunity for end-to-end automation. In this paper, we introduce AssetOpsBench, a unified framework for orchestrating and evaluating domain-specific agents for Industry 4.0. AssetOpsBench provides a multimodal ecosystem comprising a catalog of four domain-specific agents, a curated dataset of 140+ human-authored natural-language queries grounded in real industrial scenarios, and a simulated, CouchDB-backed IoT environment. We introduce an automated evaluation framework that uses three key metrics to analyze architectural trade-offs between the Tool-As-Agent and Plan-Executor paradigms, along with a systematic procedure for the automated discovery of emerging failure modes. The practical relevance of AssetOpsBench is demonstrated by its broad community adoption, with 250+ users and over 500 agents submitted to our public benchmarking platform, supporting reproducible and scalable research for real-world industrial operations. The code is accessible at https://github.com/IBM/AssetOpsBench .
comment: 25 pages, 18 figures
Incentivize Contribution and Learn Parameters Too: Federated Learning with Strategic Data Owners
Classical federated learning (FL) assumes that the clients have a limited amount of noisy data with which they voluntarily participate and contribute towards learning a global, more accurate model in a principled manner. The learning happens in a distributed fashion without sharing the data with the center. However, these methods do not consider the incentive of an agent for participating and contributing to the process, given that data collection and running a distributed algorithm is costly for the clients. The question of rationality of contribution has been asked recently in the literature and some results exist that consider this problem. This paper addresses the question of simultaneous parameter learning and incentivizing contribution in a truthful manner, which distinguishes it from the extant literature. Our first mechanism incentivizes each client to contribute to the FL process at a Nash equilibrium and simultaneously learn the model parameters. We also ensure that agents are incentivized to truthfully reveal information in the intermediate stages of the algorithm. However, this equilibrium outcome can be away from the optimal, where clients contribute with their full data and the algorithm learns the optimal parameters. We propose a second mechanism that enables the full data contribution along with optimal parameter learning. Large scale experiments with real (federated) datasets (CIFAR-10, FEMNIST, and Twitter) show that these algorithms converge quite fast in practice, yield good welfare guarantees and better model performance for all agents.
comment: 27 pages, under review
Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations
We have developed Aitomia - a platform powered by AI to assist in performing AI-driven atomistic and quantum chemical (QC) simulations. This evolving intelligent assistant platform is equipped with chatbots and AI agents to help experts and guide non-experts in setting up and running atomistic simulations, analyzing simulation results, and summarizing them for the user in both textual and graphical forms. Aitomia combines LLM-based agents with the MLatom platform to support AI-driven atomistic simulations as well as conventional quantum-chemical calculations, including DFT, semiempirical methods such as GFN2-xTB, and selected high-level wavefunction-based methods, through interfaces to widely used programs such as Gaussian, ORCA, PySCF, and xtb, covering tasks from ground-state and excited-state calculations to geometry optimization, thermochemistry, and spectra simulations. The multi-agent implementation enables autonomous execution of complex computational workflows, such as reaction enthalpy calculations. Aitomia was the first intelligent assistant publicly launched on cloud computing platforms for broad-scope atomistic simulations (Aitomistic Lab@XMU at https://atom.xmu.edu.cn and Aitomistic Hub at https://aitomistic.xyz). Aitomia lowers the barrier to performing atomistic simulations, thereby democratizing simulations and accelerating research and development in relevant fields.
Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $Θ(N)$, yielding sample complexity $\mathcal{O}(N/ε)$. We observe that many domains, including cloud computing, transportation, and power systems, have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that utilizes these analytical models to provide each agent with a noise-free gradient signal, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $Θ(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/ε)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale, from $N{=}5$ to $N{=}200$, directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
comment: 10 pages, 5 figures, 5 tables; plus 16 pages of appendices
Systems and Control (EESS)
Saddle Point Evasion via Curvature-Regularized Gradient Dynamics
Nonconvex optimization underlies many modern machine learning and control tasks, where saddle points pose the dominant obstacle to reliable convergence in high-dimensional settings. Escaping these saddle points deterministically and at a controllable rate remains an open challenge: gradient descent is blind to curvature, stochastic perturbation methods lack deterministic guarantees, and Newton-type approaches suffer from Hessian singularity. We present Curvature-Regularized Gradient Dynamics (CRGD), which augments the objective with a smooth penalty on the most negative Hessian eigenvalue, yielding an augmented cost that serves as an optimization Lyapunov function with user-selectable convergence rates to second-order stationary points. Numerical experiments on a nonconvex matrix factorization example confirm that CRGD escapes saddle points across all tested configurations, with escape time that decreases with the eigenvalue gap, in contrast to gradient descent, whose escape time grows inversely with the gap.
comment: This work has been submitted to the IEEE for possible publication. 6 pages, 3 figures
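The baseline behavior mentioned in the abstract, gradient descent escaping a saddle more slowly as the negative curvature shrinks, can be illustrated on a toy quadratic. The sketch below is not CRGD and not the paper's matrix factorization example; the objective, step size, starting point, and the function name `gd_escape_steps` are all assumptions chosen for illustration.

```python
def gd_escape_steps(gap, lr=0.1, y0=1e-6, radius=1.0, max_steps=1_000_000):
    """Count gradient-descent steps needed to leave a saddle at the origin.

    Toy objective (an assumption, not the paper's example):
        f(x, y) = 0.5 * x**2 - 0.5 * gap * y**2
    The Hessian eigenvalues are +1 and -gap, so the origin is a strict saddle.
    """
    x, y = 1.0, y0
    for step in range(1, max_steps + 1):
        # The gradient of f is (x, -gap * y).
        x -= lr * x
        y += lr * gap * y
        if abs(y) >= radius:
            return step
    return max_steps


if __name__ == "__main__":
    for gap in (1.0, 0.1, 0.01):
        print(f"gap={gap:<5} escape steps={gd_escape_steps(gap)}")
```

In this toy model the unstable coordinate grows by a factor of `1 + lr * gap` per step, so for small `lr * gap` the step count is roughly `log(radius / y0) / (lr * gap)`, which is the inverse dependence on the gap.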
Switching-Reference Voltage Control for Distribution Systems with AI-Training Data Centers
Large-scale AI training workloads in modern data centers exhibit rapid and periodic power fluctuations, which may induce significant voltage deviations in power distribution systems. Existing voltage regulation methods, such as droop control, are primarily designed for slowly varying loads and may therefore be ineffective in mitigating these fast fluctuations. In addition, repeated control actions can incur substantial cost. To address this challenge, this paper proposes a decentralized switching-reference voltage control framework that exploits the structured behavior of AI training workloads. We establish conditions for voltage convergence and characterize an effective reference design that aligns with the two dominant operating levels of the AI training workload. The switching rule for voltage references is implemented solely using local voltage measurements, enabling simple local implementation while significantly reducing control effort. Simulation studies demonstrate that the proposed method substantially reduces both voltage deviations and reactive control effort, while remaining compatible with internal data center control strategies without requiring extensive coordination.
Computational Concept of the Psyche
This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject's being in relation to stimuli from the external world, and intelligence as a decision-making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes the agent's needs, taking into account their biological or existential significance for the intelligent agent, along with the agent's sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.
comment: 19 pages, 5 figures
Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents
As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata. Lore requires no infrastructure beyond git, is queryable via a standalone CLI tool, and is discoverable by any agent capable of running shell commands. The paper formalizes the protocol, compares it against five competing approaches, stress-tests it against its strongest objections, and outlines an empirical validation path.
comment: 8 pages, 1 figure, 1 table. Preprint available at https://doi.org/10.5281/zenodo.19051840
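To make the idea of trailer-based decision records concrete, here is a minimal sketch of reading "Key: value" trailers from a commit message. The trailer keys in `EXAMPLE` are hypothetical stand-ins for the categories named in the abstract, not the protocol's actual vocabulary, and the parsing rule is a simplification of what Git's own `git interpret-trailers` command does.

```python
def parse_trailers(message):
    """Return (key, value) pairs from the trailer block of a commit message.

    Simplified rule (an assumption): the trailer block is the last
    blank-line-separated paragraph, and every line in it must look like
    "Key: value". Git's own parser handles more cases (e.g. folded values).
    """
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return []  # a subject line alone carries no trailers
    trailers = []
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(": ")
        if not sep or " " in key:
            return []  # the last paragraph is prose, not a trailer block
        trailers.append((key, value.strip()))
    return trailers


# Hypothetical commit message and trailer keys, for illustration only.
EXAMPLE = """Cap retry backoff at 30 seconds

Unbounded backoff stalled the nightly sync job.

Constraint: upstream API rate limit is 100 requests per minute
Rejected: fixed 5-second delay (too aggressive during outages)
Directive: do not remove the jitter term
Verified-By: integration test suite
"""

if __name__ == "__main__":
    for key, value in parse_trailers(EXAMPLE):
        print(f"{key}: {value}")
```

Because the record lives in the commit message itself, an agent with shell access could recover it with ordinary git tooling and no extra infrastructure, which is the property the abstract emphasizes.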
Spatial Characterization of Sub-Synchronous Oscillations Using Black-Box IBR Models
Power systems with high penetration of inverter-based resources (IBRs) are prone to sub-synchronous oscillations (SSO). The opaqueness of vendor-specific IBR models limits the ability to predict the severity and the spread of SSO. This paper demonstrates that black-box IBR models estimated through frequency-domain identification techniques, along with a dynamic network model, can replicate the actual oscillatory behavior. The estimated IBR models are validated against actual IBR models in a closed-loop multi-IBR test system through modal analysis by comparing closed-loop eigenvalues and participation factors. Furthermore, using output-observable right eigenvectors, spatial heatmaps are developed to visualize the spread and severity of dominant SSO modes. The case studies on the 11-bus and 39-bus test systems confirm that even with the estimated IBR models, the regions susceptible to SSO can be identified in IBR-dominated power systems.
comment: Accepted for IEEE PES General Meeting 2026, Montreal
Matched Filter-Based Molecule Source Localization in Advection-Diffusion-Driven Pipe Networks with Known Topology
Synthetic molecular communication (MC) has emerged as a powerful framework for modeling, analyzing, and designing communication systems where information is encoded into properties of molecules. Among the envisioned applications of MC is the localization of molecule sources in pipe networks (PNs) like the human cardiovascular system (CVS), sewage networks (SNs), and industrial plants. While existing algorithms mostly focus on simplified scenarios, in this paper, we propose the first framework for source localization in complex PNs with known topology, by leveraging the mixture of inverse Gaussians for hemodynamic transport (MIGHT) model as a closed-form representation for advection-diffusion-driven MC in PNs. We propose a matched filter (MF)-based approach to identify molecule sources under realistic conditions such as unknown release times, random numbers of released molecules, sensor noise, and limited sensor sampling rate. We apply the algorithm to localize a source of viral markers in a real-world SN and show that the proposed scheme outperforms randomly guessing sources even at low signal-to-noise ratios (SNRs) at the sensor and achieves error-free localization under favorable conditions, i.e., high SNRs and sampling rates. Furthermore, by identifying clusters of frequently confused sources, reliable cluster-level localization is possible at substantially lower SNRs and sampling rates.
comment: 8 pages, 6 figures; This paper has been submitted to the 13th ACM International Conference on Nanoscale Computing and Communication (ACM NanoCom 2026)
Unimodal self-oscillations and their sign-symmetry for discrete-time relay feedback systems with dead zone
This paper characterizes self-oscillations in discrete-time linear time-invariant (LTI) relay feedback systems with nonnegative dead zone. Specifically, we aim to establish existence criteria for unimodal self-oscillations, defined as periodic solutions where the output exhibits a single-peaked period. Assuming that the linear part of the system is stable, with a strictly monotonically decreasing impulse response on its infinite support, we propose a novel analytical framework based on the theory of total positivity to address this problem. We demonstrate that unimodal self-oscillations subject to mild variation-based constraints exist only if the number of positive and negative values of the system's loop gain coincides within a given strictly positive period, i.e., the self-oscillation is sign-symmetric. Building upon these findings, we derive conditions for the existence of such self-oscillations, establish tight bounds on their periods, and address the question of their uniqueness.
Mitigating Renewable-Induced Risks for Green and Conventional Ammonia Producers through Coordinated Production and Futures Trading
Renewable power-to-ammonia (ReP2A), which uses hydrogen produced from renewable electricity as feedstock, is a promising pathway for decarbonizing the energy, transportation, and chemical sectors. However, variability in renewable generation causes fluctuations in hydrogen supply and ammonia production, leading to revenue instability for both ReP2A producers and conventional fossil-based gray ammonia (GA) producers in the market. Existing studies mainly rely on engineering measures, such as production scheduling, to manage this risk, but their effectiveness is constrained by physical system limits. To address this challenge, this paper proposes a financial instrument termed \emph{renewable ammonia futures} and integrates it with production decisions to hedge ammonia output risk. Production and trading models are developed for both ReP2A and GA producers, with conditional value-at-risk (CVaR) used to represent risk preferences under uncertainty. A game-theoretic framework is established in which the two producers interact in coupled ammonia spot and futures markets, and a Nash bargaining mechanism coordinates their production and trading strategies. Case studies based on a real-world system show that introducing renewable ammonia futures increases the CVaR utilities of ReP2A and GA producers by 5.103% and 10.14%, respectively, improving profit stability under renewable uncertainty. Sensitivity analysis further confirms the effectiveness of the mechanism under different levels of renewable variability and capacity configurations.
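As a reminder of how CVaR is typically evaluated from scenarios, the following sketch computes a sample CVaR with the Rockafellar-Uryasev formula. It is a generic illustration, not the paper's production and trading model; it assumes equally weighted loss scenarios (for a producer, a loss could be taken as negative profit), and the function name `cvar` is hypothetical.

```python
def cvar(losses, alpha):
    """Sample CVaR at level alpha via the Rockafellar-Uryasev formula.

    CVaR_alpha(L) = min over t of  t + E[max(L - t, 0)] / (1 - alpha).
    For equally weighted samples the objective is piecewise linear and convex
    in t, so the minimum is attained at one of the sample values.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(losses)

    def objective(t):
        excess = sum(max(loss - t, 0.0) for loss in losses) / n
        return t + excess / (1.0 - alpha)

    return min(objective(t) for t in losses)


if __name__ == "__main__":
    scenario_losses = [float(v) for v in range(1, 11)]  # illustrative values
    print(cvar(scenario_losses, 0.8))
```

With ten equally likely losses 1, 2, ..., 10 and a level of 0.8, the result is the average of the two worst outcomes, 9.5 (up to floating-point rounding). The minimization form is what makes CVaR convenient inside an optimization model, since `t` can be treated as an extra decision variable.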
A superposition approach for the ISS Lyapunov-Krasovskii theorem with pointwise dissipation
We show that the existence of a Lyapunov-Krasovskii functional (LKF) with pointwise dissipation (i.e. dissipation in terms of the current solution norm) suffices for input-to-state stability, provided that uniform global stability can also be ensured using the same LKF. To this end, we develop a stability theory, in which the behavior of solutions is not assessed through the classical norm but rather through a specific LKF, which may provide significantly tighter estimates. We discuss the advantages of our approach by means of an example.
ReLU Barrier Functions for Nonlinear Systems with Constrained Control: A Union of Invariant Sets Approach
Certifying safety for nonlinear systems with polytopic input constraints is challenging because CBF synthesis must ensure control admissibility under saturation. We propose an approximation--verification pipeline that performs convex barrier synthesis on piecewise-affine (PWA) surrogates and certifies safety for the original nonlinear system via facet-wise verification. To reduce conservatism while preserving tractability, we use a two-slope Leaky ReLU surrogate for the extended class-$\mathcal{K}$ function $α(\cdot)$ and combine multiple certificates using a Union of Invariant Sets (UIS). Counterexamples are handled through local uncertainty updates. Simulations on pendulum and cart-pole systems with input saturation show larger certified invariant sets than linear-$α$ designs with tractable computation time.
comment: Accepted to ACC 2026
Encirclement Guaranteed Finite-Time Capture against Unknown Evader Strategies
We consider a pursuit-evasion scenario involving a group of pursuers and a single evader in a two-dimensional unbounded environment. The pursuers aim to capture the evader in finite time while ensuring the evader remains enclosed within the convex hull of their positions until capture, without knowledge of the evader's heading angle. Prior works have addressed the problem of encirclement and capture separately in different contexts. In this paper, we present a class of strategies for the pursuers that guarantee capture in finite time while maintaining encirclement, irrespective of the evader's strategy. Furthermore, we derive an upper bound on the time to capture. Numerical results highlight the effectiveness of the proposed framework against a range of evader strategies.
Mechanistic Foundations of Goal-Directed Control
Mechanistic interpretability has transformed the analysis of transformer circuits by decomposing model behavior into competing algorithms, identifying phase transitions during training, and deriving closed-form predictions for when and why strategies shift. However, this program has remained largely confined to sequence-prediction architectures, leaving embodied control systems without comparable mechanistic accounts. Here we extend this framework to sensorimotor-cognitive development, using infant motor learning as a model system. We show that foundational inductive biases give rise to causal control circuits, with learned gating mechanisms converging toward theoretically motivated uncertainty thresholds. The resulting dynamics reveal a clean phase transition in the arbitration gate whose commitment behavior is well described by a closed-form exponential moving-average surrogate. We identify context window k as the critical parameter governing circuit formation: below a minimum threshold (k$\leq$4) the arbitration mechanism cannot form; above it (k$\geq$8), gate confidence scales asymptotically as log k. A two-dimensional phase diagram further reveals task-demand-dependent route arbitration consistent with the prediction that prospective execution becomes advantageous only when prediction error remains within the task tolerance window. Together, these results provide a mechanistic account of how reactive and prospective control strategies emerge and compete during learning. More broadly, this work sharpens mechanistic accounts of cognitive development and provides principled guidance for the design of interpretable embodied agents.
comment: Submitted to the 7th International Conference on the Mathematics of Neuroscience and AI (Rome, June 2026)
Iterative Learning Control-Informed Reinforcement Learning for Batch Process Control
A significant limitation of Deep Reinforcement Learning (DRL) is the stochastic uncertainty in actions generated during exploration-exploitation, which poses substantial safety risks during both training and deployment. In industrial process control, the lack of formal stability and convergence guarantees further inhibits adoption of DRL methods by practitioners. Conversely, Iterative Learning Control (ILC) represents a well-established autonomous control methodology for repetitive systems, particularly in batch process optimization. ILC achieves desired control performance through iterative refinement of control laws, either between consecutive batches or within individual batches, to compensate for both repetitive and non-repetitive disturbances. This study introduces an Iterative Learning Control-Informed Reinforcement Learning (IL-CIRL) framework for training DRL controllers in dual-layer batch-to-batch and within-batch control architectures for batch processes. The proposed method incorporates Kalman filter-based state estimation within the iterative learning structure to guide DRL agents toward control policies that satisfy operational constraints and ensure stability guarantees. This approach enables the systematic design of DRL controllers for batch processes operating under multiple disturbance conditions.
Multi-Scale Control of Large Agent Populations: From Density Dynamics to Individual Actuation
We review a body of recent work by the author and collaborators on controlling the spatial organisation of large agent populations across multiple scales. A central theme is the systematic bridging of microscopic agent-level dynamics and macroscopic density descriptions, enabling control design at the most natural level of abstraction and subsequent translation across scales. We show how this multi-scale perspective provides a unified approach to both \emph{direct control}, where every agent is actuated, and \emph{indirect control}, where few leaders or herders steer a larger uncontrolled population. The review covers continuification-based control with robustness under limited sensing and decentralised implementation via distributed density estimation; leader--follower density regulation with dual-feedback stability guarantees and bio-inspired plasticity; optimal-transport methods for coverage control and macro-to-micro discretisation; nonreciprocal field theory for collective decision-making; mean-field control barrier functions for population-level safety; and hierarchical reinforcement learning for settings where closed-form solutions are intractable. Together, these results demonstrate the breadth and versatility of a multi-scale control framework that integrates analytical methods, learning, and physics-inspired approaches for large agent populations.
Data-Driven Robust Predictive Control with Interval Matrix Uncertainty Propagation
This paper presents a new data-driven robust predictive control law, for linear systems affected by unknown-but-bounded process disturbances. A sequence of input-state data is used to construct a suitable uncertainty representation based on interval matrices. Then, the effect of uncertainty along the prediction horizon is bounded through an operator leveraging matrix zonotopes. This yields a tube that is exploited within a variable-horizon optimal control problem, to guarantee robust satisfaction of state and input constraints. The resulting data-driven predictive control scheme is shown to be recursively feasible and practically stable. A numerical example shows that the proposed approach compares favorably to existing methods based on zonotopic tubes and is competitive with an approach combining set-membership system identification and model-based predictive control.
Chattering Reduction for a Second-Order Actuator via Dynamic Sliding Manifolds
We analyze actuator chattering in a scalar integrator system subject to second-order actuator dynamics with an unknown time constant and first-order sliding-mode control, using both a conventional static sliding manifold and a dynamic sliding manifold. Using the harmonic balance method, we prove that it is possible to adjust the parameters of the dynamic sliding manifold so as to reduce the amplitude of the chattering in comparison to the static manifold. The proof of concept is illustrated with an example.
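The harmonic balance argument for the static-manifold case can be sketched numerically. The loop model here is an assumption made for illustration: an ideal relay of amplitude `relay_gain` in feedback with $G(s) = 1/(s(τs+1)^2)$, i.e. an integrator in series with a critically damped second-order actuator. The abstract does not specify the actuator's exact form, and the dynamic manifold is not modeled.

```python
import math


def loop_tf(omega, tau):
    """G(j*omega) for the assumed loop G(s) = 1 / (s * (tau*s + 1)**2)."""
    s = complex(0.0, omega)
    return 1.0 / (s * (tau * s + 1.0) ** 2)


def predicted_chattering(tau, relay_gain=1.0):
    """Harmonic-balance estimate of chattering frequency and amplitude.

    An ideal relay of amplitude M has describing function N(A) = 4*M/(pi*A).
    The balance 1 + N(A) * G(j*omega) = 0 needs G(j*omega) real and negative,
    i.e. a phase of -180 degrees, which fixes omega; the gain then fixes A,
    the oscillation amplitude at the relay input.
    """
    # Phase of G is -90 deg - 2*atan(tau*omega); bisect for a total of -180 deg.
    lo, hi = 1e-9 / tau, 1e9 / tau
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # geometric bisection over many decades
        if 2.0 * math.atan(tau * mid) < math.pi / 2.0:
            lo = mid
        else:
            hi = mid
    omega = math.sqrt(lo * hi)
    amplitude = 4.0 * relay_gain * abs(loop_tf(omega, tau)) / math.pi
    return omega, amplitude


if __name__ == "__main__":
    print(predicted_chattering(tau=0.1))
```

For this assumed loop the balance has the closed form $ω = 1/τ$ and $A = 2Mτ/π$, so the predicted amplitude scales linearly with the actuator time constant.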
A System-Theoretic Approach to Hawkes Process Identification with Guaranteed Positivity and Stability
The Hawkes process models self-exciting event streams, requiring a strictly non-negative and stable stochastic intensity. Standard identification methods enforce these properties using non-negative causal bases, yielding conservative parameter constraints and severely ill-conditioned least-squares Gram matrices at higher model orders. To overcome this, we introduce a system-theoretic identification framework utilizing the sign-indefinite orthonormal Laguerre basis, which guarantees a well-conditioned asymptotic Gram matrix independent of model order. We formulate a constrained least-squares problem enforcing the necessary and sufficient conditions for positivity and stability. By constructing the empirical Gram matrix via a Lyapunov equation and representing the constraints through a sum-of-squares trace equivalence, the proposed estimator is efficiently computed via semidefinite programming.
comment: 7 pages, 2 figures
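The conditioning claim rests on the orthonormality of the Laguerre functions, which can be checked numerically. The sketch below uses the standard definition $φ_k(t) = \sqrt{2a}\, e^{-at} L_k(2at)$ and approximates the Gram matrix with a trapezoidal rule; the pole parameter, integration horizon, and step size are assumptions, and this is the deterministic Gram matrix of the basis rather than the paper's empirical, Lyapunov-equation-based construction.

```python
import math


def laguerre_poly(k, x):
    """Laguerre polynomial L_k(x) via the standard three-term recurrence."""
    prev, curr = 1.0, 1.0 - x
    if k == 0:
        return prev
    for n in range(1, k):
        prev, curr = curr, ((2 * n + 1 - x) * curr - n * prev) / (n + 1)
    return curr


def laguerre_fn(k, t, a=1.0):
    """Orthonormal Laguerre function on [0, inf) with pole parameter a > 0."""
    return math.sqrt(2.0 * a) * math.exp(-a * t) * laguerre_poly(k, 2.0 * a * t)


def gram_matrix(order, a=1.0, horizon=50.0, step=0.005):
    """Trapezoidal approximation of G[j][k] = integral of phi_j * phi_k."""
    n = int(round(horizon / step))
    gram = [[0.0] * order for _ in range(order)]
    for i in range(n + 1):
        t = i * step
        weight = step * (0.5 if i in (0, n) else 1.0)
        phi = [laguerre_fn(k, t, a) for k in range(order)]
        for j in range(order):
            for k in range(order):
                gram[j][k] += weight * phi[j] * phi[k]
    return gram


if __name__ == "__main__":
    for row in gram_matrix(4):
        print(["%.4f" % v for v in row])
```

The computed matrix is close to the identity regardless of the order, and functions with $k \geq 1$ take negative values, which is the sign-indefiniteness that a non-negative basis avoids at the cost of conditioning.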
Intelligent Control of Differential Drive Robots Subject to Unmodeled Dynamics with EKF-based State Estimation
Reliable control and state estimation of differential drive robots (DDR) operating in dynamic and uncertain environments remains a challenge, particularly when system dynamics are partially unknown and sensor measurements are prone to degradation. This work introduces a unified control and state estimation framework that combines a Lyapunov-based nonlinear controller and Adaptive Neural Networks (ANN) with Extended Kalman Filter (EKF)-based multi-sensor fusion. The proposed controller leverages the universal approximation property of neural networks to model unknown nonlinearities in real time. An online adaptation scheme updates the weights of the radial basis function (RBF), the architecture chosen for the ANN. The learned dynamics are integrated into a feedback linearization (FBL) control law, for which theoretical guarantees of closed-loop stability and asymptotic convergence in a trajectory-tracking task are established through a Lyapunov-like stability analysis. To ensure robust state estimation, the EKF fuses inertial measurement unit (IMU) and odometry from monocular, 2D-LiDAR and wheel encoders. The fused state estimate drives the intelligent controller, ensuring consistent performance even under drift, wheel slip, sensor noise and failure. Gazebo simulations and real-world experiments conducted with a DDR demonstrate the effectiveness of the approach, with improved velocity tracking performance and reductions in linear and angular velocity errors of up to $53.91\%$ and $29.0\%$, respectively, in comparison to the baseline FBL.
Transformers As Generalizable Optimal Controllers
We study whether optimal state-feedback laws for a family of heterogeneous Multiple-Input, Multiple-Output (MIMO) Linear Time-Invariant (LTI) systems can be captured by a single learned controller. We train one transformer policy on LQR-generated trajectories from systems with different state and input dimensions, using a shared representation with standardization, padding, dimension encoding, and masked loss. The policy maps recent state history to control actions without requiring plant matrices at inference time. Across a broad set of systems, it achieves empirically small sub-optimality relative to Linear Quadratic Regulator (LQR), remains stabilizing under moderate parameter perturbations, and benefits from lightweight fine-tuning on unseen systems. These results support transformer policies as practical approximators of near-optimal feedback laws over structured linear-system families.
comment: 6 pages
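For readers unfamiliar with the LQR expert that generates the training trajectories, the scalar case is small enough to sketch in full. This is a textbook discrete-time LQR for a single-state, single-input system, not the paper's MIMO data pipeline; the function names and the iteration count are assumptions.

```python
def scalar_dlqr(a, b, q, r, iterations=500):
    """Discrete-time LQR for x[k+1] = a*x[k] + b*u[k], cost sum(q*x^2 + r*u^2).

    Iterates the scalar Riccati recursion to a fixed point P and returns
    (K, P) with the optimal feedback u = -K * x.
    """
    p = q
    for _ in range(iterations):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = a * b * p / (r + b * b * p)
    return k, p


def rollout(a, b, k, x0, steps):
    """Generate a closed-loop (state, action) trajectory under u = -k*x."""
    x, data = x0, []
    for _ in range(steps):
        u = -k * x
        data.append((x, u))
        x = a * x + b * u
    return data


if __name__ == "__main__":
    gain, cost_to_go = scalar_dlqr(a=1.2, b=1.0, q=1.0, r=1.0)
    print(gain, cost_to_go, rollout(1.2, 1.0, gain, x0=1.0, steps=5))
```

Pairs produced by `rollout` are the kind of (state, action) supervision a learned policy could imitate; in the paper's setting such data would come from many systems of different dimensions, which is where padding and masking become necessary.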
Free Final Time Adaptive Mesh Covariance Steering via Sequential Convex Programming
In this paper we develop a sequential convex programming (SCP) framework for free-final-time covariance steering of nonlinear stochastic differential equations (SDEs) subject to both additive and multiplicative diffusion. We cast the free-final-time objective through a time-normalization and introduce per-interval time-dilation variables that induce an adaptive discretization mesh, enabling the simultaneous optimization of the control policy and the temporal grid. A central difficulty is that, under multiplicative noise, accurate covariance propagation within SCP requires retaining the first-order diffusion linearization and its coupling with time dilation. We therefore derive the exact local linear stochastic model (preserving the multiplicative structure) and introduce a tractable discretization that maintains the associated diffusion terms, after which each SCP subproblem is solved via conic/semidefinite covariance-steering relaxations with terminal moment constraints and state/control chance constraints. Numerical experiments on a nonlinear double-integrator with drag and velocity-dependent diffusion validate free-final-time minimization through adaptive time allocation and improved covariance accuracy relative to frozen-diffusion linearizations.
comment: Full-length version of paper submitted to L-CSS
Surgical Robot, Path Planning, Joint Space, Riemannian Manifolds
Robot-assisted minimally invasive surgery can reduce the surgeon's workload by autonomously guiding robotic forceps. Movement of the robot is restricted around a fixed insertion port. The robot often encounters angle limitations during operation. Also, the surface of the abdominal cavity is non-concave, making it computationally expensive to find the desired path. In this work, to solve these problems, we propose a method for path planning in joint space by transforming the position into a Riemannian manifold. An edge cost function is defined to search for a desired path in the joint space and reduce the range of motion of the joints. We found that the organ is mostly non-concave, making it easy to find the optimal path using the gradient descent method. Experimental results demonstrated that the proposed method reduces the range of joint angle movement compared to calculations in position space.
comment: 11 pages, 8 figures
Online Learning for Supervisory Switching Control
We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy the best controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible in a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to address these control-theoretic challenges. Our data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of historical states, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the most suitable controller in $\mathcal{O}(N \log N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.
The Geometry of Transmission Zeros in Distance-Based Formations
This letter presents a geometric input-output analysis of distance-based formation control, focusing on the phenomenon of steady-state signal blocking between actuator and sensor pairs. We characterize steady-state multivariable transmission zeros, where fully excited rigid-body and deformational modes destructively interfere at the measured output. By analyzing the DC gain transfer matrix of the linearized closed-loop dynamics, we prove that for connected, flexible frameworks, structural transmission zeros are strictly non-generic; the configuration-dependent cross-coupling required to induce them occupies a proper algebraic set of measure zero. However, because extracting actionable sensor-placement rules from these complex algebraic varieties is analytically intractable, we restrict our focus to infinitesimally rigid formations. For these baselines, we prove that the absence of internal flexes forces the zero-transmission condition to collapse into an explicit affine hyperplane defined by the actuator and the global formation geometry, which we term the spatial locus of transmission zeros. Finally, we introduce the global transmission polygon--a convex polytope constructed from the intersection of these loci. This construct provides a direct geometric synthesis rule for robust sensor allocation, guaranteeing full-rank steady-state transmission against arbitrary single-node excitations.
comment: 6 pages, 2 figures. Submitted to IEEE Control Systems Letters (L-CSS) and CDC 2026
Demand Response Under Stochastic, Price-Dependent User Behavior
This paper focuses on price-based residential demand response implemented through dynamic adjustments of electricity prices during DR events. It extends existing DR models to a stochastic framework in which customer response is represented by price-dependent random variables, leveraging models and tools from the theory of stochastic optimization with decision-dependent distributions. The inherent epistemic uncertainty in the customers' responses renders open-loop, model-based DR strategies impractical. To address this challenge, the paper proposes to employ stochastic, feedback-based pricing strategies to compensate for estimation errors and uncertainty in customer response. The paper then establishes theoretical results demonstrating the stability and near-optimality of the proposed approach and validates its effectiveness through numerical simulations.
Time-Transformation-Based Analysis of Systems with Periodic Delay via Perturbative Expansion
It is difficult to analyze the stability of systems with time-varying delays. One approach is to construct a time-transformation that converts the system into a form with a constant delay but with a time-varying scalar appearing in the system matrices. The stability of this transformed system can then be analyzed using methods to bound the effect of the time-varying scalar. One issue is that this transformation is non-unique and requires the solution of an Abel equation. A specific time-transformation typically must be computed numerically. We address this issue by computing an explicit, although approximate, time-transformation for systems where the delay has a constant plus small periodic term. We use a perturbative expansion to construct our explicit solutions. We provide a simple numerical example to illustrate the approach. We also demonstrate the use of this time-transformation to analyze stability of the system with this class of periodic delays.
Parameterization of Seed Functions for Equivalent Representations of Time-Varying Delay Systems
Abel's classic transformation shows that any well-posed system with time-varying delay is equivalent to a parameter-varying system with fixed delay. The existence of such a parameter-varying constant delay representation then simplifies the problems of stability analysis and optimal control. Unfortunately, the method for construction of such transformations has been ad-hoc -- requiring an iterative time-stepping approach to constructing the transformation beginning with a seed function subject to boundary-value constraints. Moreover, a poor choice of seed function often results in a constant delay representation with large time-variations in system parameters -- obviating the benefits of such a representation. In this paper, we show how the set of all feasible seed functions can be parameterized using a basis for $L_2$. This parameterization is then used to search for seed functions for which the corresponding time-transformation results in smaller parameter variation. The parameterization of admissible seed functions is illustrated with numerical examples that contrast how well-chosen and poorly chosen seed functions affect the boundedness of a time transformation.
Fast Relax-and-Round Unit Commitment with Economic Horizons
We expand our novel computational method for unit commitment (UC) to include long-horizon planning. We introduce a fast novel algorithm to commit hydro-generators, provably accurately. We solve problems with thousands of generators at 5-minute market intervals. We show that our method can solve interconnect-size UC problems in approximately 1 minute on commodity hardware and that an increased planning horizon leads to sizable operational cost savings (our objective). This scale is infeasible for current state-of-the-art tools. We attain this runtime improvement by introducing a heuristic tailored for UC problems. Our method can be implemented using existing continuous optimization solvers and adapted for different applications. Combined, the two algorithms would allow an operator of large systems with hydro units to make horizon-aware economic decisions.
comment: 6 pages (journal limit), 6 figures
Adaptive Tube MPC: Beyond a Common Quadratically Stabilizing Feedback Gain
This paper proposes an adaptive tube framework for model predictive control (MPC) of discrete-time linear time-invariant systems subject to parametric uncertainty and additive disturbances. In contrast to conventional tube-based MPC schemes that employ fixed tube geometry and constraint tightening designed for worst-case uncertainty, the proposed approach incorporates online parameter learning to progressively refine the parametric uncertainty set and update the parameter estimates. These updates are used to adapt the components of the MPC optimization problem, including the prediction model, feedback gain, terminal set, and tube cross-sections. As the uncertainty set contracts, the required amount of constraint tightening reduces and the tube shrinks accordingly, yielding less conservative control actions. Recursive feasibility, robust constraint satisfaction, and closed-loop stability are formally established. Furthermore, the framework does not require the existence of a common quadratically stabilizing linear feedback gain for the entire parametric uncertainty set, thereby relaxing a standard assumption in existing tube-based MPC formulations. Numerical examples illustrate the effectiveness of the proposed approach.
Game-Theory-Assisted Reinforcement Learning for Border Defense: Early Termination based on Analytical Solutions
Game theory provides the gold standard for analyzing adversarial engagements, offering strong optimality guarantees. However, these guarantees often become brittle when assumptions such as perfect information are violated. Reinforcement learning (RL), by contrast, is adaptive but can be sample-inefficient in large, complex domains. This paper introduces a hybrid approach that leverages game-theoretic insights to improve RL training efficiency. We study a border defense game with limited perceptual range, where defender performance depends on both search and pursuit strategies, making classical differential game solutions inapplicable. Our method employs the Apollonius Circle (AC) to compute equilibrium in the post-detection phase, enabling early termination of RL episodes without learning pursuit dynamics. This allows RL to concentrate on learning search strategies while guaranteeing optimal continuation after detection. Across single- and multi-defender settings, this early termination method yields 10-20% higher rewards, faster convergence, and more efficient search trajectories. Extensive experiments validate these findings and demonstrate the overall effectiveness of our approach.
comment: 7 pages, ACC 2026
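The Apollonius Circle mentioned in the abstract above can be illustrated with a short sketch. It assumes simple motion (constant speeds, instantaneous turns), a planar engagement, and an evader slower than the pursuer; the function names and the straight-line border at `y = border_y` are hypothetical and are not taken from the paper.

```python
import math


def apollonius_circle(evader, pursuer, speed_ratio):
    """Return ((cx, cy), radius) of the Apollonius circle.

    speed_ratio = evader speed / pursuer speed, assumed to lie in (0, 1).
    The circle is the set of points X with |X - evader| = speed_ratio * |X - pursuer|;
    points strictly inside it are reached by the evader before the pursuer.
    """
    if not 0.0 < speed_ratio < 1.0:
        raise ValueError("speed_ratio must be in (0, 1)")
    ex, ey = evader
    px, py = pursuer
    g2 = speed_ratio ** 2
    cx = (ex - g2 * px) / (1.0 - g2)
    cy = (ey - g2 * py) / (1.0 - g2)
    radius = speed_ratio * math.hypot(ex - px, ey - py) / (1.0 - g2)
    return (cx, cy), radius


def evader_can_reach_border(evader, pursuer, speed_ratio, border_y):
    """True if the circle touches or crosses the horizontal line y = border_y."""
    (_, cy), radius = apollonius_circle(evader, pursuer, speed_ratio)
    return abs(cy - border_y) <= radius


if __name__ == "__main__":
    center, radius = apollonius_circle((0.0, 0.0), (3.0, 0.0), 0.5)
    print(center, radius)
```

Under these assumptions, a check of this kind is what would let an episode stop at detection time: the outcome of the pursuit phase follows from the geometry rather than from further simulation.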
Rethinking Frequency Control in Power Systems
Frequency control in power systems is implemented in a hierarchical structure traditionally known as primary frequency control (PFC), secondary frequency control (SFC), and tertiary control reserve (TCR); some jurisdictions include time error control (TEC) as well. This hierarchical structure was designed around a century ago based on timescale separation, that is, approximately an order of magnitude difference between each control structure. This paper argues, based on real-world observations as well as detailed dynamic simulations on a model of the All-Island power system (AIPS) of Ireland, that this frequency control structure is not necessary in current and future converter-dominated power grids. The paper proposes to redesign this structure by removing the SFC and TCR and relying on PFC and a real-time energy market. The PFC is responsible for addressing fast power imbalances in timescales of tens of ms to a few minutes (e.g., 100 ms to 5 minutes), while the real-time energy market is responsible for addressing longer imbalances in timescales of minutes to hours (e.g., 5 minutes to 1 hour). TEC, on the other hand, is considered optional.
Two-Phase Cell Switching in 6G vHetNets: Sleeping-Cell Load Estimation and Renewable-Aware Switching Toward NES
This paper proposes a two-phase framework to improve the sustainability in vertical heterogeneous networks that integrate various types of base stations (BSs), including terrestrial macro BSs (MBSs), small BSs (SBSs), and a high altitude platform station super MBS (HAPS SMBS). In Phase I, we address the critical and often overlooked challenge of estimating the traffic load of sleeping SBSs, a prerequisite for practical cell switching, by introducing three methods with varying data dependencies: (i) a distance-based estimator (no historical data), (ii) a multi-level clustering (MLC) estimator (limited historical data), and (iii) a long short-term memory (LSTM) based temporal predictor (full historical data). In Phase II, we incorporate the most accurate estimation results from Phase I into a renewable-energy-aware cell switching strategy, explicitly modeling solar-powered SBSs in three operational scenarios that reflect realistic hybrid grid-renewable deployments. This flexible design allows the framework to adapt switching strategies based on renewable availability and storage conditions, making it more practical and robust for real-world networks. Using a real call detail record dataset from Milan, simulation results show that the LSTM method achieves a mean absolute percentage error (MAPE) below 1% in Phase I, while in Phase II, the threshold-based solar integration scenario achieves up to 23% network energy saving (NES) relative to conventional cell switching. Overall, the proposed framework bridges the gap between theoretical cell switching models and practical, sustainable 6G radio access network (RAN) operation, enabling significant energy saving without compromising quality of service.
Reachability Analysis for Design Optimization
We present an approach to approximate reachable sets for linear systems with bounded L-infinity controls in finite time. Our first approach investigates the boundaries of these sets and reveals an exact characterization for single-input, planar systems with real, distinct eigenvalues. The second approach leverages convergence of the Lp-norms to L-infinity and uses Lp-norm reachable sets as an approximation of the L-infinity-norm reachable sets. Our optimal control results yield insights that make computational approximations of the Lp-norm reachable sets more tractable, and yield exact characterizations for L-infinity with the previous assumptions on the system. As an example, we incorporate our reachability analysis into the design optimization of a highly maneuverable aircraft. Introducing constraints based on reachability allows us to factor physical limitations to desired flight maneuvers into the design process.
comment: 7 pages, 3 figures, to be published in 2026 American Control Conference Proceedings
Solar Daylighting to Offset LED Lighting in Vertical Farming: A Techno-Economic Study of Light Pipes
Vertical farming is a controlled-environment agriculture (CEA) approach in which crops are grown in stacked layers under regulated climate and lighting, enabling predictable production but requiring high electricity input. This study quantifies the techno-economic impact of roof-mounted daylighting in a three-tier container vertical farm using a light-pipe (LP) system that delivers sunlight to the upper tier. The optical chain, comprising a straight duct and a tilting aluminum-coated mirror within a rotating dome, was modelled in Tonatiuh to estimate crop-level photon delivery and solar gains. These outputs were coupled with a transient AGRI-Energy model to perform year-round simulations for Dubai. Tier-3 strategies were compared against a fully LED benchmark, including daylight-only operation, on/off supplementation, PWM dimming, UV-IR filtering, variable-transmittance control, and simple glazing. Ray-tracing predicted an overall LP optical efficiency of 45%-75%, depending on solar position, quantifying the fraction of incident daylight at the collector aperture delivered to the target growing zone. Daylight-only operation reduced the total three-tier yield by 17% and was not economically viable despite 27-29% electricity savings. Hybrid daylight-LED strategies preserved benchmark yield while reducing electricity use. PWM dimming combined with UV-IR filtering achieved the lowest specific electricity energy consumption (6.32 kWh/kg), 14% below the benchmark. Overall, viability remains CAPEX-limited because achievable electricity savings are insufficient to offset the added investment and thus improves mainly under high electricity and carbon-price contexts, although the LP system delivers a 15-38% lower light cost than an optical-fiber reference under identical incident daylight.
Entropy-Aware Task Offloading in Mobile Edge Computing
Mobile Edge Computing (MEC) technology has been introduced to enable cloud computing at the edge of the network in order to help resource-limited mobile devices with time-sensitive data processing tasks. In this paradigm, mobile devices can offload their computationally heavy tasks to more efficient nearby MEC servers via wireless communication. Consequently, the main focus of research on the subject has been on the development of efficient offloading schemes, leaving the privacy of the mobile user out. While Blockchain technology is used as the trust mechanism for secure sharing of the data, the privacy issues induced by wireless communication, namely usage pattern and location privacy, are the centerpiece of this work. The effects of these privacy concerns on the task offloading Markov Decision Process (MDP) are addressed, and the MDP is solved using a Deep Recurrent Q-Network (DRQN). Numerical simulations are presented to show the effectiveness of the proposed method.
comment: 13 pages, submitted to Journal of Blockchain Research
Optimizing Task Completion Time Updates Using POMDPs
Managing announced task completion times is a fundamental control problem in project management. While extensive research exists on estimating task durations and task scheduling, the problem of when and how to update completion times communicated to stakeholders remains understudied. Organizations must balance announcement accuracy against the costs of frequent timeline updates, which can erode stakeholder trust and trigger costly replanning. Despite the prevalence of this problem, current approaches rely on static predictions or ad-hoc policies that fail to account for the sequential nature of announcement management. In this paper, we formulate the task announcement problem as a Partially Observable Markov Decision Process (POMDP) where the control policy must decide when to update announced completion times based on noisy observations of true task completion. Since most state variables (current time and previous announcements) are fully observable, we leverage the Mixed Observability MDP (MOMDP) framework to enable more efficient policy optimization. Our reward structure captures the dual costs of announcement errors and update frequency, enabling synthesis of optimal announcement control policies. Using off-the-shelf solvers, we generate policies that act as feedback controllers, adaptively managing announcements based on belief state evolution. Simulation results demonstrate significant improvements in both accuracy and announcement stability compared to baseline strategies, achieving up to 75% reduction in unnecessary updates while maintaining or improving prediction accuracy.
comment: 7 pages, 6 figures, submitted to American Control Conference 2026
On transferring safety certificates across dynamical systems
Control barrier functions (CBFs) provide a powerful tool for enforcing safety constraints in control systems, but their direct application to complex, high-dimensional dynamics is often challenging. In many settings, safety certificates are more naturally designed for simplified or alternative system models that do not exactly match the dynamics of interest. This paper addresses the problem of transferring safety guarantees between dynamical systems with mismatched dynamics. We propose a transferred control barrier function (tCBF) framework that enables safety constraints defined on one system to be systematically enforced on another system using a simulation function and an explicit margin term. The resulting transferred barrier accounts for model mismatch and induces a safety condition that can be enforced on the target system via a quadratic-program-based safety filter. The proposed approach is general and does not require the two systems to share the same state dimension or dynamics. We demonstrate the effectiveness of the framework on a quadrotor navigation task with the transferred barrier ensuring collision avoidance for the target system, while remaining minimally invasive to a nominal controller. These results highlight the potential of transferred control barrier functions as a general mechanism for enforcing safety across heterogeneous dynamical systems.
Quadratic Programming Approach to Flight Envelope Protection Using Control Barrier Functions
Ensuring the safe operation of aerospace systems within their prescribed flight envelope is a fundamental requirement for modern flight control systems. Flight envelope protection (FEP) prevents violations of aerodynamic, structural, and performance constraints, mitigating risks such as stall, excessive loads, and loss of control. Conventional FEP approaches, such as reference clipping via saturation functions and model-based command filtering, impose constraints at the reference input level but often fail to account for closed-loop system dynamics, potentially leading to constraint violations during transients. This paper introduces a new approach to flight envelope protection by employing a quadratic-programming-based safety filter using control barrier functions to dynamically enforce flight envelope constraints while preserving control performance. Unlike traditional reference filtering methods, the proposed control barrier function-based safety filter actively ensures forward invariance of the safe flight envelope set while seamlessly integrating with existing control architectures. The framework is implemented in a nonlinear missile flight control system and evaluated in a simulated environment. The results demonstrate its ability to prevent constraint violations while minimizing conservatism, offering a robust alternative to existing flight envelope protection methodologies.
comment: 26 pages, 12 figures, accepted for publication in the AIAA Journal of Guidance, Control, and Dynamics as an Engineering Note
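As a rough illustration of the quadratic-programming safety filter described above, the sketch below handles the special case of a single affine constraint, for which the QP has a closed-form solution. It assumes a control-affine model with barrier function h, so that the CBF condition reads a . u >= b with a = L_g h(x) and b = -(L_f h(x) + alpha * h(x)). The function name and the toy numbers are hypothetical, and the paper's missile application involves more constraints than this.

```python
def cbf_safety_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a . u >= b for one constraint.

    u_nom and a are equal-length lists of floats; b is a float.
    Returns the nominal input unchanged when it already satisfies the
    constraint, otherwise its projection onto the constraint boundary.
    """
    dot = sum(ai * ui for ai, ui in zip(a, u_nom))
    if dot >= b:
        return list(u_nom)  # nominal command is already safe
    norm2 = sum(ai * ai for ai in a)
    if norm2 == 0.0:
        raise ValueError("constraint cannot be satisfied by any input")
    scale = (b - dot) / norm2
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


if __name__ == "__main__":
    # Toy single integrator x' = u with h(x) = x_max - x and alpha = 1:
    # the CBF condition -u + (x_max - x) >= 0 gives a = [-1], b = -(x_max - x).
    x, x_max = 0.9, 1.0
    print(cbf_safety_filter([1.0], [-1.0], -(x_max - x)))
```

The filter leaves the command untouched whenever it is safe, which is the minimally invasive behavior the abstract refers to; with several simultaneous constraints a numerical QP solver would be needed instead of this projection.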
Optimization-Based Robust Permissive Synthesis for Interval MDPs
We present an optimization-based framework for robust permissive synthesis for Interval Markov Decision Processes (IMDPs), motivated by robotic decision-making under transition uncertainty. In many robotic systems, model inaccuracies and sensing noise lead to interval-valued transition probabilities. While robust IMDP synthesis typically yields a single policy and permissive synthesis assumes exact models, we show that robust permissive synthesis under interval uncertainty can be cast as a global mixed-integer linear program (MILP) that directly encodes robust Bellman constraints. The formulation maximizes a quantitative permissiveness metric (the number of enabled state-action pairs), while guaranteeing that every compliant strategy satisfies probabilistic reachability or expected reward specifications under all admissible transition realizations. To address the exponential complexity of vertex-based uncertainty representations, we derive a dualization-based encoding that eliminates explicit vertex enumeration and scales linearly with the number of successors. Experimental evaluation on four representative robotic benchmark domains demonstrates scalability to IMDPs with hundreds of thousands of states. The proposed framework provides a practical and general foundation for uncertainty-aware, flexibility-preserving controller synthesis in robotic systems.
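The robust Bellman constraints mentioned above rest on an inner problem: for a fixed state-action pair, find the admissible transition distribution that minimizes the expected successor value. The sketch below solves that inner problem with the standard greedy ordering argument. It is not the dualization-based MILP encoding from the paper, and the function name is hypothetical.

```python
def worst_case_expectation(values, lower, upper):
    """Minimize sum_i p[i] * values[i] over lower <= p <= upper with sum(p) == 1.

    values, lower, upper are equal-length lists of floats.
    Returns (worst-case expectation, minimizing distribution).
    """
    if sum(lower) > 1.0 + 1e-12 or sum(upper) < 1.0 - 1e-12:
        raise ValueError("interval bounds admit no probability distribution")
    p = list(lower)                      # every successor gets its lower bound
    remaining = 1.0 - sum(lower)         # mass still to be distributed
    # Give the leftover mass to the lowest-value successors first.
    for i in sorted(range(len(values)), key=lambda j: values[j]):
        add = min(upper[i] - lower[i], remaining)
        p[i] += add
        remaining -= add
    return sum(pi * vi for pi, vi in zip(p, values)), p


if __name__ == "__main__":
    print(worst_case_expectation([0.0, 1.0], [0.2, 0.3], [0.7, 0.8]))
```

Because the minimizer is found by sorting, the cost per state-action pair grows with the number of successors rather than with the number of vertices of the interval polytope, which is the same scaling concern that motivates the paper's vertex-free encoding.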
Frequency-Aware Sparse Optimization for Diagnosing Grid Instabilities and Collapses
This paper aims to proactively diagnose and manage frequency instability risks from a steady-state perspective, without the need for derivative-dependent transient modeling. Specifically, we jointly address two questions: (Q1) Survivability: following a disturbance and the subsequent primary frequency response, can the system settle into a healthy steady state (feasible with an acceptable frequency deviation $Δf$)? (Q2) Dominant Vulnerability: if found unstable, what critical vulnerabilities create instability and/or full collapse? To address these questions, we first augment steady-state power flow states to include frequency-dependent governor relationships (i.e., governor power flow). Afterwards, we propose a frequency-aware sparse optimization that finds the minimal set of bus locations with measurable compensations (corrective actions) to enforce power balance and maintain frequency within predefined/acceptable bounds. We evaluate our method on standard transmission systems to empirically validate its ability to localize dominant sources of vulnerabilities. For a large 1354-bus system, our method detects compensations at only four buses under an N-1 generation outage (3424.8 MW) while enforcing a maximum allowable steady-state frequency drop of 0.06 Hz (otherwise, frequency drops by nearly 0.08 Hz). We further validate the scalability of our method, which requires less than four minutes to obtain sparse solutions for the 1354-bus system.
comment: 5 pages, 7 figures, manuscript has been accepted by PESGM 2026
Efficient Input-Constrained Impulsive Optimal Control of Linear Systems with Application to Spacecraft Relative Motion
This work presents a novel algorithm for impulsive optimal control of linear time-varying systems with the inclusion of input magnitude constraints. Impulsive optimal control problems, where the optimal input solution is a sum of delta functions, are typically formulated as an optimization over a normed function space subject to integral equality constraints and can be efficiently solved for linear time-varying systems in their dual formulation. In this dual setting, the problem takes the form of a semi-infinite program which is readily solvable in online scenarios for constructing maneuver plans. This work augments the approach with the inclusion of magnitude constraints on the input over time windows of interest, which is shown to preserve the impulsive nature of the optimal solution and enable efficient solution procedures via semi-infinite programming. The resulting algorithm is demonstrated on the highly relevant problem of relative motion control of spacecraft in Low Earth Orbit (LEO).
Lightweight 3D LiDAR-Based UAV Tracking: An Adaptive Extended Kalman Filtering Approach
Accurate relative positioning is crucial for swarm aerial robotics, enabling coordinated flight and collision avoidance. Although vision-based tracking has been extensively studied, 3D LiDAR-based methods remain underutilized despite their robustness under varying lighting conditions. Existing systems often rely on bulky, power-intensive sensors, making them impractical for small UAVs with strict payload and energy constraints. This paper presents a lightweight LiDAR-based UAV tracking system incorporating an Adaptive Extended Kalman Filter (AEKF) framework. Our approach effectively addresses the challenges posed by sparse, noisy, and nonuniform point cloud data generated by non-repetitive scanning 3D LiDARs, ensuring reliable tracking while remaining suitable for small drones with strict payload constraints. Unlike conventional filtering techniques, the proposed method dynamically adjusts the noise covariance matrices using innovation and residual statistics, thereby enhancing tracking accuracy under real-world conditions. Additionally, a recovery mechanism ensures continuity of tracking during temporary detection failures caused by scattered LiDAR returns or occlusions. Experimental validation was performed using a Livox Mid-360 LiDAR mounted on a DJI F550 UAV in real-world flight scenarios. The proposed method demonstrated robust UAV tracking performance under sparse LiDAR returns and intermittent detections, consistently outperforming both standard Kalman filtering and particle filtering approaches during aggressive maneuvers. These results confirm that the framework enables reliable relative positioning in GPS-denied environments without the need for multi-sensor arrays or external infrastructure.
comment: Presented at the 19th International Conference on Intelligent Autonomous Systems, IAS-19, Genoa, Italy, June 30 to July 4, 2025. To appear in the Springer post-proceedings of the conference
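To make the idea of adapting noise covariances from innovation and residual statistics concrete, here is a scalar, linear sketch using one common forgetting-factor scheme. It is an assumption-laden illustration, not the paper's AEKF: the state is one-dimensional, the models `F` and `H` are constants, and the function name and the default `alpha` are hypothetical.

```python
def adaptive_kf_step(x, P, z, Q, R, alpha=0.95, F=1.0, H=1.0):
    """One predict/update step of a scalar Kalman filter with adaptive Q and R.

    x, P : prior state estimate and variance
    z    : new measurement
    Q, R : current process and measurement noise variances
    alpha: forgetting factor in (0, 1); closer to 1 means slower adaptation
    Returns (x_new, P_new, Q_new, R_new).
    """
    # Prediction
    x_pred = F * x
    P_pred = F * P * F + Q
    # Innovation (pre-fit error) and Kalman gain
    innovation = z - H * x_pred
    S = H * P_pred * H + R
    K = P_pred * H / S
    # Measurement update
    x_new = x_pred + K * innovation
    P_new = (1.0 - K * H) * P_pred
    # Residual (post-fit error) drives the measurement-noise estimate
    residual = z - H * x_new
    R_new = alpha * R + (1.0 - alpha) * (residual * residual + H * P_new * H)
    # Innovation drives the process-noise estimate
    Q_new = alpha * Q + (1.0 - alpha) * (K * innovation * innovation * K)
    return x_new, P_new, Q_new, R_new


if __name__ == "__main__":
    state = (0.0, 1.0, 0.01, 0.1)  # x, P, Q, R
    for z in [1.0, 1.1, 0.9, 1.05]:
        x, P, Q, R = adaptive_kf_step(state[0], state[1], z, state[2], state[3])
        state = (x, P, Q, R)
    print(state)
```

In the matrix case the squared terms become outer products and `F`, `H` become Jacobians of the motion and measurement models; both variance updates stay non-negative by construction.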
Decentralized CBF-based Safety Filters for Collision Avoidance of Cooperative Missile Systems with Input Constraints
This paper presents a decentralized safety filter for collision avoidance in multi-agent aerospace interception scenarios. The approach leverages robust control barrier functions (RCBFs) to guarantee forward invariance of safety sets under bounded inputs and high-relative-degree dynamics. Each effector executes its nominal cooperative guidance command, while a local quadratic program (QP) modifies the input only when necessary. Event-triggered activation based on range and zero-effort miss (ZEM) criteria ensures scalability by restricting active constraints to relevant neighbors. To resolve feasibility issues from simultaneous constraints, a slack-variable relaxation scheme is introduced that prioritizes critical agents in a Pareto-optimal manner. Simulation results in many-on-many interception scenarios demonstrate that the proposed framework maintains collision-free operation with minimal deviation from nominal guidance, providing a computationally efficient and scalable solution for safety-critical multi-agent aerospace systems.
comment: 7 pages, 5 figures, accepted for presentation at the 2026 American Control Conference (ACC 2026)
Partial Resilient Leader-Follower Consensus in Time-Varying Graphs
This work studies resilient leader-follower consensus with a bounded number of adversaries. Existing approaches typically require robustness conditions of the entire network to guarantee resilient consensus. However, the behavior of such systems when these conditions are not fully met remains unexplored. To address this gap, we introduce the notion of partial leader-follower consensus, in which a subset of non-adversarial followers successfully tracks the leader's reference state despite insufficient robustness. We propose a novel distributed algorithm - the Bootstrap Percolation and Mean Subsequence Reduced (BP-MSR) algorithm - and establish sufficient conditions for individual followers to achieve consensus via the BP-MSR algorithm in arbitrary time-varying graphs. We validate our findings through simulations, demonstrating that our method guarantees partial leader-follower consensus, even when standard resilient consensus algorithms fail.
comment: 8 pages, 3 figures, Accepted to 2026 IEEE American Control Conference (ACC)
Pareto-Optimal Sampling and Resource Allocation for Timely Communication in Shared-Spectrum Low-Altitude Networks
Guaranteeing stringent data freshness for low-altitude unmanned aerial vehicles (UAVs) in shared spectrum forces a critical trade-off between two operational costs: the UAV's own energy consumption and the occupation of terrestrial channel resources. The core challenge is to satisfy the aerial data freshness while finding a Pareto-optimal balance between these costs. Leveraging predictive channel models and predictive UAV trajectories, we formulate a bi-objective Pareto optimization problem over a long-term planning horizon to jointly optimize the sampling timing for aerial traffic and the power and spectrum allocation for fair coexistence. However, the problem's non-convex, mixed-integer nature renders classical methods incapable of fully characterizing the complete Pareto frontier. Notably, we show monotonicity properties of the frontier, building on which we transform the bi-objective problem into several single-objective problems. We then propose a new graph-based algorithm and prove that it can find the complete set of Pareto optima with low complexity, linear in the horizon and near-quadratic in the resource block (RB) budget. Numerical comparisons show that our approach meets the stringent timeliness requirement and achieves a six-fold reduction in RB utilization or a 6 dB energy saving compared to benchmarks.
Barrier-Riccati Synthesis for Nonlinear Safe Control with Expanded Region of Attraction
We present a Riccati-based framework for safety-critical nonlinear control that integrates the barrier states (BaS) methodology with the State-Dependent Riccati Equation (SDRE) approach. The BaS formulation embeds safety constraints into the system dynamics via auxiliary states, enabling safety to be treated as a control objective. To overcome the limited region of attraction in linear BaS controllers, we extend the framework to nonlinear systems using SDRE synthesis applied to the barrier-augmented dynamics and derive a matrix inequality condition that certifies forward invariance of a large region of attraction and guarantees asymptotic safe stabilization. The resulting controller is computed online via pointwise Riccati solutions. We validate the method on an unstable constrained system and cluttered quadrotor navigation tasks, demonstrating improved constraint handling, scalability, and robustness near safety boundaries. This framework offers a principled and computationally tractable solution for synthesizing nonlinear safe feedback in safety-critical environments.
comment: This work has been accepted for publication in the proceedings of the 2026 American Control Conference (ACC), New Orleans, Louisiana, USA
Conservative Bias Linear Power Flow Approximations: Application to Unit Commitment
Accurate modeling of power flow behavior is essential for a wide range of power system applications, yet the nonlinear and nonconvex structure of the underlying equations often limits their direct use in large-scale optimization problems. As a result, linear models are frequently adopted to improve computational tractability, though these simplifications can introduce excessive approximation error or lead to constraint violations. This paper presents a linear approximation framework, referred to as Conservative Bias Linear Approximations (CBLA), that systematically incorporates conservativeness into the approximation process. Rather than solely minimizing local linearization error, CBLA constructs linear constraints that bound the nonlinear functions of interest over a defined operating region while reducing overall approximation bias. The proposed approach maintains the simplicity of linear formulations and allows the approximation to be shaped through user-defined loss functions tailored to specific system quantities. Numerical studies demonstrate that CBLA provides more reliable and accurate approximations than conventional linearization techniques, and its integration into a unit commitment formulation results in improved feasibility and reduced operating costs.
comment: The conference version is published in P. Buason, S. Misra and D. K. Molzahn, "Sample-Based Conservative Bias Linear Power Flow Approximations," 2024 IEEE/IAS Industrial and Commercial Power System Asia (I&CPS Asia), Pattaya, Thailand, 2024, pp. 1-6, doi: 10.1109/ICPSAsia61913.2024.10761778
Path planning with moving obstacles using stochastic optimal control
Navigating a collision-free and optimal trajectory for a robot is a challenging task, particularly in environments with moving obstacles such as humans. We formulate this problem as a stochastic optimal control problem. Since solving the full problem is computationally demanding, we introduce a tractable approximation whose Bellman equation can be solved efficiently. The resulting value function is then incorporated as a terminal penalty in an online rollout framework. We construct a trade-off curve between safety and performance to identify an appropriate weighting between them, and compare the performance with other methods. Simulation results show that the proposed rollout approach can be tuned to reach the target in nearly the same expected time as receding horizon $A^\star$ while maintaining a larger expected minimum distance to the moving obstacle. The results also show that the proposed method outperforms the considered CBF-based methods when a larger obstacle clearance is desired, while achieving comparable performance otherwise.
comment: 10 pages, 6 figures. Submitted to the 15th Asian Control Conference (ASCC) 2026
Topology optimization of nonlinear forced response curves via reduction on spectral submanifolds
Forced response curves (FRCs) of nonlinear systems can exhibit complex behaviors, including hardening/softening behavior and bifurcations. Although topology optimization holds great potential for tuning these nonlinear dynamic responses, its use in high-dimensional systems is limited by the high cost of repeated response and sensitivity analyses. To address this challenge, we employ the spectral submanifolds (SSMs) reduction theory, which reformulates the periodic response as the equilibria of an associated reduced-order model (ROM). This enables efficient and analytic evaluation of both response amplitudes and their sensitivities. Based on the SSM-based ROM, we formulate optimization problems that optimize the peak amplitude, the hardening/softening behavior, and the distance between two saddle-node bifurcations for an FRC. The proposed method is applied to the design of nonlinear MEMS devices, achieving targeted performance optimization. This framework provides a practical and efficient strategy for incorporating nonlinear dynamic effects into the topology optimization of structures.
comment: 33 pages, 23 figures. Submitted to Nonlinear Dynamics
Dual-Laws Model for a theory of artificial consciousness
Objectively verifying the generative mechanism of consciousness is extremely difficult because of its subjective nature. As long as theories of consciousness focus solely on its generative mechanism, developing a theory remains challenging. We believe that broadening the theoretical scope and enhancing theoretical unification are necessary to establish a theory of consciousness. This study proposes seven questions that theories of consciousness should address: phenomena, self, causation, state, function, contents, and universality. The questions were designed to examine the functional aspects of consciousness and its applicability to system design. Next, we will examine how our proposed Dual-Laws Model (DLM) can address these questions. Based on our theory, we anticipate two unique features of a conscious system: autonomy in constructing its own goals and cognitive decoupling from external stimuli. We contend that systems with these capabilities differ fundamentally from machines that merely follow human instructions. This makes a design theory that enables high moral behavior indispensable.
Vector-field guided constraint-following control for path following of uncertain mechanical systems
This note proposes a general control approach, called vector-field guided constraint-following control, to solve the dynamics control problem of geometric path-following for a class of uncertain mechanical systems. More specifically, it operates at the dynamics level and can handle both fully-actuated and underactuated mechanical systems, heterogeneous (possibly fast) time-varying uncertainties with unknown bounds, and geometric desired paths that may be self-intersecting. Simulations are conducted to demonstrate the effectiveness of the approach.
Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors
Nonlinear Probabilistic Latent Variable Models (NPLVMs) are a cornerstone of soft sensor modeling due to their capacity for uncertainty delineation. However, conventional NPLVMs are trained using amortized variational inference, where neural networks parameterize the variational posterior. While facilitating model implementation, this parameterization converts the distributional optimization problem within an infinite-dimensional function space to parameter optimization within a finite-dimensional parameter space, which introduces an approximation error gap, thereby degrading soft sensor modeling accuracy. To alleviate this issue, we introduce KProxNPLVM, a novel NPLVM that pivots to relaxing the objective itself and improving the NPLVM's performance. Specifically, we first prove that the conventional approach induces an approximation error. Based on this, we design the Wasserstein distance as the proximal operator to relax the learning objective, yielding a new variational inference strategy derived from solving this relaxed optimization problem. Building on this foundation, we provide a rigorous derivation of KProxNPLVM's optimization implementation, prove that our algorithm converges and thereby sidesteps the approximation error, and summarize these results as the proposed KProxNPLVM. Finally, extensive experiments on synthetic and real-world industrial datasets are conducted to demonstrate the efficacy of the proposed KProxNPLVM.
comment: This paper has been provisionally accepted for publication in the "IEEE Transactions on Industrial Informatics"
Comprehensive Deadlock Prevention for GPU Collective Communication
Distributed deep neural network training necessitates efficient GPU collective communications, which are inherently susceptible to deadlocks. GPU collective deadlocks arise easily in distributed deep learning applications when multiple collectives circularly wait for each other. GPU collective deadlocks pose a significant challenge to the correct functioning and efficiency of distributed deep learning, and no general effective solutions are currently available. Ad-hoc methods, which make an application invoke collectives in a consistent order across GPUs, can prevent circular collective dependency and deadlocks only in specific scenarios. This paper presents DFCCL, a novel GPU collective communication library that provides a comprehensive approach for GPU collective deadlock prevention while maintaining high performance. DFCCL achieves preemption for GPU collectives at the bottom library level, effectively preventing deadlocks even if applications cause circular collective dependency. DFCCL ensures high performance with its execution and scheduling methods for collectives. Experiments show that DFCCL effectively prevents GPU collective deadlocks in various situations. Moreover, extensive evaluations demonstrate that DFCCL delivers performance comparable to or superior to NCCL, the state-of-the-art collective communication library highly optimized for NVIDIA GPUs.
Pose Estimation of a Thruster-Driven Bioinspired Multi-Link Robot
This work demonstrates simultaneous pose (position and orientation) and shape estimation for a free-floating, bioinspired multi-link robot with unactuated joints, link-mounted thrusters for control, and a single gyroscope per link, resulting in an underactuated, minimally sensed platform. Because the inter-link joint angles are constrained, translation and rotation of the multi-link system requires cyclic, reciprocating actuation of the thrusters, referred to as a gait. Through a proof-of-concept hardware experiment and offline analysis, we show that the robot's shape can be reliably estimated using an Unscented Kalman Filter augmented with Gaussian process residual models to compensate for non-zero-mean, non-Gaussian noise, while the pose exhibits drift expected from gyroscope integration in the absence of absolute position measurements. Experimental results demonstrate that a Gaussian process model trained on a multi-gait dataset (forward, backward, left, right, and turning) performs comparably to one trained exclusively on forward-gait data, revealing an overlap in the gait input space, which can be exploited to reduce per-gait training data requirements while enhancing the filter's generalizability across multiple gaits. Lastly, we introduce a heuristic derived from the observability Gramian to correlate joint angle estimate quality with gait periodicity and thruster inputs, highlighting how control affects estimation quality.
comment: 8 pages, 8 figures
Machine Learning-assisted Dynamics-Constrained Day-Ahead Energy Scheduling
The rapid expansion of inverter-based resources, such as wind and solar power plants, will significantly diminish the presence of conventional synchronous generators in future power grids with rich renewable energy sources. This transition introduces increased complexity and reduces dynamic stability in system operation and control, with low inertia being a widely recognized challenge. However, the literature has not thoroughly explored grid dynamic performance associated with energy scheduling solutions that traditionally only consider grid steady-state constraints. This paper will bridge the gap by enforcing grid dynamic constraints when conducting optimal energy scheduling; particularly, this paper explores locational post-contingency rate of change of frequency (RoCoF) requirements to accommodate substantial inertia reductions. This paper introduces a machine learning-assisted RoCoF-constrained unit commitment (ML-RCUC) model designed to ensure RoCoF stability after the most severe generator outage while maintaining operational efficiency. A graph-informed NN (GINN)-based RoCoF predictor is first trained on a high-fidelity simulation dataset to track the highest locational RoCoF, which is then reformulated as mixed-integer linear programming constraints that are integrated into the unit commitment model. Case studies, by solving the optimization problem ML-RCUC and validating its solutions with time-domain simulations, demonstrate that the proposed method can ensure locational RoCoF stability with minimum conservativeness.
Inertia-Constrained Generation Scheduling: Sample Selection, Learning-Embedded Optimization Modeling, and Computational Enhancement
Day-ahead generation scheduling is typically conducted by solving the security-constrained unit commitment (SCUC) problem. However, with the fast growth of inverter-based resources, grid inertia has been dramatically reduced, compromising the dynamic stability of the system. Traditional SCUC (T-SCUC), without any inertia requirements, may no longer be effective for renewables-dominated grids. To address this, we propose the active linearized sparse neural network-embedded SCUC (ALSNN-SCUC) model, utilizing machine learning (ML) to incorporate system dynamic performance. A multi-output deep neural network (DNN) model is trained offline on strategically-selected data samples to accurately predict frequency stability metrics: locational RoCoF and frequency nadir. Structured sparsity and active ReLU linearization are implemented to prune redundant DNN neurons, significantly reducing its size while ensuring prediction accuracy even at high sparsity levels. By embedding this ML-based frequency stability predictor into SCUC as constraints, the proposed ALSNN-SCUC model minimizes its computational complexity while ensuring frequency stability following a G-1 contingency. Case studies show that the proposed ALSNN-SCUC can enforce pre-specified frequency requirements without being overly conservative, outperforming five benchmark models including T-SCUC, two physics-based SCUC, and two ML-based SCUC. The proposed sparsification and active linearization strategies can reduce the DNN-SCUC computing time by over 95% for both IEEE 24-bus and 118-bus systems, demonstrating the effectiveness and scalability of the proposed ALSNN-SCUC model.
A Forward Reachability Perspective on Control Barrier Functions and Discount Factors in Reachability Analysis
Control invariant sets are crucial for various methods that aim to design safe control policies for systems whose state constraints must be satisfied over an indefinite time horizon. In this article, we explore the connections among reachability, control invariance, and Control Barrier Functions (CBFs). Unlike prior formulations based on backward reachability concepts, we establish a strong link between these three concepts by examining the inevitable Forward Reachable Tube (FRT), which is the set of states such that every trajectory reaching the FRT must have passed through a given initial set of states. First, our findings show that the inevitable FRT is a robust control invariant set if it has a continuously differentiable boundary. If the boundary is not differentiable, the FRT may lose invariance. We also show that any robust control invariant set including the initial set is a superset of the FRT if the boundary of the invariant set is differentiable. Next, we formulate a differential game between the control and disturbance, where the inevitable FRT is characterized by the zero-superlevel set of the value function. By incorporating a discount factor in the cost function of the game, the barrier constraint of the CBF naturally arises in the Hamilton-Jacobi (HJ) equation and determines the optimal policy. The resulting FRT value function serves as a CBF-like function, and conversely, any valid CBF is also a forward reachability value function. We further prove that any $C^1$ supersolution of the HJ equation for the FRT value functions is a valid CBF and characterizes a robust control invariant set that outer-approximates the FRT. Building on this property, finally, we devise a novel method that learns neural control barrier functions, which learn a control invariant superset of the FRT of a given initial set.
comment: The first two authors contributed equally to this work
Performance of the Kalman Filter and Smoother for Benchmark Studies
We propose analytical mean square error (MSE) expressions for the Kalman filter (KF) and the Kalman smoother (KS) for benchmark studies, where the true system dynamics are unknown or unavailable to the estimator. In such cases, as in benchmark evaluations for target tracking, the analysis relies on deterministic state trajectories. This setting introduces a model mismatch between the estimator and the true system, causing the covariance estimates to no longer reflect the actual estimation errors. To enable accurate performance prediction for deterministic state trajectories without relying on computationally intensive Monte Carlo simulations, we derive recursive MSE expressions with linear time complexity. The proposed framework also accounts for measurement model mismatch and provides an efficient tool for performance evaluation in benchmark studies involving long trajectories. Simulation results confirm the accuracy and computational efficiency of the proposed method.
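To make the idea of a recursive MSE along a deterministic trajectory concrete, the following is a minimal sketch and not the paper's actual expressions. It assumes a linear filter model `F`, `H`, `Q`, `R`, a deterministic initial estimate, and true measurement noise that is zero-mean with covariance `R` (no measurement-model mismatch). All names, including `kf_mse_along_trajectory`, are hypothetical. The actual error splits into a bias driven by the mismatch `d_k = x_k - F x_{k-1}` and a covariance driven by measurement noise, and both are propagated in one pass over the trajectory.

```python
import numpy as np

def kf_mse_along_trajectory(F, H, Q, R, P0, x_true, x0_hat):
    """Actual MSE of a Kalman filter tracking a deterministic trajectory.

    Assumptions (illustrative): the measurement is y_k = H x_k + v_k with
    v_k zero-mean, covariance R, and the initial estimate x0_hat is fixed.
    """
    n = F.shape[0]
    I = np.eye(n)
    P = P0.copy()                      # covariance the filter believes in
    bias = x_true[0] - x0_hat          # mean of the actual error
    Sigma = np.zeros((n, n))           # covariance of the actual error
    mse = [float(bias @ bias + np.trace(Sigma))]
    for k in range(1, len(x_true)):
        # Standard KF gain computation from the filter's own model.
        P_pred = F @ P @ F.T + Q
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
        A = I - K @ H
        P = A @ P_pred
        # Mismatch between the true step and the filter's motion model.
        d = x_true[k] - F @ x_true[k - 1]
        # Error recursion: e_k = A (F e_{k-1} + d_k) - K v_k.
        bias = A @ (F @ bias + d)
        Sigma = A @ F @ Sigma @ F.T @ A.T + K @ R @ K.T
        mse.append(float(bias @ bias + np.trace(Sigma)))
    return np.array(mse)

# Constant-velocity filter model (position, velocity), position measured.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])
P0 = np.eye(2)

steps = np.arange(30)
x_cv = np.stack([steps * 1.0, np.ones(30)], axis=1)        # matches the model
x_acc = np.stack([0.05 * steps**2, 0.1 * steps], axis=1)   # accelerating target

mse_cv = kf_mse_along_trajectory(F, H, Q, R, P0, x_cv, x_cv[0])
mse_acc = kf_mse_along_trajectory(F, H, Q, R, P0, x_acc, x_acc[0])
```

The cost is one gain computation per step, so it grows linearly with trajectory length. In this toy setup the accelerating target adds a bias term on top of the noise-driven covariance, which is why `mse_acc` is never below `mse_cv`.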
Chance-Constrained DC Optimal Power Flow Using Constraint-Informed Statistical Estimation
Chance-constrained optimization has emerged as a promising framework for managing uncertainties in power systems. This work advances its application to the DC Optimal Power Flow (DC-OPF) model, developing a novel approach to uncertainty modeling and estimation. Current methods typically tackle these problems by first modeling random nodal injections using high-dimensional statistical distributions that scale with the number of buses, followed by deriving deterministic reformulations of the probabilistic constraints. We propose an alternative methodology that exploits the constraint structure to inform the uncertainties to be estimated, enabling significant dimensionality reduction. Rather than learning joint distributions of net-load forecast errors across units, we instead directly model the one-dimensional aggregate system forecast error and two-dimensional line errors weighted by power transfer distribution factors. We evaluate our approach under both Gaussian and non-Gaussian distributions on synthetic and real-world datasets, demonstrating significant improvements in statistical accuracy and optimization performance compared to existing methods.
Hybrid Lyapunov and Barrier Function-Based Control with Stabilization Guarantees
Control Lyapunov Functions (CLFs) and Control Barrier Functions (CBFs) can be combined, typically by means of Quadratic Programs (QPs), to design controllers that achieve performance and safety objectives. However, a significant limitation of this framework is the introduction of asymptotically stable equilibrium points besides the minimizer of the CLF, leading to deadlock situations even for simple systems and bounded convex unsafe sets. To address this problem, we propose a hybrid CLF-CBF control framework with global asymptotic stabilization and safety guarantees, offering a more flexible and systematic design methodology compared to current alternatives available in the literature. We further extend this framework to higher-order systems via a recursive procedure based on a joint CLF-CBF backstepping approach. The proposed solution is assessed through several simulation examples.
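For readers unfamiliar with QP-based safety filtering, here is a minimal sketch of the CBF half of the standard QP setup that the abstract refers to. It is not the hybrid controller proposed in the paper. It assumes a single integrator `x_dot = u`, one circular unsafe set, and the barrier `h(x) = ||x - c||^2 - r^2`. With one affine constraint the QP has a closed-form solution, so no solver is needed. The function name `cbf_filter` and the gain `alpha` are hypothetical.

```python
import numpy as np

def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h(x) . u >= -alpha * h(x).

    Solves min ||u - u_nom||^2 subject to a . u >= b in closed form.
    Assumes x != center so that the gradient a is nonzero.
    """
    h = np.dot(x - center, x - center) - radius**2   # barrier value
    a = 2.0 * (x - center)                           # gradient of h
    b = -alpha * h
    violation = b - a @ u_nom
    if violation <= 0.0:
        return u_nom                                 # nominal input is already safe
    # Project u_nom onto the half-space {u : a . u >= b}.
    return u_nom + (violation / (a @ a)) * a

center = np.array([0.0, 0.0])
x = np.array([-2.0, 0.0])

# Nominal input heading straight at the obstacle gets slowed down.
u_safe = cbf_filter(x, np.array([1.0, 0.0]), center, radius=1.0)
# Nominal input moving tangentially is left unchanged.
u_free = cbf_filter(x, np.array([0.0, 1.0]), center, radius=1.0)
```

Here `u_safe` keeps pointing at the obstacle and only shrinks in magnitude. A nominal controller that always points straight through the obstacle would therefore slow to a stop at the boundary, which illustrates the kind of undesired equilibrium the abstract describes.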
Robotics
Coordinate-Independent Robot Model Identification
Robot model identification is commonly performed by least-squares regression on inverse dynamics, but existing formulations measure residuals directly in coordinate force space and therefore depend on the chosen coordinate chart, units, and scaling. This paper proposes a coordinate-independent identification method that weights inverse-dynamics residuals by the dual metric induced by the system Riemannian metric. Using the force--velocity vector--covector duality, the dual metric provides a physically meaningful normalization of generalized forces, pulling coordinate residuals back into the ambient mechanical space and eliminating coordinate-induced bias. The resulting objective remains convex through an affine-metric and Schur-complement reformulation, and is compatible with physical-consistency constraints and geometric regularization. Experiments on an inertia-dominated Crazyflie--pendulum system and a drag-dominated LandSalp robot show improved identification accuracy, especially on shape coordinates, in both low-data and high-data settings.
comment: 8 pages, 7 figures, supplementary video: https://youtu.be/w2bBBV9t1fk?si=iCoJ4l51wumwvCIo
Seeing Where to Deploy: Metric RGB-Based Traversability Analysis for Aerial-to-Ground Hidden Space Inspection
Inspection of confined infrastructure such as culverts often requires accessing hidden spaces whose entrances are reachable primarily from elevated viewpoints. Aerial-ground cooperation enables a UAV to deploy a compact UGV for interior exploration, but selecting a suitable deployment region from aerial observations requires metric terrain reasoning involving scale ambiguity, reconstruction uncertainty, and terrain semantics. We present a metric RGB-based geometric-semantic reconstruction and traversability analysis framework for aerial-to-ground hidden space inspection. A feed-forward multi-view RGB reconstruction backbone produces dense geometry, while temporally consistent semantic segmentation yields a 3D semantic map. To enable deployment-relevant measurements without LiDAR-based dense mapping, we introduce an embodied motion prior that recovers metric scale by enforcing consistency between predicted camera motion and onboard platform egomotion. From the metrically grounded reconstruction, we construct a confidence-aware geometric-semantic traversability map and evaluate candidate deployment zones under explicit reachability constraints. Experiments on a tethered UAV-UGV platform demonstrate reliable deployment-zone identification in hidden space scenarios.
Physically Accurate Rigid-Body Dynamics in Particle-Based Simulation
Robotics demands simulation that can reason about the diversity of real-world physical interactions, from rigid to deformable objects and fluids. Current simulators address this by stitching together multiple subsolvers for different material types, resulting in a compositional architecture that complicates physical reasoning. Particle-based simulators offer a compelling alternative, representing all materials through a single unified formulation that enables seamless cross-material interactions. Among particle-based simulators, position-based dynamics (PBD) is a popular solver known for its computational efficiency and visual plausibility. However, its lack of physical accuracy has limited its adoption in robotics. To leverage the benefits of particle-based solvers while meeting the physical fidelity demands of robotics, we introduce PBD-R, a revised PBD formulation that enforces physically accurate rigid-body dynamics through a novel momentum-conservation constraint and a modified velocity update. Additionally, we introduce a solver-agnostic benchmark with analytical solutions to evaluate physical accuracy. Using this benchmark, we show that PBD-R significantly outperforms PBD and achieves competitive accuracy with MuJoCo while requiring less computation.
comment: Submitted to IROS 2026
CyboRacket: A Perception-to-Action Framework for Humanoid Racket Sports
Dynamic ball-interaction tasks remain challenging for robots because they require tight perception-action coupling under limited reaction time. This challenge is especially pronounced in humanoid racket sports, where successful interception depends on accurate visual tracking, trajectory prediction, coordinated stepping, and stable whole-body striking. Existing robotic racket-sport systems often rely on external motion capture for state estimation or on task-specific low-level controllers that must be retrained across tasks and platforms. We present CyboRacket, a hierarchical perception-to-action framework for humanoid racket sports that integrates onboard visual perception, physics-based trajectory prediction, and large-scale pre-trained whole-body control. The framework uses onboard cameras to track the incoming object, predicts its future trajectory, and converts the estimated interception state into target end-effector and base-motion commands for whole-body execution by SONIC on the Unitree G1 humanoid robot. We evaluate the proposed framework in a vision-based humanoid tennis-hitting task. Experimental results demonstrate real-time visual tracking, trajectory prediction, and successful striking using purely onboard sensing.
Tactile Modality Fusion for Vision-Language-Action Models
We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. While recent advances in VLA models have introduced robot policies that are both generalizable and semantically grounded, these models mainly rely on vision-based perception. Vision alone, however, cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. While recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, the heavy computational demands of behavioural models necessitate more lightweight fusion strategies. To address these challenges, TacFiLM outlines a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations using feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models, improving contact-rich manipulation behaviours.
comment: 19 pages, 5 figures
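Feature-wise linear modulation (FiLM), as named in the TacFiLM abstract above, scales and shifts each feature channel using parameters predicted from a conditioning signal. The sketch below shows only the generic FiLM operation in NumPy, not the TacFiLM architecture. The shapes, the linear maps `W_gamma` and `W_beta`, and the `1 +` offset on the scale (so that zero weights give the identity) are illustrative assumptions.

```python
import numpy as np

def film(features, tactile_embedding, W_gamma, W_beta):
    """Apply FiLM to a (C, H, W) feature map given a (D,) conditioning vector.

    gamma and beta are per-channel scale and shift, predicted linearly
    from the conditioning vector (an illustrative choice).
    """
    gamma = 1.0 + W_gamma @ tactile_embedding   # (C,), identity when weights are zero
    beta = W_beta @ tactile_embedding           # (C,)
    # Broadcast the per-channel parameters over the spatial dimensions.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4))           # stand-in for visual features
z = rng.normal(size=16)                         # stand-in for a tactile embedding
W_gamma = 0.1 * rng.normal(size=(8, 16))
W_beta = 0.1 * rng.normal(size=(8, 16))

modulated = film(features, z, W_gamma, W_beta)
identity = film(features, z, np.zeros((8, 16)), np.zeros((8, 16)))
```

Because the conditioning enters only through per-channel scale and shift, the number of visual tokens is unchanged, in contrast to the token-concatenation fusion that the abstract describes as adding complexity.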
Latent Dynamics-Aware OOD Monitoring for Trajectory Prediction with Provable Guarantees
In safety-critical Cyber-Physical Systems (CPS), accurate trajectory prediction provides vital guidance for downstream planning and control. Although deep learning models achieve high-fidelity forecasts on validation data, their reliability degrades under out-of-distribution (OOD) scenarios caused by environmental uncertainty or rare traffic behaviors in real-world deployment. Detecting such OOD events is challenging due to evolving traffic conditions and changing interaction patterns, while safety-critical applications demand formal guarantees on detection delay and false-alarm rates. This motivates us, following recent work [1], to formulate OOD monitoring for trajectory prediction as a quickest changepoint detection (QCD) problem that offers a principled statistical framework with established theory. We further observe that the real-world evolution of prediction errors under in-distribution (ID) conditions can be effectively modeled by a Hidden Markov Model (HMM), and by leveraging this structure we extend the cumulative Maximum Mean Discrepancy approach to enable detection without requiring explicit knowledge of the post-change distribution while still admitting provable guarantees on delay and false alarms. Experiments on three real-world driving datasets demonstrate reduced detection delay and robustness to heavy-tailed errors and unknown post-change conditions.
A Loss Landscape Visualization Framework for Interpreting Reinforcement Learning: An ADHDP Case Study
Reinforcement learning algorithms have been widely used in dynamic and control systems. However, interpreting their internal learning behavior remains a challenge. In the authors' previous work, a critic match loss landscape visualization method was proposed to study critic training. This study extends that method into a framework which provides a multi-perspective view of the learning dynamics, clarifying how value estimation, policy optimization, and temporal-difference (TD) signals interact during training. The proposed framework includes four complementary components: a three-dimensional reconstruction of the critic match loss surface that shows how TD targets shape the optimization geometry; an actor loss landscape under a frozen critic that reveals how the policy exploits that geometry; a trajectory combining time, Bellman error, and policy weights that indicates how updates move across the surface; and a state-TD map that identifies the state regions that drive those updates. The Action-Dependent Heuristic Dynamic Programming (ADHDP) algorithm for spacecraft attitude control is used as a case study. The framework is applied to compare several ADHDP variants and shows how training stabilizers and target updates change the optimization landscape and affect learning stability. The proposed framework thus provides a systematic and interpretable tool for analyzing reinforcement learning behavior across algorithmic designs.
comment: Submitted to Acta Astronautica
SmallSatSim: A High-Fidelity Simulation and Training Toolkit for Microgravity Robotic Close Proximity Operations
Microgravity rendezvous and close proximity operations (RPO) is a growing area of interest for applications spanning in-space assembly and manufacturing (ISAM), orbital debris remediation, and small body exploration. Microgravity environments present unique challenges for robotic control and planning algorithms for new agile RPO mission scenarios like free-floating manipulation, planning under failure, and estimating high-fidelity dynamics of tumbling bodies. To facilitate the development and testing of novel RPO algorithms, we introduce SmallSatSim, a high-fidelity simulation toolkit that leverages the MuJoCo physics engine to accurately model small satellite RPO dynamics in local microgravity robotic free-flight settings, including under model disturbances and perturbations. The framework includes cutting-edge, out-of-the-box free-flyer control techniques. A GPU-accelerated pipeline using MuJoCo MJX and JAX is implemented for sampling- and learning-based simulation use cases. SmallSatSim also supports configurable failure models, enabling the evaluation of safe control strategies under adversarial conditions. Visualization, logging, and GPU-enabled parallelization further enhance SmallSatSim's capability for RPO testing. We outline SmallSatSim's features and intended use cases, and demonstrate its use for robotic RPO planning and control. The open-sourced toolkit aims to accelerate research in autonomous, agile robotic small satellite operations.
comment: 7 pages, 7 figures
Adapting Critic Match Loss Landscape Visualization to Off-policy Reinforcement Learning
This work extends an established critic match loss landscape visualization method from online to off-policy reinforcement learning (RL), aiming to reveal the optimization geometry behind critic learning. Off-policy RL differs from stepwise online actor-critic learning in its replay-based data flow and target computation. Based on these two structural differences, the critic match loss landscape visualization method is adapted to the Soft Actor-Critic (SAC) algorithm by aligning the loss evaluation with its batch-based data flow and target computation, using a fixed replay batch and precomputed critic targets from the selected policy. Critic parameters recorded during training are projected onto a principal component plane, where the critic match loss is evaluated to form a 3-D landscape with an overlaid 2-D optimization path. Applied to a spacecraft attitude control problem, the resulting landscapes are analyzed both qualitatively and quantitatively using sharpness, basin area, and local anisotropy metrics, together with temporal landscape snapshots. Comparisons between convergent SAC, divergent SAC, and divergent Action-Dependent Heuristic Dynamic Programming (ADHDP) cases reveal distinct geometric patterns and optimization behaviors under different algorithmic structures. The results demonstrate that the adapted critic match loss visualization framework serves as a geometric diagnostic tool for analyzing critic optimization dynamics in replay-based off-policy RL-based control problems.
comment: Revised manuscript, submitted to Astrodynamics
MorFiC: Fixing Value Miscalibration for Zero-Shot Quadruped Transfer
Generalizing learned locomotion policies across quadrupedal robots with different morphologies remains a challenge. Policies trained on a single robot often break when deployed on embodiments with different mass distributions, kinematics, joint limits, or actuation constraints, forcing per-robot retraining. We present MorFiC, a reinforcement learning approach for zero-shot cross-morphology locomotion using a single shared policy. MorFiC resolves a key failure mode in multi-morphology actor-critic training: a shared critic tends to average incompatible value targets across embodiments, yielding miscalibrated advantages. To address this, MorFiC conditions the critic via morphology-aware modulation driven by robot physical and control parameters, generating morphology-specific value estimates within a shared network. Trained with a single source robot with morphology randomization in simulation, MorFiC can transfer to unseen robots and surpasses morphology-conditioned PPO baselines by improving stable average speed and longest stable run on multiple targets, including speed gains of +16.1% on A1, ~2x on Cheetah, and ~5x on B1. We additionally show that MorFiC reduces the value-prediction error variance across morphologies and stabilizes the advantage estimates, demonstrating that the improved value-function calibration corresponds to stronger transfer performance. Finally, we demonstrate zero-shot deployment on two Unitree robots, Go1 and Go2, without fine-tuning, indicating that critic-side conditioning is a practical approach for cross-morphology generalization.
Visualizing Critic Match Loss Landscapes for Interpretation of Online Reinforcement Learning Control Algorithms
Reinforcement learning has proven its power on various occasions. However, its performance is not always guaranteed when system dynamics change. Instead, it largely relies on users' empirical experience. For reinforcement learning algorithms with an actor-critic structure, the critic neural network reflects the approximation and optimization process in the RL algorithm. Analyzing the performance of the critic neural network helps to understand the mechanism of the algorithm. To support systematic interpretation of such algorithms in dynamic control problems, this work proposes a critic match loss landscape visualization method for online reinforcement learning. The method constructs a loss landscape by projecting recorded critic parameter trajectories onto a low-dimensional linear subspace. The critic match loss is evaluated over the projected parameter grid using fixed reference state samples and temporal-difference targets. This yields a three-dimensional loss surface together with a two-dimensional optimization path that characterizes critic learning behavior. To extend analysis beyond visual inspection, quantitative landscape indices and a normalized system performance index are introduced, enabling structured comparison across different training outcomes. The approach is demonstrated using the Action-Dependent Heuristic Dynamic Programming algorithm on cart-pole and spacecraft attitude control tasks. Comparative analyses across projection methods and training stages reveal distinct landscape characteristics associated with stable convergence and unstable learning. The proposed framework enables both qualitative and quantitative interpretation of critic optimization behavior in online reinforcement learning.
comment: Revised manuscript, submitted to Acta Astronautica
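The projection-and-grid procedure described in the abstract above can be sketched as follows. This is a toy illustration and not the authors' implementation. It assumes a linear critic `Phi @ theta`, fixed reference features `Phi`, fixed targets, a trajectory recorded from plain gradient descent, and PCA as the linear projection. All names are hypothetical.

```python
import numpy as np

def critic_loss(theta, Phi, targets):
    """Mean squared mismatch between critic outputs and fixed targets."""
    return float(np.mean((Phi @ theta - targets) ** 2))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(64, 5))        # fixed reference state features
targets = rng.normal(size=64)         # fixed (precomputed) targets

# Record a parameter trajectory with gradient descent at step size 1/L,
# where L is the largest eigenvalue of the loss Hessian.
L = 2.0 * np.linalg.eigvalsh(Phi.T @ Phi).max() / len(targets)
theta = np.zeros(5)
trajectory = [theta.copy()]
for _ in range(50):
    grad = 2.0 * Phi.T @ (Phi @ theta - targets) / len(targets)
    theta = theta - grad / L
    trajectory.append(theta.copy())
trajectory = np.array(trajectory)     # shape (51, 5)

# PCA of the trajectory: the top two right singular vectors span the plane.
mean = trajectory.mean(axis=0)
_, _, Vt = np.linalg.svd(trajectory - mean, full_matrices=False)
directions = Vt[:2]                            # two orthonormal directions
path = (trajectory - mean) @ directions.T      # 2-D optimization path

# Evaluate the loss on a grid in the plane to obtain the 3-D surface.
margin = 0.5
a = np.linspace(path[:, 0].min() - margin, path[:, 0].max() + margin, 25)
b = np.linspace(path[:, 1].min() - margin, path[:, 1].max() + margin, 25)
surface = np.array([
    [critic_loss(mean + ai * directions[0] + bj * directions[1], Phi, targets)
     for ai in a]
    for bj in b
])
```

Plotting `surface` over the `(a, b)` grid with `path` overlaid gives the kind of loss surface and optimization path the abstract describes. Holding `Phi` and `targets` fixed means every grid point is scored against the same data.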
Bots and Blocks: Presenting a project-based approach for robotics education
To prepare students for upcoming trends and challenges, it is important to teach them about the helpful and important aspects of modern technologies, such as robotics. However, classic study programs often fail to prepare students for working in the industry because of the lack of practical experience, caused by solely theoretical lecturing. The challenge is to teach both practical and theoretical skills interactively to improve the students' learning. In the scope of the paper, a project-based learning approach is proposed, where students are taught in an agile, semester-spanning project how to work with robots. This project is part of the applied computer science degree study program Digital Technologies. The paper presents the framework as well as an exemplary project featuring the development of a disassembly software ecosystem for hardware robots. In the project, the students are taught the programming of robots with the help of the Robot Operating System (ROS). To ensure the base qualifications, the students are taught in so-called schools, an interactive mix of lectures and exercises. At the beginning of the course, the basics of the technologies are covered, while the students work more and more in their team with the robot on a specific use case. The use case here is to automate the disassembly of build block assemblies.
comment: 12 pages, 3 figures, 23 references
Interp3R: Continuous-time 3D Geometry Estimation with Frames and Events
In recent years, 3D visual foundation models pioneered by pointmap-based approaches such as DUSt3R have attracted a lot of interest, achieving impressive accuracy and strong generalization across diverse scenes. However, these methods are inherently limited to recovering scene geometry only at the discrete time instants when images are captured, leaving the scene evolution during the blind time between consecutive frames largely unexplored. We introduce Interp3R, to the best of our knowledge the first method that enhances pointmap-based models to estimate depth and camera poses at arbitrary time instants. Interp3R leverages asynchronous event data to interpolate pointmaps produced by frame-based models, enabling temporally continuous geometric representations. Depth and camera poses are then jointly recovered by aligning the interpolated pointmaps together with those predicted by the underlying frame-based models into a consistent spatial framework. We train Interp3R exclusively on a synthetic dataset, yet demonstrate strong generalization across a wide range of synthetic and real-world benchmarks. Extensive experiments show that Interp3R outperforms by a considerable margin state-of-the-art baselines that follow a two-stage pipeline of 2D video frame interpolation followed by 3D geometry estimation.
comment: 18 pages, 6 figures, 5 tables
Architecting Autonomy for Safe Microgravity Free-Flyer Inspection
Small free-flying spacecraft can provide vital extravehicular activity (EVA) services like inspection and repair for future orbital outposts like the Lunar Gateway. Operating adjacent to delicate space station and microgravity targets, these spacecraft require formalization to describe the autonomy that a free-flyer inspection mission must provide. This work explores the transformation of general mission requirements for this class of free-flyer into a set of concrete decisions for the planning and control autonomy architectures that will power such missions. Flowing down from operator commands for inspection of important regions and mission time-criticality, a motion planning problem emerges that provides the basis for developing autonomy solutions. Unique constraints are considered such as velocity limitations, pointing, and keep-in/keep-out zones, with mission fallback techniques for providing hierarchical safety guarantees under model uncertainties and failure. Planning considerations such as cost function design and path vs. trajectory control are discussed. The typical inputs and outputs of the planning and control autonomy stack of such a mission are also provided. Notional system requirements such as solve times and propellant use are documented to inform planning and control design. The entire proposed autonomy framework for free-flyer inspection is realized in the SmallSatSim simulation environment, providing a reference example of free-flyer inspection autonomy. The proposed autonomy architecture serves as a blueprint for future implementations of small satellite autonomous inspection in proximity to mission-critical hardware, going beyond the existing literature in terms of both (1) providing realistic system requirements for an autonomous inspection mission and (2) translating these requirements into autonomy design decisions for inspection planning and control.
comment: 10 pages, 6 figures, published in the Proceedings of the 2025 IEEE Aerospace Conference
VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning where visual inputs are treated as static context. This limits the ability of the model to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual Chain-of-Thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project and Codes: https://cywang735.github.io/VLA-Thinker/ .
comment: We introduce VLA-Thinker, the first VLA model capable of thinking-with-image reasoning, which models visual perception as a dynamically invocable reasoning action, enabling Multimodal Embodied Chain-of-Thought
One-Policy-Fits-All: Geometry-Aware Action Latents for Cross-Embodiment Manipulation ICRA 2026
Cross-embodiment manipulation is crucial for enhancing the scalability of robot manipulation and reducing the high cost of data collection. However, the significant differences between embodiments, such as variations in action spaces and structural disparities, pose challenges for joint training across multiple sources of data. To address this, we propose One-Policy-Fits-All (OPFA), a framework that enables learning a single, versatile policy across multiple embodiments. We first learn a Geometry-Aware Latent Representation (GaLR), which leverages 3D convolution networks and transformers to build a shared latent action space across different embodiments. Then we design a unified latent retargeting decoder that extracts embodiment-specific actions from the latent representations, without any embodiment-specific decoder tuning. OPFA enables end-to-end co-training of data from diverse embodiments, including various grippers and dexterous hands with arbitrary degrees of freedom, significantly improving data efficiency and reducing the cost of skill transfer. We conduct extensive experiments across 11 different end-effectors. The results demonstrate that OPFA significantly improves policy performance in diverse settings by leveraging heterogeneous embodiment data. For instance, cross-embodiment co-training can improve success rates by more than 50% compared to single-source training. Moreover, by adding only a few demonstrations from a new embodiment (e.g., eight), OPFA can achieve performance comparable to that of a well-trained model with 72 demonstrations.
comment: ICRA 2026
R3DP: Real-Time 3D-Aware Policy for Embodied Manipulation
Embodied manipulation requires accurate 3D understanding of objects and their spatial relations to plan and execute contact-rich actions. While large-scale 3D vision models provide strong priors, their computational cost incurs prohibitive latency for real-time control. We propose Real-time 3D-aware Policy (R3DP), which integrates powerful 3D priors into manipulation policies without sacrificing real-time performance. A core innovation of R3DP is the asynchronous fast-slow collaboration module, which seamlessly integrates large-scale 3D priors into the policy without compromising real-time performance. The system maintains real-time efficiency by querying the pre-trained slow system (VGGT) only on sparse key frames, while simultaneously employing a lightweight Temporal Feature Prediction Network (TFPNet) to predict features for all intermediate frames. By leveraging historical data to exploit temporal correlations, TFPNet explicitly improves task success rates through consistent feature estimation. Additionally, to enable more effective multi-view fusion, we introduce a Multi-View Feature Fuser (MVFF) that aggregates features across views by explicitly incorporating camera intrinsics and extrinsics. R3DP offers a plug-and-play solution for integrating large models into real-time inference systems. We evaluate R3DP against multiple baselines across different visual configurations. R3DP effectively harnesses large-scale 3D priors to achieve superior results, outperforming single-view and multi-view DP by 32.9% and 51.4% in average success rate, respectively. Furthermore, by decoupling heavy 3D reasoning from policy execution, R3DP achieves a 44.8% reduction in inference time compared to a naive DP+VGGT integration.
WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning
Autonomous driving systems depend on models that can reason about high-level scene contexts and accurately predict the dynamics of their surrounding environment. Vision-Language Models (VLMs) have recently emerged as promising tools for decision-making and scene understanding, offering strong capabilities in contextual reasoning. However, their limited spatial comprehension constrains their effectiveness as end-to-end driving models. World Models (WMs) internalize environmental dynamics to predict future scene evolution. Recently explored as ego-motion predictors and foundation models for autonomous driving, they represent a promising direction for addressing key challenges in the field, particularly enhancing generalization while maintaining dynamic prediction. To leverage the complementary strengths of context-based decision making and prediction, we propose WorldVLM: a hybrid architecture that unifies VLMs and WMs. In our design, the high-level VLM generates behavior commands to guide the driving WM, enabling interpretable and context-aware actions. We evaluate conditioning strategies and provide insights into the hybrid design challenges.
comment: 8 pages, 6 figures, 5 tables
Physics-Informed Policy Optimization via Analytic Dynamics Regularization ICML 2026
Reinforcement learning (RL) has achieved strong performance in robotic control; however, state-of-the-art policy learning methods, such as actor-critic methods, still suffer from high sample complexity and often produce physically inconsistent actions. This limitation stems from neural policies implicitly rediscovering complex physics from data alone, despite accurate dynamics models being readily available in simulators. In this paper, we introduce a novel physics-informed RL framework, called PIPER, that seamlessly integrates physical constraints directly into neural policy optimization with analytical soft physics constraints. At the core of our method is the integration of a differentiable Lagrangian residual as a regularization term within the actor's objective. This residual, extracted from a robot's simulator description, subtly biases policy updates towards dynamically consistent solutions. Crucially, this physics integration is realized through an additional loss term during policy optimization, requiring no alterations to existing simulators or core RL algorithms. Extensive experiments demonstrate that our method significantly improves learning efficiency, stability, and control accuracy, establishing a new paradigm for efficient and physically consistent robotic control.
comment: 11 pages, 8 figures. Submitted to ICML 2026
Towards Versatile Opti-Acoustic Sensor Fusion and Volumetric Mapping ICRA 2026
Accurate 3D volumetric mapping is critical for autonomous underwater vehicles operating in obstacle-rich environments. Vision-based perception provides high-resolution data but fails in turbid conditions, while sonar is robust to lighting and turbidity but suffers from low resolution and elevation ambiguity. This paper presents a volumetric mapping framework that fuses a stereo sonar pair with a monocular camera to enable safe navigation under varying visibility conditions. Overlapping sonar fields of view resolve elevation ambiguity, producing fully defined 3D point clouds at each time step. The framework identifies regions of interest in camera images, associates them with corresponding sonar returns, and combines sonar range with camera-derived elevation cues to generate additional 3D points. Each 3D point is assigned a confidence value reflecting its reliability. These confidence-weighted points are fused using a Gaussian Process Volumetric Mapping framework that prioritizes the most reliable measurements. Experimental comparisons with other opti-acoustic and sonar-based approaches, along with field tests in a marina environment, demonstrate the method's effectiveness in capturing complex geometries and preserving critical information for robot navigation in both clear and turbid conditions. Our code is open-source to support community adoption.
comment: To appear at ICRA 2026 in Vienna, Austria
OCRA: Object-Centric Learning with 3D and Tactile Priors for Human-to-Robot Action Transfer
We present OCRA, an Object-Centric framework for video-based human-to-Robot Action transfer that learns directly from human demonstration videos to enable robust manipulation. Object-centric learning emphasizes task-relevant objects and their interactions while filtering out irrelevant background, providing a natural and scalable way to teach robots. OCRA leverages multi-view RGB videos, the state-of-the-art 3D foundation model VGGT, and advanced detection and segmentation models to reconstruct object-centric 3D point clouds, capturing rich interactions between objects. To handle properties not easily perceived by vision alone, we incorporate tactile priors via a large-scale dataset of over one million tactile images. These 3D and tactile priors are fused through a multimodal module (ResFiLM) and fed into a Diffusion Policy to generate robust manipulation actions. Extensive experiments on both vision-only and visuo-tactile tasks show that OCRA significantly outperforms existing baselines and ablations, demonstrating its effectiveness for learning from human demonstration videos.
comment: Project page: https://sressers.github.io/OCRA/
eNavi: Event-based Imitation Policies for Low-Light Indoor Mobile Robot Navigation
Event cameras provide high dynamic range and microsecond-level temporal resolution, making them well-suited for indoor robot navigation, where conventional RGB cameras degrade under fast motion or low-light conditions. Despite advances in event-based perception spanning detection, SLAM, and pose estimation, there remains limited research on end-to-end control policies that exploit the asynchronous nature of event streams. To address this gap, we introduce a real-world indoor person-following dataset collected using a TurtleBot 2 robot, featuring synchronized raw event streams, RGB frames, and expert control actions across multiple indoor maps and trajectories under both normal and low-light conditions. We further build a multimodal data preprocessing pipeline that temporally aligns event and RGB observations while reconstructing ground-truth actions from odometry to support high-quality imitation learning. Building on this dataset, we propose a late-fusion RGB-Event navigation policy that combines dual MobileNet encoders with a transformer-based fusion module trained via behavioral cloning. A systematic evaluation of RGB-only, Event-only, and RGB-Event fusion models across 12 training variations ranging from single-path imitation to general multi-path imitation shows that policies incorporating event data, particularly the fusion model, achieve improved robustness and lower action prediction error, especially in unseen low-light conditions where RGB-only models fail. We release the dataset, synchronization pipeline, and trained models at https://eventbasedvision.github.io/eNavi/
From Scanning Guidelines to Action: A Robotic Ultrasound Agent with LLM-Based Reasoning
Robotic ultrasound offers advantages over free-hand scanning, including improved reproducibility and reduced operator dependency. In clinical practice, ultrasound (US) acquisition relies heavily on the sonographer's experience and situational judgment. When transferring this process to robotic systems, such expertise is often encoded explicitly through fixed procedures and task-specific models, yielding pipelines that can be difficult to adapt to new scanning tasks. In this work, we propose a unified framework for autonomous robotic US scanning that leverages an LLM-based agent to interpret US scanning guidelines and execute scans by dynamically invoking a set of provided software tools. Instead of encoding fixed scanning procedures, the LLM agent retrieves and reasons over guideline steps from scanning handbooks and adapts its planning decisions based on observations and the current scanning state. This enables the system to handle variable and decision-dependent workflows, such as adjusting scanning strategies, repeating steps, or selecting the appropriate next tool call in response to image quality or anatomical findings. Because the reasoning underlying tool selection is also critical for transparent and trustworthy planning, we further fine-tune the LLM agent using an RL-based strategy to improve both its reasoning quality and the correctness of tool selection and parameterization, while maintaining robust generalization to unseen guidelines and related tasks. We first validate the approach via verbal execution on 10 US scanning guidelines, assessing reasoning as well as tool selection and parameterization, and showing the benefit of RL fine-tuning. We then demonstrate real-world feasibility on robotic scanning of the gallbladder, spine, and kidney. Overall, the framework follows diverse guidelines and enables reliable autonomous scanning across multiple anatomical targets within a unified system.
comment: Code: https://github.com/yuan-12138/RUSSAgent; Video: https://youtu.be/pfMOc4e2IGA
WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems
Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge-Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings, WestWorld achieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance (see our demo on the website: https://westworldrobot.github.io/). The code will be available upon publication.
OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism
Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for $π_{0.5}$, the most popular MoT VLA, and evaluate under representative robotic configurations. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without action quality degradation.
comment: Preprint
AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control
Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.
comment: 18 pages, 4 figures. Code and demo videos will be available at: https://github.com/XuPeng23/AerialVLA
Deconfounded Lifelong Learning for Autonomous Driving via Dynamic Knowledge Spaces
End-to-End autonomous driving (E2E-AD) systems face challenges in lifelong learning, including catastrophic forgetting, difficulty in knowledge transfer across diverse scenarios, and spurious correlations between unobservable confounders and true driving intents. To address these issues, we propose DeLL, a Deconfounded Lifelong Learning framework that integrates a Dirichlet process mixture model (DPMM) with the front-door adjustment mechanism from causal inference. The DPMM is employed to construct two dynamic knowledge spaces: a trajectory knowledge space for clustering explicit driving behaviors and an implicit feature knowledge space for discovering latent driving abilities. Leveraging the non-parametric Bayesian nature of DPMM, our framework enables adaptive expansion and incremental updating of knowledge without predefining the number of clusters, thereby mitigating catastrophic forgetting. Meanwhile, the front-door adjustment mechanism utilizes the DPMM-derived knowledge as valid mediators to deconfound spurious correlations, such as those induced by sensor noise or environmental changes, and enhances the causal expressiveness of the learned representations. Additionally, we introduce an evolutionary trajectory decoder that enables non-autoregressive planning. To evaluate the lifelong learning performance of E2E-AD, we propose new evaluation protocols and metrics based on Bench2Drive. Extensive evaluations in the closed-loop CARLA simulator demonstrate that our framework significantly improves adaptability to new driving scenarios and overall driving performance, while effectively retaining previously acquired knowledge.
VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion
Perceptive locomotion for legged robots requires anticipating and adapting to complex, dynamic environments. Model Predictive Control (MPC) serves as a strong baseline, providing interpretable motion planning with constraint enforcement, but struggles with high-dimensional perceptual inputs and rapidly changing terrain. In contrast, model-free Reinforcement Learning (RL) adapts well across visually challenging scenarios but lacks planning. To bridge this gap, we propose VIP-Loco, a framework that integrates vision-based scene understanding with RL and planning. During training, an internal model maps proprioceptive states and depth images into compact kinodynamic features used by the RL policy. At deployment, the learned models are used within an infinite-horizon MPC formulation, combining adaptability with structured planning. We validate VIP-Loco in simulation on challenging locomotion tasks, including slopes, stairs, crawling, tilting, gap jumping, and climbing, across three robot morphologies: a quadruped (Unitree Go1), a biped (Cassie), and a wheeled-biped (TronA1-W). Through ablations and comparisons with state-of-the-art methods, we show that VIP-Loco unifies planning and perception, enabling robust, interpretable locomotion in diverse environments.
comment: 8 pages, 5 figures
Data-Driven Physics Embedded Dynamics with Predictive Control and Reinforcement Learning for Quadrupeds
State-of-the-art quadrupedal locomotion approaches integrate Model Predictive Control (MPC) with Reinforcement Learning (RL), enabling complex motion capabilities with planning and terrain-adaptive behaviors. However, they often face compounding errors over long horizons and have limited interpretability due to the absence of physical inductive biases. We address these issues by integrating Lagrangian Neural Networks (LNNs) into an RL-MPC framework, enabling physically consistent dynamics learning. At deployment, our inverse-dynamics infinite-horizon MPC scheme avoids costly matrix inversions, improving computational efficiency by up to 4x with minimal loss of task performance. We validate our framework through multiple ablations of the proposed LNN and its variants. We show improved sample efficiency, reduced long-horizon error, and faster real-time planning compared to unstructured neural dynamics. Lastly, we also test our framework on the Unitree Go1 robot to show real-world viability.
comment: 9 pages, 6 figures
OmniClone: Engineering a Robust, All-Rounder Whole-Body Humanoid Teleoperation System
Whole-body humanoid teleoperation enables humans to remotely control humanoid robots, serving as both a real-time operational tool and a scalable engine for collecting demonstrations for autonomous learning. Despite recent advances, existing systems are validated using aggregate metrics that conflate distinct motion regimes, masking critical failure modes. This lack of diagnostic granularity, compounded by tightly coupled and labor-intensive system configurations, hinders robust real-world deployment. A key open challenge is building a teleoperation system that is simultaneously robust, versatile, and affordable for practical use. Here we present OmniClone, a whole-body humanoid teleoperation system that achieves high-fidelity, multi-skill control on a single consumer GPU with modest data requirements. Central to our approach is OmniBench, a diagnostic benchmark that evaluates policies across stratified motion categories and difficulty levels on unseen motions, exposing the narrow specialization of prior systems. Guided by these diagnostics, we identify an optimized training data recipe and integrate system-level improvements, namely subject-agnostic retargeting and robust communication, that collectively reduce Mean Per-Joint Position Error (MPJPE) by over 66% while requiring orders-of-magnitude fewer computational resources than comparable methods. Crucially, OmniClone is control-source-agnostic: a single unified policy supports real-time teleoperation, generated motion playback, and Vision-Language-Action (VLA) models, while generalizing across operators of vastly different body proportions. By uniting diagnostic evaluation with practical engineering, OmniClone provides an accessible foundation for scalable humanoid teleoperation and autonomous learning.
comment: Website: https://omniclone.github.io/
Load-Aware Locomotion Control for Humanoid Robots in Industrial Transportation Tasks
Humanoid robots deployed in industrial environments are required to perform load-carrying transportation tasks that tightly couple locomotion and manipulation. However, achieving stable and robust locomotion under varying payloads and upper-body motions is challenging due to dynamic coupling and partial observability. This paper presents a load-aware locomotion framework for industrial humanoids based on a decoupled yet coordinated loco-manipulation architecture. Lower-body locomotion is controlled via a reinforcement learning policy producing residual joint actions on kinematically derived nominal configurations. A kinematics-based locomotion reference with a height-conditioned joint-space offset guides learning, while a history-based state estimator infers base linear velocity and height and encodes residual load- and manipulation-induced disturbances in a compact latent representation. The framework is trained entirely in simulation and deployed on a full-size humanoid robot without fine-tuning. Simulation and real-world experiments demonstrate faster training, accurate height tracking, and stable loco-manipulation. Project page: https://lequn-f.github.io/LALO/
comment: This work has been submitted to the IEEE Transactions on Industrial Electronics for possible publication
Seeking Physics in Diffusion Noise
Do video diffusion models encode signals predictive of physical plausibility? We probe intermediate denoising representations of a pretrained Diffusion Transformer (DiT) and find that physically plausible and implausible videos are partially separable in mid-layer feature space across noise levels. This separability cannot be fully attributed to visual quality or generator identity, suggesting recoverable physics-related cues in frozen DiT features. Leveraging this observation, we introduce progressive trajectory selection, an inference-time strategy that scores parallel denoising trajectories at a few intermediate checkpoints using a lightweight physics verifier trained on frozen features, and prunes low-scoring candidates early. Extensive experiments on PhyGenBench demonstrate that our method improves physical consistency while reducing inference cost, achieving comparable results to Best-of-K sampling with substantially fewer denoising steps.
comment: 32 pages, 8 figures, 10 tables
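The progressive trajectory selection idea described above (score parallel denoising trajectories at a few checkpoints, prune the low scorers early) can be sketched as below. This is a toy illustration under assumptions, not the paper's method: `denoise_step` and `score` are hypothetical callables standing in for a DiT denoising step and the lightweight physics verifier, the keep fraction of 0.5 is an arbitrary choice, and the scalar "latents" are constructed so that intermediate scores rank candidates the same way as final scores.

```python
def progressive_selection(candidates, denoise_step, score, num_steps,
                          checkpoints, keep_fraction=0.5):
    """Denoise all candidates in parallel; at each checkpoint step keep
    only the top-scoring fraction. Returns (best state, total step count)."""
    alive = list(candidates)
    steps_used = 0
    for t in range(num_steps):
        alive = [denoise_step(state, t) for state in alive]
        steps_used += len(alive)
        if t in checkpoints and len(alive) > 1:
            alive.sort(key=score, reverse=True)
            alive = alive[:max(1, int(len(alive) * keep_fraction))]
    return max(alive, key=score), steps_used

# Toy setup: each state is (value, bias); a step pulls value toward bias.
# A lower value is treated as "more physically plausible" (assumption).
def toy_denoise_step(state, t):
    value, bias = state
    return (value + 0.2 * (bias - value), bias)

def toy_score(state):
    return -state[0]

biases = [0.8, 0.5, 0.1, 0.3, 0.9, 0.05, 0.6, 0.4]
candidates = [(1.0, b) for b in biases]

best, steps_used = progressive_selection(
    candidates, toy_denoise_step, toy_score,
    num_steps=10, checkpoints={2, 5})

# Best-of-K baseline cost for comparison: every candidate runs every step.
best_of_k_steps = len(candidates) * 10
```

In this toy run the pruned search spends 44 denoising steps against 80 for Best-of-K and still returns the candidate with the lowest bias. That outcome depends on the toy's monotone ranking and should not be read as a result from the paper.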
Geometry-Aware Set-Membership Multilateration: Directional Bounds and Anchor Selection
In this paper, we study anchor selection for range-based localization under unknown-but-bounded measurement errors. We start from the convex localization set $\X=\Xd\cap\Hset$ recently introduced in \cite{CalafioreSIAM}, where $\Xd$ is a polyhedron obtained from pairwise differences of squared-range equations between the unknown location $x$ and the anchors, and $\Hset$ is the intersection of upper-range hyperspheres. Our first goal is \emph{offline} design: we derive geometry-only E- and D-type scores from the centered scatter matrix $S(A)=AQ_mA\tran$, where $A$ collects the anchor coordinates and $Q_m=I_m-\frac{1}{m}\one\one\tran$ is the centering projector, showing that $λ_{\min}(S(A))$ controls worst-direction and diameter surrogates for the polyhedral certificate $\Xd$, while $\det S(A)$ controls principal-axis volume surrogates. Our second goal is \emph{online} uncertainty assessment for a selected subset of anchors: exploiting the special structure $\X=\Xd\cap\Hset$, we derive a simplex-aggregated enclosing ball for $\Hset$ and an exact support-function formula for $\Hset$, which lead to finite hybrid bounds for the actual localization set $\X$, even when the polyhedral certificate deteriorates. Numerical experiments are performed in two dimensions, showing that geometry-based subset selection is close to an oracle combinatorial search, that the D-score slightly dominates the E-score for the area-oriented metric considered here, and that the new $\Hset$-aware certificates track the realized size of the selected localization set closely.
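The geometry-only scores in the abstract above can be illustrated in two dimensions. The sketch assumes only what the abstract states, that the E-type score is the smallest eigenvalue of the centered scatter matrix and the D-type score is its determinant; it uses the standard identity that the centered scatter matrix equals the sum of outer products of mean-centered anchor coordinates. The exhaustive search in `best_subset` and all function names are hypothetical choices for illustration, not the paper's selection procedure.

```python
import math
from itertools import combinations

def scatter_2d(anchors):
    """Entries (sxx, sxy, syy) of the centered scatter matrix
    S = sum_i (a_i - mean)(a_i - mean)^T for 2D anchors."""
    m = len(anchors)
    cx = sum(x for x, _ in anchors) / m
    cy = sum(y for _, y in anchors) / m
    sxx = sum((x - cx) ** 2 for x, _ in anchors)
    syy = sum((y - cy) ** 2 for _, y in anchors)
    sxy = sum((x - cx) * (y - cy) for x, y in anchors)
    return sxx, sxy, syy

def e_score(anchors):
    """E-type score: smallest eigenvalue of S (closed form for 2x2)."""
    sxx, sxy, syy = scatter_2d(anchors)
    half_trace = (sxx + syy) / 2
    return half_trace - math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)

def d_score(anchors):
    """D-type score: determinant of S."""
    sxx, sxy, syy = scatter_2d(anchors)
    return sxx * syy - sxy ** 2

def best_subset(anchors, k, score):
    """Exhaustive search over k-subsets (assumed strategy, small inputs only)."""
    return max(combinations(anchors, k), key=score)

square = [(0, 0), (1, 0), (0, 1), (1, 1)]
collinear = [(0, 0), (1, 0), (2, 0)]
pool = [(0, 0), (1, 0), (2, 0), (0, 2), (1, 1)]

chosen = best_subset(pool, 3, d_score)
```

Collinear anchors give zero for both scores, which matches the intuition that they leave one direction unconstrained, while the D-score search over `pool` picks the widest triangle.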
Design of a Bio-Inspired Miniature Submarine for Low-Cost Water Quality Monitoring
Water quality monitoring is essential for protecting aquatic ecosystems and detecting environmental pollution. This paper presents the design and experimental validation of a bio-inspired miniature submarine for low-cost water quality monitoring. Inspired by the jet propulsion mechanism of squids, the proposed system employs pump-driven water jets for propulsion and steering, combined with a pump-based buoyancy control mechanism that enables both depth regulation and water sampling. The vehicle integrates low-cost, commercially available components including an ESP32 microcontroller, IMU, pressure sensor, GPS receiver, and LoRa communication module. The complete system can be constructed at a hardware cost of approximately $122.5, making it suitable for educational and environmental monitoring applications. Experimental validation was conducted through pool tests and field trials in a lake. During a 360-degree rotation test, roll and pitch deviations remained within +/-2 degrees and +/-1.5 degrees, respectively, demonstrating stable attitude control. Steering experiments showed a heading step response with approximately 2 s rise time and 5 s settling time. Depth control experiments achieved a target depth of 2.5 m with steady-state error within +/-0.1 m. Field experiments further demonstrated reliable navigation and successful water sampling operations. The results confirm that the proposed platform provides a compact, stable, and cost-effective solution for small-scale aquatic environmental monitoring.
AeroGen: Agentic Drone Autonomy through Single-Shot Structured Prompting & Drone SDK
Designing correct UAV autonomy programs is challenging due to joint navigation, sensing and analytics requirements. While LLMs can generate code, their reliability for safety-critical UAVs remains uncertain. This paper presents AeroGen, an open-loop framework that enables consistently correct single-shot AI-generated drone control programs through structured guardrail prompting and integration with the AeroDaaS drone SDK. AeroGen encodes API descriptions, flight constraints and operational world rules directly into the system context prompt, enabling generic LLMs to produce constraint-aware code from user prompts, with minimal example code. We evaluate AeroGen across a diverse benchmark of 20 navigation tasks and 5 drone missions on urban, farm and inspection environments, using both imperative and declarative user prompts. AeroGen generates about 40 lines of AeroDaaS Python code in about 20s per mission, in both real-world and simulated settings, showing that structured prompting with a well-defined SDK improves robustness, correctness and deployability of LLM-generated drone autonomy programs.
A Real-Time Neuro-Symbolic Ethical Governor for Safe Decision Control in Autonomous Robotic Manipulation
Ethical decision governance has become a critical requirement for autonomous robotic systems operating in human-centered and safety-sensitive environments. This paper presents a real-time neuro-symbolic ethical governor designed to enable risk-aware supervisory control in autonomous robotic manipulation tasks. The proposed framework integrates transformer-based ethical reasoning with a probabilistic ethical risk field formulation and a threshold-based override control mechanism. Language-grounded ethical intent inference capability is learned from natural language task descriptions using a fine-tuned DistilBERT model trained on the ETHICS commonsense dataset. A continuous ethical risk metric is subsequently derived from predicted unsafe action probability, confidence uncertainty, and probabilistic variance to support adaptive decision filtering. The effectiveness of the proposed approach is validated through simulated autonomous robot-arm task scenarios involving varying levels of human proximity and operational hazard. Experimental results demonstrate stable model convergence, reliable ethical risk discrimination, and improved safety-aware decision outcomes without significant degradation of task execution efficiency. The proposed neuro-symbolic architecture further provides enhanced interpretability compared with purely data-driven safety filters, enabling transparent ethical reasoning in real-time control loops. The findings suggest that ethical decision governance can be effectively modeled as a dynamic supervisory risk layer for autonomous robotic systems, with potential applicability to broader cyber-physical and assistive robotics domains.
comment: 6 pages, 6 figures, 5 equations
Navigation beyond Wayfinding: Robots Collaborating with Visually Impaired Users for Environmental Interactions
Robotic guidance systems have shown promise in supporting blind and visually impaired (BVI) individuals with wayfinding and obstacle avoidance. However, most existing systems assume a clear path and do not support a critical aspect of navigation - environmental interactions that require manipulating objects to enable movement. These interactions are challenging for a human-robot pair because they demand (i) precise localization and manipulation of interaction targets (e.g., pressing elevator buttons) and (ii) dynamic coordination between the user's and robot's movements (e.g., pulling out a chair to sit). We present a collaborative human-robot approach that combines our robotic guide dog's precise sensing and localization capabilities with the user's ability to perform physical manipulation. The system alternates between two modes: lead mode, where the robot detects and guides the user to the target, and adaptation mode, where the robot adjusts its motion as the user interacts with the environment (e.g., opening a door). Evaluation results show that our system enables navigation that is safer, smoother, and more efficient than both a traditional white cane and a non-adaptive guiding system, with the performance gap widening as tasks demand higher precision in locating interaction targets. These findings highlight the promise of human-robot collaboration in advancing assistive technologies toward more generalizable and realistic navigation support.
comment: Accepted to ACM/IEEE HRI 2026, 10 pages, 6 figures
Towards Equitable Robotic Furnishing Agents for Aging-in-Place: ADL-Grounded Design Exploration
In aging-in-place contexts, small difficulties in Activities of Daily Living (ADL) can accumulate, affecting well-being through fatigue, anxiety, reduced autonomy, and safety risks. This position paper argues that robotics for older adult wellbeing must move beyond "convenience features" and centre equity, justice, and responsibility. We conducted ADL-grounded semi-structured interviews with four adults in their 70s-80s, identifying recurrent challenges (finding/organising items, taking medication, and transporting objects) and deriving requirements to reduce compounded cognitive-physical burden. Based on these insights, we propose an in-home robotic furnishing-agent concept leveraging computer vision, generative AI, and LLMs for natural-language interaction, context-aware reminders, safe actuation, and user-centred transparency. We then report video-stimulated follow-up interviews with the same participants, highlighting preferences for confirmation before actuation, predictability, adjustable speed/autonomy, and multimodal feedback, as well as equity-related concerns. We conclude with open questions on evaluating and deploying equitable robotic wellbeing systems in real homes.
comment: Accepted at the ACM/IEEE International Conference on Human-Robot Interaction (HRI) 2026 Workshop: Equitable Robotics for Wellbeing (Eq-RW)
Semi-Automatic Flute Robot and Its Acoustic Sensing
Flute performance requires mastery of complex fingering combinations and register-dependent embouchure control, particularly jet offset adjustment for low-register production. Existing haptic and semi-automated systems do not address both aspects simultaneously through mechanical actuation. To our knowledge, no prior system fully automates fingering while mechanically assisting low-register tone production without requiring embouchure control. We developed a semi-automatic flute robot with an automatic fingering mechanism: fourteen servo motors actuate all keys via wire-based and rack-and-pinion drives in response to MIDI input, enabling performers to produce complete musical pieces through airflow alone. A jet offset assist mechanism rotates the head joint by a calibrated $22^\circ$ during low-register passages, shifting the jet offset toward a low-register configuration without modifying the instrument or embouchure. Fundamental frequency estimation confirmed correct pitch production across the chromatic range (C4-C7) and during musical performance. All key and lever movements were completed within 77.50 ms, corresponding to tempo capacity exceeding standard requirements. Harmonic analysis ($\Delta\mathrm{SPL} = \mathrm{SPL}_2 - \mathrm{SPL}_3$) showed a consistent increase in $\Delta\mathrm{SPL}$ for all low-register notes when activated, consistent with the intended jet offset shift. Head joint rotation completed within 40.00 ms. These results demonstrate mechanical feasibility of integrating automated fingering and register-dependent jet offset assistance under controlled conditions.
comment: This paper was submitted to a journal and received thorough reviews with high marks from the experts. Despite addressing three rounds of major revisions, it was ultimately rejected due to an unreasonable reviewer. We are uploading it here as a preprint
Federated Multi-Agent Mapping for Planetary Exploration
Multi-agent robotic exploration stands to play an important role in space exploration as the next generation of robotic systems ventures to far-flung environments. A key challenge in this new paradigm will be to effectively share and utilize the vast amount of data generated onboard while operating in bandwidth-constrained regimes typical of space missions. Federated learning (FL) is a promising tool for bridging this gap. Drawing inspiration from the upcoming CADRE Lunar rover mission, we propose a federated multi-agent mapping approach that jointly trains a global map model across agents without transmitting raw data. Our method leverages implicit neural mapping to generate parsimonious, adaptable representations, reducing data transmission by up to 93.8% compared to raw maps. Furthermore, we enhance this approach with meta-initialization on Earth-based traversability datasets to significantly accelerate map convergence, reducing the iterations required to reach target performance by 80% compared to random initialization. We demonstrate the efficacy of our approach on Martian terrains and glacier datasets, achieving downstream path planning F1 scores as high as 0.95 while outperforming on map reconstruction losses.
comment: 7 pages, 6 figures
HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation
Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state-action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher's robust control capabilities into a transformer-based student policy that operates on sparse root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments show HoRD outperforms strong baselines in robustness and transfer, especially under unseen domains and external perturbations. Code and project page are available at https://tonywang-0517.github.io/hord/.
RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
Vision-Language-Action (VLA) systems have shown strong potential for language-driven robotic manipulation. However, scaling them to long-horizon tasks remains challenging. Existing pipelines typically separate data collection, policy learning, and deployment, resulting in heavy reliance on manual environment resets and brittle multi-policy execution. We present RoboClaw, an agentic robotics framework that unifies data collection, policy learning, and task execution under a single VLM-driven controller. At the policy level, RoboClaw introduces Entangled Action Pairs (EAP), which couple forward manipulation behaviors with inverse recovery actions to form self-resetting loops for autonomous data collection. This mechanism enables continuous on-policy data acquisition and iterative policy refinement with minimal human intervention. During deployment, the same agent performs high-level reasoning and dynamically orchestrates learned policy primitives to accomplish long-horizon tasks. By maintaining consistent contextual semantics across collection and execution, RoboClaw reduces mismatch between the two phases and improves multi-policy robustness. Experiments in real-world manipulation tasks demonstrate improved stability and scalability compared to conventional open-loop pipelines, while significantly reducing human effort throughout the robot lifecycle, achieving a 25% improvement in success rate over baseline methods on long-horizon tasks and reducing human time investment by 53.7%.
STRIDE: Structured Lagrangian and Stochastic Residual Dynamics via Flow Matching
Robotic systems operating in unstructured environments must operate under significant uncertainty arising from intermittent contacts, frictional variability, and unmodeled compliance. While recent model-free approaches have demonstrated impressive performance, many deployment settings still require predictive models that support planning, constraint handling, and online adaptation. Analytical rigid-body models provide strong physical structure but often fail to capture complex interaction effects, whereas purely data-driven models may violate physical consistency, exhibit data bias, and accumulate long-horizon drift. In this work, we propose STRIDE, a dynamics learning framework that explicitly separates conservative rigid-body mechanics from uncertain, effectively stochastic non-conservative interaction effects. The structured component is modeled using a Lagrangian Neural Network (LNN) to preserve energy-consistent inertial dynamics, while residual interaction forces are represented using Conditional Flow Matching (CFM) to capture multi-modal interaction phenomena. The two components are trained jointly end-to-end, enabling the model to retain physical structure while representing complex stochastic behavior. We evaluate STRIDE on systems of increasing complexity, including a pendulum, the Unitree Go1 quadruped, and the Unitree G1 humanoid. Results show 20% reduction in long-horizon prediction error and 30% reduction in contact force prediction error compared to deterministic residual baselines, supporting more reliable model-based control in uncertain robotic environments.
comment: 9 pages, 7 figures
HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
Mastering dexterous manipulation with multi-fingered hands has been a grand challenge in robotics for decades. Despite its potential, the difficulty of collecting high-quality data remains a primary bottleneck for high-precision tasks. While reinforcement learning and simulation-to-real-world transfer offer a promising alternative, the transferred policies often fail for tasks demanding millimeter-scale precision, such as bimanual piano playing. In this work, we introduce HandelBot, a framework that combines a simulation policy and rapid adaptation through a two-stage pipeline. Starting from a simulation-trained policy, we first apply a structured refinement stage to correct spatial alignments by adjusting lateral finger joints based on physical rollouts. Next, we use residual reinforcement learning to autonomously learn fine-grained corrective actions. Through extensive hardware experiments across five recognized songs, we demonstrate that HandelBot can successfully perform precise bimanual piano playing. Our system outperforms direct simulation deployment by a factor of 1.8x and requires only 30 minutes of physical interaction data.
comment: Website: https://amberxie88.github.io/handelbot
Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty
Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown or uncertain friction forces between the wheels and the ground, which vary with changes in road surface, tire wear, and vehicle speed. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. Separating the kinematics and dynamics of the robot simplifies the control problem and allows for more efficient and robust control of the wheeled mobile robot.
comment: Extended version based on the author's MSc thesis. Related to an earlier IEEE ICAR 2021 publication
A Modular Architecture Design for Autonomous Driving Racing in Controlled Environments
This paper presents a modular autonomous driving architecture for Formula Student Driverless competition vehicles operating in closed-circuit environments. The perception module employs YOLOv11 for real-time traffic cone detection, achieving 0.93 mAP@0.5 on the FSOCO dataset, combined with neural stereo depth estimation from a ZED 2i camera for 3D cone localization with sub-0.5 m median error at distances up to 7 m. State estimation fuses RTK-GNSS positioning and IMU measurements through an Extended Kalman Filter (EKF) based on a kinematic bicycle model, achieving centimeter-level localization accuracy with a 12 cm improvement over raw GNSS. Path planning computes the racing line via cubic spline interpolation on ordered track boundaries and assigns speed profiles constrained by curvature and vehicle dynamics. A regulated pure pursuit controller tracks the planned trajectory with a dynamic lookahead parameterized by speed error. The complete pipeline is implemented as a modular ROS 2 architecture on an NVIDIA Jetson Orin NX platform, with each subsystem deployed as independent nodes communicating through a dual-computer configuration. Experimental validation combines real-world sensor evaluation with simulation-based end-to-end testing, where realistic sensor error distributions are injected to assess system-level performance under representative conditions.
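The tracking stage described above can be sketched with the classic pure pursuit steering law on a kinematic bicycle model. The lookahead rule below (a base distance that grows with the absolute speed error and is clamped to a range) is an assumption for illustration only. The abstract says the lookahead is parameterized by speed error but does not give the formula, and all default gains are hypothetical.

```python
import math

def lookahead_distance(speed_error, base=2.0, gain=0.5, lo=1.0, hi=6.0):
    """Assumed rule: lookahead grows with |speed error|, clamped to [lo, hi] metres."""
    return min(hi, max(lo, base + gain * abs(speed_error)))

def pure_pursuit_steering(pose, target, wheelbase, lookahead):
    """Classic pure pursuit steering angle (radians) for a bicycle model.

    pose   : (x, y, yaw) of the rear axle in the world frame
    target : (x, y) lookahead point on the planned path
    """
    x, y, yaw = pose
    # Bearing of the target relative to the vehicle heading.
    alpha = math.atan2(target[1] - y, target[0] - x) - yaw
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)
```

A target straight ahead yields zero steering, while a target at 90 degrees to the left with `wheelbase=1.5` and `lookahead=3.0` yields a steering angle of pi/4.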
RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design
Robotic manipulation policies have made rapid progress in recent years, yet most existing approaches give limited consideration to memory capabilities. Consequently, they struggle to solve tasks that require reasoning over historical observations and maintaining task-relevant information over time, which are common requirements in real-world manipulation scenarios. Although several memory-aware policies have been proposed, systematic evaluation of memory-dependent manipulation remains underexplored, and the relationship between architectural design choices and memory performance is still not well understood. To address this gap, we introduce RMBench, a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity, enabling systematic evaluation of policy memory capabilities. We further propose Mem-0, a modular manipulation policy with explicit memory components designed to support controlled ablation studies. Through extensive simulation and real-world experiments, we identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance. The website is available at https://rmbench.github.io/.
comment: website: https://rmbench.github.io/
ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning
Long-horizon embodied planning is challenging because the world does not only change through an agent's actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent's actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic cause-effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
comment: ICLR 2026. The last two authors contributed equally in co-advising
REACT3D: Recovering Articulations for Interactive Physical 3D Scenes
Interactive 3D scenes are increasingly vital for embodied intelligence, yet existing datasets remain limited due to the labor-intensive process of annotating part segmentation, kinematic types, and motion trajectories. We present REACT3D, a scalable zero-shot framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry, enabling direct use in diverse downstream tasks. Our contributions include: (i) openable-object detection and segmentation to extract candidate movable parts from static scenes, (ii) articulation estimation that infers joint types and motion parameters, (iii) hidden-geometry completion followed by interactive object assembly, and (iv) interactive scene integration in widely supported formats to ensure compatibility with standard simulation platforms. We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes, demonstrating the effectiveness of our framework and providing a practical foundation for scalable interactive scene generation, thereby lowering the barrier to large-scale research on articulated scene understanding. Our project page is https://react3d.github.io/
comment: 8 pages
Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
Reinforcement learning (RL) has become a foundational approach for enabling intelligent robotic behavior in dynamic and uncertain environments. This work presents an in-depth review of RL principles, advanced deep reinforcement learning (DRL) algorithms, and their integration into robotic and control systems. Beginning with the formalism of Markov Decision Processes (MDPs), the study outlines essential elements of the agent-environment interaction and explores core algorithmic strategies including actor-critic methods, value-based learning, and policy gradients. Emphasis is placed on modern DRL techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving high-dimensional, continuous control tasks. A structured taxonomy is introduced to categorize RL applications across domains such as locomotion, manipulation, multi-agent coordination, and human-robot interaction, along with training methodologies and deployment readiness levels. The review synthesizes recent research efforts, highlighting technical trends, design patterns, and the growing maturity of RL in real-world robotics. Overall, this work aims to bridge theoretical advances with practical implementations, providing a consolidated perspective on the evolving role of RL in autonomous robotic systems.
Density Matrix-based Dynamics for Quantum Robotic Swarms
In a robotic swarm, parameters such as position and proximity to the target can be described in terms of probability amplitudes. This idea led to recent studies on a quantum approach to the definition of the swarm, including a block-matrix representation. However, the size of such matrix-based representations increases drastically with the swarm size, making them impractical for large swarms. Hence, in this work, we propose a new approach for modeling robotic swarms and robotic networks by considering them as mixed quantum states that can be represented mathematically via density matrices. The size of this representation depends only on the available degrees of freedom of each robot, not on the swarm size, and thus scales well to large swarms. Moreover, it also enables the extraction of local information of the robots from the global swarm information contained in the density matrices, facilitating decentralized behavior that aligns with the collective swarm behavior. Our approach is validated on several simulations including large-scale swarms of up to 1000 robots. Finally, we provide some directions for future research that could potentially widen the impact of our approach.
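The scaling claim can be illustrated with a small sketch that builds a density matrix as a mixture of per-robot pure states. Equal weights of 1/N per robot are an assumption made here for illustration, and the function names are hypothetical. The paper's actual construction may differ.

```python
def swarm_density_matrix(states):
    """Mixed-state density matrix for a swarm, assuming equal weights 1/N.

    states : list of per-robot amplitude vectors, each of length d.
    Returns a d x d matrix (list of lists), independent of the number of robots.
    """
    n, d = len(states), len(states[0])
    rho = [[0j] * d for _ in range(d)]
    for psi in states:
        norm2 = sum(abs(a) ** 2 for a in psi)  # normalise each robot's state
        for i in range(d):
            for j in range(d):
                rho[i][j] += psi[i] * complex(psi[j]).conjugate() / (norm2 * n)
    return rho

def trace(rho):
    return sum(rho[i][i] for i in range(len(rho))).real

def purity(rho):
    """Tr(rho^2): 1 for a pure state, below 1 for a mixed state."""
    d = len(rho)
    return sum(rho[i][j] * rho[j][i] for i in range(d) for j in range(d)).real
```

A swarm of 1000 robots with two-level states still produces a 2x2 matrix. Identical robots give purity 1, while an even split between two orthogonal states gives purity 0.5.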
Walking through Doors is Hard, even without Staircases: Universality and PSPACE-hardness of Planar Door Gadgets
An open-close door gadget has two states and three tunnels that can be traversed by an agent (player, robot, etc.): the "opening" and "closing" tunnels set the gadget's state to open and closed, respectively, while the "traverse" tunnel can be traversed if and only if the door is in the open state. We prove that it is PSPACE-complete to decide whether an agent can move from one location to another through a planar system of any such door gadget, removing the traditional need for crossover gadgets and thereby simplifying past PSPACE-hardness proofs of Lemmings and Nintendo games Super Mario Bros., Legend of Zelda, and Donkey Kong Country. Even stronger, we show that any gadget in the motion-planning-through-gadgets framework can be simulated by a planar system of door gadgets: the open-close door gadget is a universal gadget. We prove that these results hold for a variety of door gadgets. In particular, the opening, closing, and traverse tunnel locations can have an arbitrary cyclic order around the door; each tunnel can be directed or undirected; and the opening tunnel can instead be an optional button (with identical entrance and exit locations). Furthermore, we show the same hardness and universality results for two simpler types of door gadgets: self-closing door gadgets and symmetric self-closing door gadgets. Again we show that any self-closing door gadget planarly simulates any gadget, and thus the reachability motion planning problem is PSPACE-complete. Then we apply this framework to prove new PSPACE-hardness results for eight different 3D Mario video games and Sokobond.
comment: 36 pages, 35 figures. All cases are now proved PSPACE-complete. New universality proofs. Earlier version published at FUN 2020
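The open-close door gadget defined in the abstract above is a two-state machine with three tunnels. A minimal sketch of that definition follows. The class and tunnel names are hypothetical, and tunnel direction and planar layout are not modelled.

```python
class OpenCloseDoor:
    """Two-state door gadget with 'open', 'close' and 'traverse' tunnels."""

    def __init__(self, is_open=False):
        self.is_open = is_open

    def enter(self, tunnel):
        """Attempt to pass through a tunnel; return True if the agent gets through."""
        if tunnel == "open":
            self.is_open = True      # opening tunnel always passable, sets state to open
            return True
        if tunnel == "close":
            self.is_open = False     # closing tunnel always passable, sets state to closed
            return True
        if tunnel == "traverse":
            return self.is_open      # passable if and only if the door is open
        raise ValueError(f"unknown tunnel: {tunnel!r}")
```

Starting from a closed door, `enter("traverse")` fails until `enter("open")` has been called, and fails again after `enter("close")`.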
Hydrodynamic Performance Enhancement of Unmanned Underwater Gliders with Soft Robotic Morphing Wings for Agility Improvement
This work assesses the hydrodynamic efficiency of Unmanned Underwater Vehicles (UUVs) equipped with soft morphing wings compared to conventional rigid wings. Unlike rigid wings, deformable counterparts can alter their aerodynamic properties on demand. Improvements in hydrodynamic efficiency extend a UUV's operational range and may determine mission feasibility. Structural and Computational Fluid Dynamics (CFD) simulations were conducted for both a soft morphing wing and a UUV incorporating it. The results show that a UUV employing soft wings achieves 9.75 percent higher overall efficiency than an equivalent vehicle with traditional rigid wings. These findings confirm the potential of soft robotics to enhance underwater vehicle performance, particularly in applications requiring pressure-agnostic operation.
comment: Conference paper accepted at 9th IEEE-RAS International Conference on Soft Robotics (RoboSoft 2026)
SERN: Bandwidth-Adaptive Cross-Reality Synchronization for Simulation-Enhanced Robot Navigation
Cross reality integration of simulation and physical robots is a promising approach for multi-robot operations in contested environments, where communication may be intermittent, interference may be present, and observability may be degraded. We present SERN (Simulation-Enhanced Realistic Navigation), a framework that tightly couples a high-fidelity virtual twin with physical robots to support real-time collaborative decision making. SERN makes three main contributions. First, it builds a virtual twin from geospatial and sensor data and continuously corrects it using live robot telemetry. Second, it introduces a physics-aware synchronization pipeline that combines predictive modeling with adaptive PD control. Third, it provides a bandwidth-adaptive ROS bridge that prioritizes critical topics when communication links are constrained. We also introduce a multi-metric cost function that balances latency, reliability, computation, and bandwidth. Theoretically, we show that when the adaptive controller keeps the physical and virtual input mismatch small, synchronization error remains bounded under moderate packet loss and latency. Empirically, SERN reduces end-to-end message latency by 15% to 25% and processing load by about 15% compared with a standard ROS setup, while maintaining tight real-virtual alignment with less than 5 cm positional error and less than 2 degrees rotational error. In a navigation task, SERN achieves a 95% success rate, compared with 85% for a real-only setup and 70% for a simulation-only setup, while also requiring fewer interventions and less time to reach the goal. These results show that a simulation-enhanced cross-reality stack can improve situational awareness and multi-agent coordination in contested environments by enabling look-ahead planning in the virtual twin while using real sensor feedback to correct discrepancies.
Interpretable Responsibility Sharing as a Heuristic for Task and Motion Planning
This article introduces a novel heuristic for Task and Motion Planning (TAMP) named Interpretable Responsibility Sharing (IRS), which enhances planning efficiency in domestic robots by leveraging human-constructed environments and inherent biases. Utilizing auxiliary objects (e.g., trays and pitchers), which are commonly found in household settings, IRS systematically incorporates these elements to simplify and optimize task execution. The heuristic is rooted in the novel concept of Responsibility Sharing (RS), where auxiliary objects share the task's responsibility with the embodied agent, dividing complex tasks into manageable sub-problems. This division not only reflects human usage patterns but also aids robots in navigating and manipulating within human spaces more effectively. By integrating Optimized Rule Synthesis (ORS) for decision-making, IRS ensures that the use of auxiliary objects is both strategic and context-aware, thereby improving the interpretability and effectiveness of robotic planning. Experiments conducted across various household tasks demonstrate that IRS significantly outperforms traditional methods by reducing the effort required in task execution and enhancing the overall decision-making process. This approach not only aligns with human intuitive methods but also offers a scalable solution adaptable to diverse domestic environments. Code is available at https://github.com/asyncs/IRS.
comment: Accepted for the Special Issue "Planning and Learning for Autonomous Robotics" in Robotics and Autonomous Systems
World In Your Hands: A Large-Scale and Open-Source Ecosystem for Learning Human-Centric Manipulation in the Wild
We introduce World In Your Hands (WIYH), a large-scale open-source ecosystem comprising over 1,000 hours of human manipulation data collected in-the-wild with millimeter-scale motion accuracy. Specifically, WIYH includes (1) the Oracle Suite, a wearable data collection kit with an auto-labeling pipeline for accurate motion capture; (2) the WIYH Dataset, featuring over 1,000 hours of multimodal manipulation data across hundreds of skills in diverse real-world scenarios; and (3) extensive annotations and benchmarks supporting tasks from perception to action. Furthermore, experiments based on the WIYH ecosystem show that integrating WIYH's human-centric data improves robotic manipulation success rates from 8% to 60% in cluttered scenes. World In Your Hands provides a foundation for advancing human-centric data collection and cross-embodiment policy learning. All data and hardware design will be open-source.
comment: This dataset represents the first large-scale collection of real-world, human-centric multimodal data integrating vision, language, tactile sensing, and action (VLTA). GitHub: https://github.com/tars-robotics/World-In-Your-Hands
Risk-Aware Obstacle Avoidance Algorithm for Real-Time Applications
Robust navigation in changing marine environments requires autonomous systems capable of perceiving, reasoning, and acting under uncertainty. This study introduces a hybrid risk-aware navigation architecture that integrates probabilistic modeling of obstacles along the vehicle path with smooth trajectory optimization for autonomous surface vessels. The system constructs probabilistic risk maps that capture both obstacle proximity and the behavior of dynamic objects. A risk-biased Rapidly Exploring Random Tree (RRT) planner leverages these maps to generate collision-free paths, which are subsequently refined using B-spline algorithms to ensure trajectory continuity. Three distinct RRT* rewiring modes are implemented based on the cost function: minimizing the path length, minimizing risk, and optimizing a combination of the path length and total risk. The framework is evaluated in experimental scenarios containing both static and dynamic obstacles. The results demonstrate the system's ability to navigate safely, maintain smooth trajectories, and dynamically adapt to changing environmental risks. Compared with conventional LIDAR or vision-only navigation approaches, the proposed method shows improvements in operational safety and autonomy, establishing it as a promising solution for risk-aware autonomous vehicle missions in uncertain and dynamic environments.
$χ_{0}$: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies
High-reliability long-horizon robotic manipulation has traditionally relied on large-scale data and compute to understand complex real-world dynamics. However, we identify that the primary bottleneck to real-world robustness is not resource scale alone, but the distributional shift among the human demonstration distribution, the inductive bias learned by the policy, and the test-time execution distribution -- a systematic inconsistency that causes compounding errors in multi-stage tasks. To mitigate these inconsistencies, we propose $χ_{0}$, a resource-efficient framework with effective modules designed to achieve production-level robustness in robotic manipulation. Our approach builds on three technical pillars: (i) Model Arithmetic, a weight-space merging strategy that efficiently soaks up diverse distributions of different demonstrations, varying from object appearance to state variations; (ii) Stage Advantage, a stage-aware advantage estimator that provides stable, dense progress signals, overcoming the numerical instability of prior non-stage approaches; and (iii) Train-Deploy Alignment, which bridges the distribution gap via spatio-temporal augmentation, heuristic DAgger corrections, and temporal chunk-wise smoothing. $χ_{0}$ enables two sets of dual-arm robots to collaboratively orchestrate long-horizon garment manipulation, spanning tasks from flattening and folding to hanging different clothes. Our method exhibits high-reliability autonomy; we are able to run the system from an arbitrary initial state for 24 consecutive hours non-stop. Experiments validate that $χ_{0}$ surpasses the state-of-the-art $π_{0.5}$ in success rate by nearly 250%, with only 20 hours of data and 8 A100 GPUs. Code, data and models will be released to facilitate the community.
Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical Variations
Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework to systematically evaluate the robustness of VLA models by formulating uncontrollable physical variations as continuous optimization problems. Specifically, our framework addresses two fundamental challenges in VLA models' physical robustness evaluation: 1) how to systematically characterize diverse physical perturbations encountered in real-world deployment while maintaining reproducibility, and 2) how to efficiently discover worst-case scenarios without incurring prohibitive real-world data collection costs. To tackle the first challenge, we decouple real-world variations into three key dimensions: 3D object transformations that affect spatial reasoning, illumination changes that challenge visual perception, and adversarial regions that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization mechanism that maps these perturbations into a continuous parameter space, enabling the systematic exploration of worst-case scenarios. Extensive experiments validate the effectiveness of our approach. Notably, OpenVLA exhibits an average failure rate of over 90% across three physical variations on the LIBERO-Long task, exposing critical systemic fragilities. Furthermore, applying the generated worst-case scenarios during adversarial training quantifiably increases model robustness, validating the effectiveness of this approach. Our evaluation exposes the gap between laboratory and real-world conditions, while the Eva-VLA framework can serve as an effective data augmentation method to enhance the resilience of robotic manipulation systems.
DiffusionRL: Efficient Training of Diffusion Policies for Robotic Grasping Using RL-Adapted Large-Scale Datasets
Diffusion models have been successfully applied in areas such as image, video, and audio generation. Recent works show their promise for sequential decision-making and dexterous manipulation, leveraging their ability to model complex action distributions. However, challenges persist due to the data limitations and scenario-specific adaptation needs. In this paper, we address these challenges by proposing an optimized approach to training diffusion policies using large, pre-built datasets that are enhanced using Reinforcement Learning (RL). Our end-to-end pipeline leverages RL-based enhancement of the DexGraspNet dataset, lightweight diffusion policy training on a dexterous manipulation task for a five-fingered robotic hand, and a pose sampling algorithm for validation. The pipeline achieved a high success rate of 80% for three DexGraspNet objects. By eliminating manual data collection, our approach lowers barriers to adopting diffusion models in robotics, enhancing generalization and robustness for real-world applications.
Multimodal Belief-Space Covariance Steering with Active Probing and Influence for Interactive Driving ICRA 2026
Autonomous driving in complex traffic requires reasoning under uncertainty. Common approaches rely on prediction-based planning or risk-aware control, but these are typically treated in isolation, limiting their ability to capture the coupled nature of action and inference in interactive settings. This gap becomes especially critical in uncertain scenarios, where simply reacting to predictions can lead to unsafe maneuvers or overly conservative behavior. Our central insight is that safe interaction requires not only estimating human behavior but also shaping it when ambiguity poses risks. To this end, we introduce a hierarchical belief model that structures human behavior across coarse discrete intents and fine motion modes, updated via Bayesian inference for interpretable multi-resolution reasoning. On top of this, we develop an active probing strategy that identifies when multimodal ambiguity in human predictions may compromise safety and plans disambiguating actions that both reveal intent and gently steer human decisions toward safer outcomes. Finally, a runtime risk-evaluation layer based on Conditional Value-at-Risk (CVaR) ensures that all probing actions remain within human risk tolerance during influence. Our simulations in lane-merging and unsignaled intersection scenarios demonstrate that our approach achieves higher success rates and shorter completion times compared to existing methods. These results highlight the benefit of coupling belief inference, probing, and risk monitoring, yielding a principled and interpretable framework for planning under uncertainty.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA 2026)
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera CVPR 2026
Robust 3D semantic occupancy is crucial for legged/humanoid robots, yet most semantic scene completion (SSC) systems target wheeled platforms with forward-facing sensors. We present OneOcc, a vision-only panoramic SSC framework designed for gait-introduced body jitter and 360° continuity. OneOcc combines: (i) Dual-Projection fusion (DP-ER) to exploit the annular panorama and its equirectangular unfolding, preserving 360° continuity and grid alignment; (ii) Bi-Grid Voxelization (BGV) to reason in Cartesian and cylindrical-polar spaces, reducing discretization bias and sharpening free/occupied boundaries; (iii) a lightweight decoder with Hierarchical AMoE-3D for dynamic multi-scale fusion and better long-range/occlusion reasoning; and (iv) plug-and-play Gait Displacement Compensation (GDC) learning feature-level motion correction without extra sensors. We also release two panoramic occupancy benchmarks: QuadOcc (real quadruped, first-person 360°) and Human360Occ (H3O) (CARLA human-ego 360° with RGB, Depth, semantic occupancy; standardized within-/cross-city splits). OneOcc sets a new state of the art on QuadOcc, outperforming strong vision baselines and remaining competitive with classical LiDAR baselines; on H3O it gains +3.83 mIoU (within-city) and +8.08 (cross-city). Modules are lightweight, enabling deployable full-surround perception for legged/humanoid robots. Datasets and code will be publicly available at https://github.com/MasterHow/OneOcc.
comment: Accepted to CVPR 2026. Datasets and code will be publicly available at https://github.com/MasterHow/OneOcc
Beyond Frame-wise Tracking: A Trajectory-based Paradigm for Efficient Point Cloud Tracking ICRA 2026
LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous systems. Existing methods typically follow frame-wise motion estimation or a sequence-based paradigm. However, two-frame methods are efficient but lack long-term temporal context, making them vulnerable in sparse or occluded scenes, while sequence-based methods that process multiple point clouds gain robustness at a significant computational cost. To resolve this dilemma, we propose a novel trajectory-based paradigm and its instantiation, TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame tracker by implicitly learning motion continuity from historical bounding box trajectories alone, without requiring additional, costly point cloud inputs. It first generates a fast, explicit motion proposal and then uses an implicit motion modeling module to predict the future trajectory, which in turn refines and corrects the initial proposal. Extensive experiments on the large-scale NuScenes benchmark show that TrajTrack achieves new state-of-the-art performance, dramatically improving tracking precision by 3.02% over a strong baseline while running at 55 FPS. We also demonstrate the strong generalizability of TrajTrack across different base trackers. Code is available at https://github.com/FiBonaCci225/TrajTrack.
comment: Accepted to ICRA 2026
ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation CVPR 2026
Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a training-free progressive framework that unifies Proactive Perception and Focused Reasoning through collaboration between large language models (LLMs) and vision-language models (VLMs). For proactive perception, ProFocus transforms panoramic observations into structured ego-centric semantic maps, enabling the orchestration agent to identify missing visual information needed for reliable decision-making, and to generate targeted visual queries with corresponding focus regions that guide the perception agent to acquire the required observations. For focused reasoning, we propose Branch-Diverse Monte Carlo Tree Search (BD-MCTS) to identify top-$k$ high-value waypoints from extensive historical candidates. The decision agent focuses reasoning on the historical contexts associated with these waypoints, rather than considering all historical waypoints equally. Extensive experiments validate the effectiveness of ProFocus, achieving state-of-the-art performance among zero-shot methods on R2R and REVERIE benchmarks.
comment: Accepted by CVPR 2026
SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation
Today's autonomous agents, largely driven by foundation models (FMs), can understand natural language instructions and solve long-horizon tasks with human-like reasoning. However, current human-robot interaction largely follows a one-way master-apprentice technique where the agent passively executes commands without reciprocal learning. This neglects the co-adaptive, multi-turn nature of everyday human interactions. We introduce symbiotic interactive learning (SIL), a bidirectional co-adaptation framework in a shared latent task space, where human and agent maintain joint belief states that evolve with interaction history. This enables proactive clarification, adaptive suggestions, and shared plan refinement. SIL leverages FMs for spatial perception and reasoning, together with a triplet-loss-trained neural encoder that grounds FMs' outputs into task-specific latent representations. To support long-term stability as tasks evolve, SIL uses episodic and semantic memory architectures, regularised via elastic weight consolidation to mitigate catastrophic forgetting. We evaluate SIL on simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogue, achieving a $90.4\%$ task completion rate and a belief alignment score of $ρ\approx 0.83$, an absolute improvement of about $20$ percentage points over the best ablations. Demos and resources: https://linusnep.github.io/SIL/.
Multiagent Systems
EARCP: Self-Regulating Coherence-Aware Ensemble Architecture for Sequential Decision Making -- Ensemble Auto-Regule par Coherence et Performance
We present EARCP (Ensemble Auto-Régulé par Cohérence et Performance), a novel ensemble architecture that dynamically weights heterogeneous expert models based on both their individual performance and inter-model coherence. Unlike traditional ensemble methods that rely on static or offline-learned combinations, EARCP continuously adapts model weights through a principled online learning mechanism that balances exploitation of high-performing models with exploration guided by consensus signals. The architecture combines theoretical foundations from multiplicative weight update algorithms with a novel coherence-based regularization term, providing both theoretical guarantees through regret bounds and practical robustness in non-stationary environments. We formalize the EARCP framework, prove sublinear regret bounds of O(sqrt(T log M)) under standard assumptions, and demonstrate its effectiveness through empirical evaluation on sequential prediction tasks including time series forecasting, activity recognition, and financial prediction. The architecture is designed as a general-purpose framework applicable to any domain requiring ensemble learning with temporal dependencies. An open-source implementation is available at https://github.com/Volgat/earcp and via PyPI (pip install earcp).
comment: 13 pages, 1 table, 1 algorithm. Open-source implementation available at https://github.com/Volgat/earcp and via pip install earcp. Dual-licensed: free for academic researchers, students, and organizations with gross revenue under $100,000/year; commercial license required for organizations exceeding this threshold (contact author)
EcoFair-CH-MARL: Scalable Constrained Hierarchical Multi-Agent RL with Real-Time Emission Budgets and Fairness Guarantees ECAI
Global decarbonisation targets and tightening market pressures demand maritime logistics solutions that are simultaneously efficient, sustainable, and equitable. We introduce EcoFair-CH-MARL, a constrained hierarchical multi-agent reinforcement learning framework that unifies three innovations: (i) a primal-dual budget layer that provably bounds cumulative emissions under stochastic weather and demand; (ii) a fairness-aware reward transformer with dynamically scheduled penalties that enforces max-min cost equity across heterogeneous fleets; and (iii) a two-tier policy architecture that decouples strategic routing from real-time vessel control, enabling linear scaling in agent count. New theoretical results establish O(\sqrt{T}) regret for both constraint violations and fairness loss. Experiments on a high-fidelity maritime digital twin (16 ports, 50 vessels) driven by automatic identification system traces, plus an energy-grid case study, show up to 15% lower emissions, 12% higher throughput, and a 45% fair-cost improvement over state-of-the-art hierarchical and constrained MARL baselines. In addition, EcoFair-CH-MARL achieves stronger equity (lower Gini and higher min-max welfare) than fairness-specific MARL baselines (e.g., SOTO, FEN), and its modular design is compatible with both policy- and value-based learners. EcoFair-CH-MARL therefore advances the feasibility of large-scale, regulation-compliant, and socially responsible multi-agent coordination in safety-critical domains.
comment: Conference: The 28th European Conference on Artificial Intelligence (ECAI)
An End-to-end Architecture for Collider Physics and Beyond
We present, to our knowledge, the first language-driven agent system capable of executing end-to-end collider phenomenology tasks, instantiated within a decoupled, domain-agnostic architecture for autonomous High-Energy Physics phenomenology. Guided only by natural-language prompts supplemented with standard physics notation, ColliderAgent carries out workflows from a theoretical Lagrangian to final phenomenological outputs without relying on package-specific code. In this framework, a hierarchical multi-agent reasoning layer is coupled to Magnus, a unified execution backend for phenomenological calculations and simulation toolchains. We validate the system on representative literature reproductions spanning leptoquark and axion-like-particle scenarios, higher-dimensional effective operators, parton-level and detector-level analyses, and large-scale parameter scans leading to exclusion limits. These results point to a route toward more automated, scalable, and reproducible research in collider physics, cosmology, and physics more broadly.
comment: 15 pages, 3 figures, project website: https://github.com/HET-AGI/ColliderAgent
Autonomous Agents Coordinating Distributed Discovery Through Emergent Artifact Exchange
We present ScienceClaw + Infinite, a framework for autonomous scientific investigation in which independent agents conduct research without central coordination, and any contributor can deploy new agents into a shared ecosystem. The system is built around three components: an extensible registry of over 300 interoperable scientific skills, an artifact layer that preserves full computational lineage as a directed acyclic graph (DAG), and a structured platform for agent-based scientific discourse with provenance-aware governance. Agents select and chain tools based on their scientific profiles, produce immutable artifacts with typed metadata and parent lineage, and broadcast unsatisfied information needs to a shared global index. The ArtifactReactor enables plannerless coordination: peer agents discover and fulfill open needs through pressure-based scoring, while schema-overlap matching triggers multi-parent synthesis across independent analyses. An autonomous mutation layer actively prunes the expanding artifact DAG to resolve conflicting or redundant workflows, while persistent memory allows agents to continuously build upon complex epistemic states across multiple cycles. Infinite converts these outputs into auditable scientific records through structured posts, provenance views, and machine-readable discourse relations, with community feedback steering subsequent investigation cycles. Across four autonomous investigations, peptide design for the somatostatin receptor SSTR2, lightweight impact-resistant ceramic screening, cross-domain resonance bridging biology, materials, and music, and formal analogy construction between urban morphology and grain-boundary evolution, the framework demonstrates heterogeneous tool chaining, emergent convergence among independently operating agents, and traceable reasoning from raw computation to published finding.
MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering
Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare focus heavily on accuracy, ignoring such privacy issues, despite strict regulations like the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.
comment: 17 pages, 5 figures
Understanding Strategic Platform Entry and Seller Exploration: A Stackelberg Model WWW
Online market platforms play an increasingly powerful role in the economy. An empirical phenomenon is that platforms, such as Amazon, Apple, and DoorDash, also enter their own marketplaces, imitating successful products developed by third-party sellers. We formulate a Stackelberg model, where the platform acts as the leader by committing to an entry policy: when will it enter and compete on a product? We study this model through a theoretical and computational framework. We begin with a single seller, and consider different kinds of policies for entry. We characterize the seller's optimal explore-exploit strategy via a Gittins-index policy, and give an algorithm to compute the platform's optimal entry policy. We then consider multiple sellers, to account for competition and information spillover. Here, the Gittins-index characterization fails, and we employ deep reinforcement learning to examine seller equilibrium behavior. Our findings highlight the incentives that drive platform entry and seller innovation, consistent with empirical evidence from markets such as Amazon and Google Play, with implications for regulatory efforts to preserve innovation and market diversity.
comment: 12 pages, 3 figures, Accepted to The Web Conference (WWW) 2026
The Provenance Paradox in Multi-Agent LLM Routing: Delegation Contracts and Attested Identity in LDP
Multi-agent LLM systems delegate tasks across trust boundaries, but current protocols do not govern delegation under unverifiable quality claims. We show that when delegates can inflate self-reported quality scores, quality-based routing produces a provenance paradox: it systematically selects the worst delegates, performing worse than random. We extend the LLM Delegate Protocol (LDP) with delegation contracts that bound authority through explicit objectives, budgets, and failure policies; a claimed-vs-attested identity model that distinguishes self-reported from verified quality; and typed failure semantics enabling automated recovery. In controlled experiments with 10 simulated delegates and validated with real Claude models, routing by self-claimed quality scores performs worse than random selection (simulated: 0.55 vs. 0.68; real models: 8.90 vs. 9.30), while attested routing achieves near-optimal performance (d = 9.51, p < 0.001). Sensitivity analysis across 36 configurations confirms the paradox emerges reliably when dishonest delegates are present. All extensions are backward-compatible with sub-microsecond validation overhead.
comment: 9 pages, 6 figures. Open-source: https://github.com/sunilp/ldp-protocol
Federated Multi-Agent Mapping for Planetary Exploration
Multi-agent robotic exploration stands to play an important role in space exploration as the next generation of robotic systems ventures to far-flung environments. A key challenge in this new paradigm will be to effectively share and utilize the vast amount of data generated onboard while operating in bandwidth-constrained regimes typical of space missions. Federated learning (FL) is a promising tool for bridging this gap. Drawing inspiration from the upcoming CADRE Lunar rover mission, we propose a federated multi-agent mapping approach that jointly trains a global map model across agents without transmitting raw data. Our method leverages implicit neural mapping to generate parsimonious, adaptable representations, reducing data transmission by up to 93.8% compared to raw maps. Furthermore, we enhance this approach with meta-initialization on Earth-based traversability datasets to significantly accelerate map convergence, reducing the iterations required to reach target performance by 80% compared to random initialization. We demonstrate the efficacy of our approach on Martian terrains and glacier datasets, achieving downstream path planning F1 scores as high as 0.95 while outperforming on map reconstruction losses.
comment: 7 pages, 6 figures
Dominated Actions in Imperfect-Information Games
Dominance is a fundamental concept in game theory. In normal-form games dominated strategies can be identified in polynomial time. As a consequence, iterative removal of dominated strategies can be performed efficiently as a preprocessing step for reducing the size of a game before computing a Nash equilibrium. For imperfect-information games in extensive form, we could convert the game to normal form and then iteratively remove dominated strategies in the same way; however, this conversion may cause an exponential blowup in game size. In this paper we define and study the concept of dominated actions in imperfect-information games. Our main result is a polynomial-time algorithm for determining whether an action is dominated (strictly or weakly) by any mixed strategy in two-player perfect-recall games with publicly observable actions, which can be extended to iteratively remove dominated actions. This allows us to efficiently reduce the size of the game tree as a preprocessing step for Nash equilibrium computation. We explore the role of dominated actions empirically in "All In or Fold" No-Limit Texas Hold'em poker.
R3R: Decentralized Multi-Agent Collision Avoidance with Infinite-Horizon Safety
Existing decentralized methods for multi-agent motion planning lack formal, infinite-horizon safety guarantees, especially for communication-constrained systems. We present R3R which, to our knowledge, is the first decentralized and asynchronous framework for multi-agent motion planning under range-limited communication constraints with infinite-horizon safety guarantees for systems of nonlinear agents. R3R's novelty lies in combining our gatekeeper safety framework with a geometric constraint termed R-Boundedness, which together establish a formal link between an agent's communication radius and its ability to plan safely. We constrain trajectories to lie within a fixed planning radius, determined by a function of the agent's communication radius. This enables trajectories to be certified as provably safe for all time using only local information. Our algorithm is fully asynchronous, and ensures the forward invariance of these guarantees even in time-varying networks where agents asynchronously join and replan. We evaluate our approach in simulations of up to 128 Dubins vehicles, validating our theoretical safety guarantees in dense, obstacle-rich scenarios. We further show that R3R's computational complexity scales with local agent density rather than problem size, providing a practical solution for scalable and provably safe multi-agent systems.
comment: 8 pages, LaTeX; submitted to the American Control Conference (ACC) 2026
QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?
Credit assignment remains a fundamental challenge in multi-agent reinforcement learning (MARL) and is commonly addressed through value decomposition under the centralized training with decentralized execution (CTDE) paradigm. However, existing value decomposition methods typically rely on predefined mixing networks that require additional training, often leading to imprecise credit attribution and limited interpretability. We propose QLLM, a novel framework that leverages large language models (LLMs) to construct training-free credit assignment functions (TFCAFs), where the TFCAFs are nonlinear with respect to the global state and offer enhanced interpretability while introducing no extra learnable parameters. A coder-evaluator framework is employed to ensure the correctness and executability of the generated code. Extensive experiments on standard MARL benchmarks demonstrate that QLLM consistently outperforms baselines while requiring fewer learnable parameters. Furthermore, it demonstrates generalization across a broad set of value decomposition algorithms. Code is available at https://github.com/MaoMaoLYJ/pymarl-qllm.
Emergent Coordination in Multi-Agent Language Models
When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test -- in a purely data-driven way -- whether multi-agent systems show signs of higher-order structure. This information decomposition lets us measure whether dynamical emergence is present in multi-agent LLM systems, localize it, and distinguish spurious temporal coupling from performance-relevant cross-agent synergy. We implement a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time-delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and minimal group-level feedback with three randomized interventions. Groups in the control condition exhibit strong temporal synergy but little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity-linked differentiation. Combining personas with an instruction to ``think about what other agents might do'' shows identity-linked differentiation and goal-directed complementarity across agents. Taken together, our framework establishes that multi-agent LLM systems can be steered with prompt design from mere aggregates to higher-order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination-free baselines or temporal dynamics alone. Without attributing human-like cognition to the agents, the patterns of interaction we observe mirror well-established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.
SERN: Bandwidth-Adaptive Cross-Reality Synchronization for Simulation-Enhanced Robot Navigation
Cross-reality integration of simulation and physical robots is a promising approach for multi-robot operations in contested environments, where communication may be intermittent, interference may be present, and observability may be degraded. We present SERN (Simulation-Enhanced Realistic Navigation), a framework that tightly couples a high-fidelity virtual twin with physical robots to support real-time collaborative decision making. SERN makes three main contributions. First, it builds a virtual twin from geospatial and sensor data and continuously corrects it using live robot telemetry. Second, it introduces a physics-aware synchronization pipeline that combines predictive modeling with adaptive PD control. Third, it provides a bandwidth-adaptive ROS bridge that prioritizes critical topics when communication links are constrained. We also introduce a multi-metric cost function that balances latency, reliability, computation, and bandwidth. Theoretically, we show that when the adaptive controller keeps the physical and virtual input mismatch small, synchronization error remains bounded under moderate packet loss and latency. Empirically, SERN reduces end-to-end message latency by 15% to 25% and processing load by about 15% compared with a standard ROS setup, while maintaining tight real-virtual alignment with less than 5 cm positional error and less than 2 degrees rotational error. In a navigation task, SERN achieves a 95% success rate, compared with 85% for a real-only setup and 70% for a simulation-only setup, while also requiring fewer interventions and less time to reach the goal. These results show that a simulation-enhanced cross-reality stack can improve situational awareness and multi-agent coordination in contested environments by enabling look-ahead planning in the virtual twin while using real sensor feedback to correct discrepancies.
Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning IJCAI'25
Opponent modeling methods typically involve two crucial steps: building a belief distribution over opponents' strategies, and exploiting this opponent model by playing a best response. However, existing approaches typically require domain-specific heuristics to come up with such a model, and algorithms for approximating best responses are hard to scale in large, imperfect information domains. In this work, we introduce a scalable and generic multiagent training regime for opponent modeling using deep game-theoretic reinforcement learning. We first propose Generative Best Response (GenBR), a best response algorithm based on Monte-Carlo Tree Search (MCTS) with a learned deep generative model that samples world states during planning. This new method scales to large imperfect information domains and can be used plug-and-play in a variety of multiagent algorithms. We use this new method under the framework of Policy Space Response Oracles (PSRO), to automate the generation of an \emph{offline opponent model} via iterative game-theoretic reasoning and population-based training. We propose using solution concepts based on bargaining theory to build up an opponent mixture, which we find identifies profiles that are near the Pareto frontier. Then GenBR keeps updating an \emph{online opponent model} and reacts against it during gameplay. We conduct behavioral studies where human participants negotiate with our agents in Deal-or-No-Deal, a class of bilateral bargaining games. Search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that, when negotiating with humans, achieve social welfare and Nash bargaining scores comparable to those of humans trading among themselves.
comment: Accepted by IJCAI'25 main track
Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution ICLR 2026
We present Verified Multi-Agent Orchestration (VMAO), a framework that coordinates specialized LLM-based agents through a verification-driven iterative loop. Given a complex query, our system decomposes it into a directed acyclic graph (DAG) of sub-questions, executes them through domain-specific agents in parallel, verifies result completeness via LLM-based evaluation, and adaptively replans to address gaps. The key contributions are: (1) dependency-aware parallel execution over a DAG of sub-questions with automatic context propagation, (2) verification-driven adaptive replanning that uses an LLM-based verifier as an orchestration-level coordination signal, and (3) configurable stop conditions that balance answer quality against resource usage. On 25 expert-curated market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1-5 scale) compared to a single-agent baseline, demonstrating that orchestration-level verification is an effective mechanism for multi-agent quality assurance.
comment: ICLR 2026 Workshop on MALGAI
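The dependency-aware parallel execution described in the VMAO abstract can be illustrated with a minimal sketch. It assumes a hypothetical set of sub-question names and only computes which sub-questions become ready together; the agent calls, verification, and replanning steps are not modeled, and this is not the authors' implementation.

```python
from graphlib import TopologicalSorter


def execution_waves(deps):
    """Group DAG nodes into waves.

    `deps` maps each sub-question to the set of sub-questions it depends on.
    All nodes in one wave have every dependency satisfied, so they could be
    dispatched to agents in parallel.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical decomposition of a market research query (names are illustrative).
deps = {
    "market_size": set(),
    "competitors": set(),
    "pricing": {"competitors"},
    "summary": {"market_size", "pricing"},
}
waves = execution_waves(deps)
```

Here `market_size` and `competitors` form the first wave, `pricing` the second, and `summary` the last; in a real orchestrator each wave would be handed to a thread or task pool.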
Systems and Control (EESS)
EcoFair-CH-MARL: Scalable Constrained Hierarchical Multi-Agent RL with Real-Time Emission Budgets and Fairness Guarantees ECAI
Global decarbonisation targets and tightening market pressures demand maritime logistics solutions that are simultaneously efficient, sustainable, and equitable. We introduce EcoFair-CH-MARL, a constrained hierarchical multi-agent reinforcement learning framework that unifies three innovations: (i) a primal-dual budget layer that provably bounds cumulative emissions under stochastic weather and demand; (ii) a fairness-aware reward transformer with dynamically scheduled penalties that enforces max-min cost equity across heterogeneous fleets; and (iii) a two-tier policy architecture that decouples strategic routing from real-time vessel control, enabling linear scaling in agent count. New theoretical results establish O(\sqrt{T}) regret for both constraint violations and fairness loss. Experiments on a high-fidelity maritime digital twin (16 ports, 50 vessels) driven by automatic identification system traces, plus an energy-grid case study, show up to 15% lower emissions, 12% higher throughput, and a 45% fair-cost improvement over state-of-the-art hierarchical and constrained MARL baselines. In addition, EcoFair-CH-MARL achieves stronger equity (lower Gini and higher min-max welfare) than fairness-specific MARL baselines (e.g., SOTO, FEN), and its modular design is compatible with both policy- and value-based learners. EcoFair-CH-MARL therefore advances the feasibility of large-scale, regulation-compliant, and socially responsible multi-agent coordination in safety-critical domains.
comment: Conference: The 28th European Conference on Artificial Intelligence (ECAI)
Progress-Based Fault Detection and Health-Aware Task Allocation for Heterogeneous Multi-Robot Systems
We present a progress-based fault detection module and its integration with dynamic task allocation for heterogeneous robot teams. The detector monitors a normalized task-completion signal with a lightweight Kalman filter (KF) and a normalized innovation squared (NIS) test, augmented with a low-rate stall gate, an uncertainty gate, and debounce logic. Health estimates influence the allocator via health-weighted costs and health-dependent masks; reallocation is event-triggered and regularized with an $\ell_1$ assignment-change penalty to limit reassignment churn while preserving feasibility through slack variables. The detector has constant per-robot update cost, and the allocation remains a convex quadratic program (QP). Experiments on a common team-task setup evaluate measurement-noise increases, velocity-slip biases, communication dropouts, and task abandonment. The results show timely detection in the noise and bias cases, maintained task completion with limited reassignment, and the expected observability delays under communication dropouts.
comment: Accepted for publication in the Proceedings of the 2026 American Control Conference (ACC)
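The detector idea in the abstract above (a Kalman filter on a progress signal plus a normalized innovation squared test with debounce) can be sketched as follows. This is a minimal scalar illustration under assumed values: the random-walk-with-drift model, the noise variances `q` and `r`, the expected per-step progress `drift`, and the debounce length are all hypothetical, and the stall gate and uncertainty gate from the paper are not modeled.

```python
CHI2_1DOF_95 = 3.841  # 95% quantile of the chi-square distribution, 1 degree of freedom


def kf_nis_step(x, P, z, drift, q=1e-6, r=1e-4):
    """One scalar Kalman filter step; returns (state, variance, NIS)."""
    # Predict with an assumed random-walk-with-drift progress model.
    x_pred = x + drift
    P_pred = P + q
    # Innovation and its variance.
    nu = z - x_pred
    S = P_pred + r
    nis = nu * nu / S  # normalized innovation squared
    # Measurement update.
    K = P_pred / S
    return x_pred + K * nu, (1.0 - K) * P_pred, nis


def detect(measurements, drift=0.1, debounce=2):
    """Return the first index at which NIS has exceeded the threshold for
    `debounce` consecutive steps, or None if no fault is flagged."""
    x, P, streak = measurements[0], 1e-4, 0
    for k, z in enumerate(measurements[1:], start=1):
        x, P, nis = kf_nis_step(x, P, z, drift)
        streak = streak + 1 if nis > CHI2_1DOF_95 else 0
        if streak >= debounce:
            return k
    return None


nominal = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # progress advances as expected
stalled = [0.0, 0.1, 0.2, 0.2, 0.2, 0.2]   # progress stops after step 2
```

With these assumed parameters the nominal trace raises no flag, while the stalled trace is flagged once two consecutive innovations exceed the threshold. Each update costs a fixed number of arithmetic operations, in line with the constant per-robot update cost mentioned in the abstract.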
Functional Safety Analysis for Infrastructure-Enabled Depot Autonomy System
This paper presents the functional safety analysis for an Infrastructure-Enabled Depot Autonomy (IX-DA) system. The IX-DA system automates the marshalling of delivery vehicles within a controlled depot environment, navigating connected autonomous vehicles (CAVs) between drop-off zones, service stations (washing, calibration, charging, loading), and pick-up zones without human intervention. We describe the system architecture comprising three principal subsystems -- the connected autonomous vehicle, the infrastructure sensing and compute layer, and the human operator interface -- and derive their functional requirements. Using ISO 26262-compliant Hazard Analysis and Risk Assessment (HARA) methodology, we identify eight hazardous events, evaluate them across different operating scenarios, and assign Automotive Safety Integrity Levels~(ASILs) ranging from Quality Management (QM) to ASIL C. Six safety goals are derived and allocated to vehicle and infrastructure subsystems. The analysis demonstrates that high-speed uncontrolled operation imposes the most demanding safety requirements (ASIL C), while controlled low-speed operation reduces most goals to QM, offering a practical pathway for phased deployment.
Collective Grid: Privacy-Preserved Multi-Operator Energy Sharing Optimization via Federated Energy Prediction
Electricity consumption in mobile networks is increasing with the continued 5G expansion, rising data traffic, and more complex infrastructures. However, energy management is often handled independently by each mobile network operator (MNO), leading to limited coordination and missed opportunities for collective efficiency gains. To address this gap, we propose a privacy-preserving framework for automated energy infrastructure sharing among co-located MNOs. Our framework consists of three modules: (i) a federated learning-based privacy-preserving site energy consumption forecasting module, (ii) an orchestration module in which a mixed-integer linear program is solved to schedule energy purchases from the grid, utilization of renewable sources, and shared battery charging or discharging, based on real-time prices, forecasts, and battery state, and (iii) an energy source selection module which handles the selection of cost-effective power sources and storage actions based on predicted demand across MNOs for the next control window. Using data from operational networks, our experiments confirm that the proposed solution substantially reduces operational costs and outperforms non-sharing baselines, with gains that increase as network density rises in 5G-and-beyond deployments.
comment: 6 pages, 6 figures, accepted in ICC
Consensus in Plug-and-Play Heterogeneous Dynamical Networks: A Passivity Compensation Approach
This paper investigates output consensus in heterogeneous dynamical networks within a plug-and-play framework. The networks are interconnected through nonlinear diffusive couplings and operate in the presence of measurement and communication noise. Focusing on systems that are input feedforward passive (IFP), we propose a passivity-compensation approach that exploits the surplus passivity of coupling links to locally offset shortages of passivity at the nodes. This mechanism enables subnetworks to be interconnected without requiring global reanalysis, thereby preserving modularity. Specifically, we derive locally verifiable interface conditions, expressed in terms of passivity indices and coupling gains, to guarantee that consensus properties of individual subnetworks are preserved when forming larger networks.
High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise
We present the first uniform-in-time high-probability bound for SGD under the PL condition, where the gradient noise contains both Markovian and martingale difference components. This significantly broadens the scope of finite-time guarantees, as the PL condition arises in many machine learning and deep learning models while Markovian noise naturally arises in decentralized optimization and online system identification problems. We further allow the magnitude of noise to grow with the function value, enabling the analysis of many practical sampling strategies. In addition to the high-probability guarantee, we establish a matching $1/k$ decay rate for the expected suboptimality. Our proof technique relies on the Poisson equation to handle the Markovian noise and a probabilistic induction argument to address the lack of almost-sure bounds on the objective. Finally, we demonstrate the applicability of our framework by analyzing three practical optimization problems: token-based decentralized linear regression, supervised learning with subsampling for privacy amplification, and online system identification.
comment: Submitted to SIAM Journal on Optimization
Bayesian and Classical Feature Ranking for Interpretable BLDC Fault Diagnosis
This paper compares Bayesian and classical feature ranking methods for interpretable fault diagnosis of brushless DC (BLDC) motors. Two Bayesian approaches, spike-and-slab and ARD logistic ranking, are evaluated against three classical baselines on a public BLDC benchmark in binary and multiclass settings using current-based, rotational-speed-based, and combined feature sets. The strongest overall results are obtained for the combined representation. In binary classification, ReliefF achieves the highest balanced accuracy of 0.923, while ARD logistic and spike-and-slab remain very close at 0.919 and 0.920 with much smaller subsets ($k=5$). In multiclass classification, ARD logistic performs best for the combined variant with balanced accuracy 0.914, followed closely by LASSO (0.913) and spike-and-slab (0.912). The results show that Bayesian ranking is particularly competitive for current-only and combined descriptors, while ReliefF remains especially effective for speed-based ranking. Because the benchmark consists of short segmented observations from a limited number of experimental conditions, the findings are interpreted primarily as benchmark-specific evidence rather than strong claims of fault generalization.
comment: This work has been submitted to the IEEE for possible publication
Surgi-HDTMR: Closing the Sensorimotor Loop in Bimanual Microsurgery via Haptics, Digital Twin, and Mixed Reality
Robotic microsurgery demands precise bimanual control, intuitive interaction, and informative force feedback. However, most training platforms for robotic microsurgery lack immersive 3D interaction and high-fidelity haptics. Here, we present Surgi-HDTMR, a mixed-reality (MR) and digital-twin (DT) training system that couples bimanual haptic teleoperation with a benchtop microsurgical robotic platform, and 3D-printed phantoms. A metrically co-registered, time-synchronized DT aligns in-situ MR guidance with the physical workspace and drives a depth-adaptive haptic model that renders contact, puncture, and tissue-retraction forces. In a within-subjects study of simulated cortical navigation and tumor resection, Surgi-HDTMR shortened task time, reduced harmful contacts and collisions, and improved perceptual accuracy relative to non-haptic and non-adaptive baselines. These results suggest that tightly coupling MR overlays with a synchronized DT, together with depth-adaptive haptics, can accelerate skill acquisition and improve safety in robot-assisted microsurgery, pointing toward next-generation surgical training.
Predicting power grid frequency dynamics with invertible Koopman-based architectures
The system frequency is a critical measure of power system stability, and understanding and modeling it are key to ensuring reliable power system operations. Koopman-based autoencoders are effective at approximating complex nonlinear data patterns, with potential applications in the frequency dynamics of power systems. However, their non-invertibility can result in a distorted latent representation, leading to significant prediction errors. Invertible neural networks (INNs) in combination with the Koopman operator framework provide a promising approach to address these limitations. In this study, we analyze different INN architectures and train them on simulation datasets. We further apply extensions to the networks to address inherent limitations of INNs and evaluate their impact. We find that coupling-layer INNs achieve the best performance when used in isolation. In addition, we demonstrate that hybrid approaches can improve the performance when combined with suitable INNs, while reducing the generalization capabilities in combination with disadvantageous architectures. Overall, our results provide a clearer overview of how architectural choices influence INN performance, offering guidance for selecting and designing INNs for modeling power system frequency dynamics.
A Comprehensive Survey of Redundancy Systems with a Focus on Triple Modular Redundancy (TMR)
Despite its maturity, the field of fault-tolerant redundancy suffers from significant terminological fragmentation, where functionally equivalent methods are frequently described under disparate names across academic and industrial domains. This survey addresses this ambiguity by providing a structured and comprehensive analysis of redundancy techniques, with a primary focus on Triple Modular Redundancy (TMR). A unified taxonomy is established to classify redundancy strategies into Spatial, Temporal, and Mixed categories, alongside the introduction of a novel five-class framework for voter architectures. Key findings synthesize practical tradeoffs, contrasting high-reliability spatial TMR for safety-critical applications against resource-efficient temporal methods for constrained systems. Furthermore, the shift toward Mixed and Adaptive TMR (e.g., Approximate Triple Modular Redundancy (ATMR), X-Rel) for dynamic and error-tolerant applications, such as Artificial Intelligence (AI) acceleration, is explored. This work identifies critical research gaps, including the threat of Multi-Bit Upsets (MBUs) in sub-28nm technologies, the scarcity of public-domain data on proprietary high-integrity systems, and the absence of high-level toolchains for dynamic reconfiguration. Finally, suggestions are offered for future research directions, emphasizing the need for terminological standardization, MBU-resilient design methodologies, and the development of open-source tools for adaptive fault tolerance.
comment: 33 Pages, 7 Figures, under review in ACM Computing Surveys
DRCC-LPVMPC: Robust Data-Driven Control for Autonomous Driving and Obstacle Avoidance
Safety in obstacle avoidance is critical for autonomous driving. While model predictive control (MPC) is widely used, simplified prediction models such as linearized or single-track vehicle models introduce discrepancies between predicted and actual behavior that can compromise safety. This paper proposes a distributionally robust chance-constrained linear parameter-varying MPC (DRCC-LPVMPC) framework that explicitly accounts for such discrepancies. The single-track vehicle dynamics are represented in a quasi-linear parameter-varying (quasi-LPV) form, with model mismatches treated as additive uncertainties of unknown distribution. By constructing chance constraints from finite sampled data and employing a Wasserstein ambiguity set, the proposed method avoids restrictive assumptions on boundedness or Gaussian distributions. The resulting DRCC problem is reformulated as tractable convex constraints and solved in real time using a quadratic programming solver. Recursive feasibility of the approach is formally established. Simulation and real-world experiments demonstrate that DRCC-LPVMPC maintains safer obstacle clearance and more reliable tracking than conventional nonlinear MPC and LPVMPC controllers under significant uncertainties.
Robust Safety Filters for Lipschitz-Bounded Adaptive Closed-Loop Systems with Structured Uncertainties
Adaptive control provides closed-loop stability and reference tracking for uncertain dynamical systems through online parameter adaptation. These properties alone, however, do not ensure safety in the sense of forward invariance of state constraints, particularly during transient phases of adaptation. Control barrier function (CBF)-based safety filters have been proposed to address this limitation, but existing approaches often rely on conservative constraint tightening or static safety margins within quadratic program formulations. This paper proposes a reference-based adaptive safety framework for systems with structured parametric uncertainty that explicitly accounts for transient plant-reference mismatch. Safety is enforced at the reference level using a barrier-function-based filter, while adaptive control drives the plant to track the safety-certified reference. By exploiting Lipschitz bounds on the closed-loop error dynamics, a robust CBF condition is derived and reformulated as a convex second-order cone program (SOCP). The resulting approach reduces conservatism while preserving formal guarantees of forward invariance, stability, and tracking.
comment: 6 pages, 4 figures, submitted to the IEEE for possible publication
DexterousMag: A Reconfigurable Electromagnetic Actuation System for Miniature Helical Robot
Despite the promise of magnetically actuated miniature helical robots for minimally invasive interventions, state-of-the-art electromagnetic actuation systems are often space-inefficient and geometrically fixed. These constraints hinder clinical translation and, moreover, prevent task-adaptive trade-offs among workspace coverage, energy distribution, and field/gradient capability. We present DexterousMag, a robot-arm-assisted three-coil electromagnetic actuation system that enables continuous geometric reconfiguration of a compact coil group, thereby redistributing magnetic-field and gradient capability for task-adaptive operation. The reconfiguration is realized by a parallel mechanism that exposes a single geometric DOF of the coil group, conveniently parameterized by the polar angle. Using an FEM-based modeling pipeline, we precompute actuation and gradient libraries and quantify the resulting trade-offs under current limits: configurations that favor depth reach expand the feasible region but reduce peak field/gradient, whereas configurations that favor near-surface capability concentrate stronger fields/gradients and support lifting. We validate these trade-offs on representative tasks (deep translation, planar tracking, and 3D lifting) and further demonstrate a proof-of-concept online geometry scheduling scheme for combined tasks, benchmarked against fixed-geometry settings. Overall, DexterousMag establishes continuous geometric reconfiguration as an operational mechanism for enlarging the practical envelope of miniature helical robot actuation while improving energy efficiency and safety.
Context-Aware Adaptive Shared Control for Magnetically-Driven Bimanual Dexterous Micromanipulation
Magnetically actuated robots provide a promising untethered platform for navigation in confined environments, enabling biological studies and targeted micro-delivery. However, dexterous manipulation in complex structures remains challenging. While single-arm magnetic actuation suffices for simple transport, steering through tortuous or bifurcating channels demands coordinated control of multiple magnetic sources to generate the torques required for precise rotation and directional guidance. Bimanual teleoperation enables such dexterous steering but imposes high cognitive demands, as operators must handle the nonlinear dynamics of magnetic actuation while coordinating two robotic manipulators. To address these limitations, we propose Bi-CAST, a context-aware adaptive shared control framework for bimanual magnetic micromanipulation. A multimodal network fuses spatio-temporal visual features, spatial risk metrics, and historical states to continuously adjust the control authority of each manipulator in real time. In parallel, a bidirectional haptic interface integrates force-based intent recognition with risk-aware guidance, enabling force feedback to provide a continuous channel for dynamic human-machine authority negotiation. We validate the framework through user studies with eight participants performing three navigation tasks of increasing complexity in a vascular phantom. Compared with fixed authority and discrete switching baselines, Bi-CAST achieves up to 76.6% reduction in collisions, 25.9% improvement in trajectory smoothness, and 44.4% lower NASA-TLX workload, while delivering the fastest task completion times.
Data-Enabled Policy and Value Iteration for Continuous-Time Linear Quadratic Output Feedback Control
This paper proposes efficient policy iteration and value iteration algorithms for the continuous-time linear quadratic regulator problem with unmeasurable states and unknown system dynamics, from the perspective of direct data-driven control. Specifically, by re-examining the data characteristics of input-output filtered vectors and introducing QR decomposition, an improved substitute state construction method is presented that further eliminates redundant information, ensures a full row rank data matrix, and enables a complete parameterized representation of the feedback controller. Furthermore, the original problem is transformed into an equivalent linear quadratic regulator problem defined on the substitute state with a known input matrix, verifying the stabilizability and detectability of the transformed system. Consequently, model-free policy iteration and value iteration algorithms are designed that fully exploit the full row rank substitute state data matrix. The proposed algorithms offer distinct advantages: they avoid the need for prior knowledge of the system order or the calculation of signal derivatives and integrals; the iterative equations can be solved directly without relying on the traditional least-squares paradigm, guaranteeing feasibility in both single-output and multi-output settings; and they demonstrate superior numerical stability, reduced data demand, and higher computational efficiency. Moreover, the heuristic results regarding trajectory generation for continuous-time systems are discussed, circumventing potential failure modes associated with existing approaches.
Low-Data Predictive Maintenance of Railway Station Doors and Elevators Using Bayesian Proxy Flow Modeling
This paper proposes a low-data predictive maintenance framework for automatic doors and elevators in a railway station building. The method is intended for assets without direct condition monitoring, where only aggregate passenger traffic information and expert knowledge about movement patterns are available. Passenger flows are modeled on a reduced station graph using a Bayesian formulation with uncertain totals and routing shares. The inferred flows are converted into approximate operating-cycle loads for doors and elevators through simple stochastic proxy relations. These loads are combined with uncertain age- and cycle-based maintenance thresholds to estimate the probability that predefined maintenance conditions have been reached. A cost-aware scheduling model is then used to align maintenance activities while accounting for service costs, disruption, delay penalties, and grouping opportunities within each asset class. The framework is illustrated on a simulated case study reflecting a real station layout. The results show that proxy operational data can support maintenance scheduling with low incremental implementation cost and can improve alignment relative to a calendar-based policy.
comment: This work has been submitted to the IEEE for possible publication
A Systematic Comparison and Evaluation of Building Ontologies for Deploying Data-Driven Analytics in Smart Buildings
Ontologies play a critical role in data exchange, information integration, and knowledge sharing across diverse smart building applications. Yet, semantic differences between the prevailing building ontologies hamper their purpose of bringing data interoperability and restrict the ability to reuse building ontologies in real-world applications. In this paper, we propose and adopt a framework to conduct a systematic comparison and evaluation of four popular building ontologies (Brick Schema, RealEstateCore, Project Haystack and Google's Digital Buildings) from both axiomatic design and assertions in a use case, namely the Terminological Box (TBox) evaluation and the Assertion Box (ABox) evaluation. In the TBox evaluation, we use the SQuaRE-based Ontology Quality Evaluation (OQuaRE) Framework and conclude that Project Haystack and Brick Schema are more compact with respect to the ontology axiomatic design. In the ABox evaluation, we apply an empirical study with sample building data that suggests that Brick Schema and RealEstateCore have greater completeness and expressiveness in capturing the main concepts and relations within the building domain. The results implicitly indicate that there is no universal building ontology for integrating Linked Building Data (LBD). We also discuss ontology compatibility and investigate building ontology design patterns (ODPs) to support ontology matching, alignment, and harmonisation.
comment: 32 pages
Topological Conditions for Echo Chamber Formation under the FJ model: A Cluster Consensus-based Approach
The Friedkin-Johnsen (FJ) model is a popular opinion dynamics model that explains the disagreement that can occur even among closely interacting individuals. Cluster consensus is a special type of disagreement, where agents in a network split into subgroups such that those within a subgroup agree and those in different subgroups disagree. In large-scale social networks, users often distribute into echo chambers (i.e., groups of users with aligned views) while discussing contested issues such as electoral politics, social norms, etc. Additionally, they are exposed only to opinions and news sources that align with their existing beliefs. Hence, the interaction network plays a key role in the formation of an echo chamber. Since cluster consensus can represent echo chambers in a social network, we examine the conditions for cluster consensus under the FJ model with the objective of determining the properties of the interaction network that lead to echo chamber formation. We present topology-based necessary and sufficient conditions for cluster consensus under the FJ model, regardless of the edge weights in the network and stubbornness values (which are parameters that are difficult to estimate in a social network). A major advantage of the proposed results is that they are applicable to arbitrary digraphs. Moreover, using the proposed conditions, we explain the emergence of bow-tie structures which are often observed in real-world echo chambers. Finally, we also develop a computationally feasible methodology to verify the proposed conditions for cluster consensus.
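To make the notion of cluster consensus under the FJ model concrete, the sketch below iterates the standard FJ update x(k+1) = (I - B) W x(k) + B x(0), where W is a row-stochastic influence matrix and B is the diagonal matrix of stubbornness values (notation varies across papers). The four-agent network, weights, and initial opinions are hypothetical and chosen only for illustration; the sketch does not implement the topological conditions derived in the paper.

```python
def fj_iterate(W, beta, x0, steps=200):
    """Iterate the Friedkin-Johnsen update for a fixed number of steps.

    W    : row-stochastic influence weights, W[i][j] = weight agent i puts on j
    beta : stubbornness of each agent in [0, 1] (1 = never moves from x0)
    x0   : initial opinions
    """
    n = len(x0)
    x = list(x0)
    for _ in range(steps):
        # Weighted average of neighbours' current opinions.
        avg = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Blend the social average with the agent's own initial opinion.
        x = [(1.0 - beta[i]) * avg[i] + beta[i] * x0[i] for i in range(n)]
    return x


# Hypothetical network: agents 0 and 2 are fully stubborn sources;
# agent 1 listens only to itself and agent 0, agent 3 only to itself and agent 2.
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
beta = [1.0, 0.0, 1.0, 0.0]
x0 = [0.0, 0.4, 1.0, 0.7]
final = fj_iterate(W, beta, x0)
```

In this toy network agents 0 and 1 end up sharing one opinion and agents 2 and 3 another, so the final opinions form two clusters that agree internally and disagree with each other.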
Geometry-Aware Set-Membership Multilateration: Directional Bounds and Anchor Selection
In this paper, we study anchor selection for range-based localization under unknown-but-bounded measurement errors. We start from the convex localization set $\X=\Xd\cap\Hset$ recently introduced in \cite{CalafioreSIAM}, where $\Xd$ is a polyhedron obtained from pairwise differences of squared-range equations between the unknown location $x$ and the anchors, and $\Hset$ is the intersection of upper-range hyperspheres. Our first goal is \emph{offline} design: we derive geometry-only E- and D-type scores from the centered scatter matrix $S(A)=AQ_mA\tran$, where $A$ collects the anchor coordinates and $Q_m=I_m-\frac{1}{m}\one\one\tran$ is the centering projector, showing that $λ_{\min}(S(A))$ controls worst-direction and diameter surrogates for the polyhedral certificate $\Xd$, while $\det S(A)$ controls principal-axis volume surrogates. Our second goal is \emph{online} uncertainty assessment for a selected subset of anchors: exploiting the special structure $\X=\Xd\cap\Hset$, we derive a simplex-aggregated enclosing ball for $\Hset$ and an exact support-function formula for $\Hset$, which lead to finite hybrid bounds for the actual localization set $\X$, even when the polyhedral certificate deteriorates. Numerical experiments are performed in two dimensions, showing that geometry-based subset selection is close to an oracle combinatorial search, that the D-score slightly dominates the E-score for the area-oriented metric considered here, and that the new $\Hset$-aware certificates track the realized size of the selected localization set closely.
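To make the geometry-only scores concrete, the following sketch computes the centered scatter matrix S(A) = A Q_m A^T for planar anchors and evaluates an E-type score (smallest eigenvalue) and a D-type score (determinant). Since Q_m is a symmetric projector, S(A) equals the sum of outer products of the mean-centered anchor coordinates. The brute-force `best_subset` helper and all function names are assumptions for illustration; the abstract does not specify the paper's selection procedure.

```python
import math
from itertools import combinations


def scatter_2d(anchors):
    """Entries (sxx, sxy, syy) of S(A) = A Q_m A^T for 2D anchors."""
    m = len(anchors)
    cx = sum(p[0] for p in anchors) / m
    cy = sum(p[1] for p in anchors) / m
    sxx = sum((p[0] - cx) ** 2 for p in anchors)
    syy = sum((p[1] - cy) ** 2 for p in anchors)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in anchors)
    return sxx, sxy, syy


def e_score(anchors):
    """E-type score: smallest eigenvalue of the 2x2 scatter matrix."""
    sxx, sxy, syy = scatter_2d(anchors)
    tr = sxx + syy
    det = sxx * syy - sxy * sxy
    disc = max(tr * tr - 4.0 * det, 0.0)  # guard against tiny negative round-off
    return 0.5 * (tr - math.sqrt(disc))


def d_score(anchors):
    """D-type score: determinant of the 2x2 scatter matrix."""
    sxx, sxy, syy = scatter_2d(anchors)
    return sxx * syy - sxy * sxy


def best_subset(anchors, k, score):
    """Exhaustive search over k-subsets (illustrative only, exponential cost)."""
    return max(combinations(anchors, k), key=score)


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
collinear = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
chosen = best_subset(candidates, 3, d_score)
```

Collinear anchors give a singular scatter matrix, so both scores vanish, which matches the intuition that such a layout constrains only one direction.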
On Globally Optimal Stochastic Policy Gradient Methods for Domain Randomized LQR Synthesis
Domain randomization is a simple, effective, and flexible scheme for obtaining robust feedback policies aimed at reducing the sim-to-real gap due to model mismatch. While domain randomization methods have yielded impressive demonstrations in the robotics-learning literature, general and theoretically motivated principles for designing optimization schemes that effectively leverage the randomization are largely unexplored. We address this gap by considering a stochastic policy gradient descent method for the domain randomized linear-quadratic regulator synthesis problem, a situation simple enough to provide theoretical guarantees. In particular, we demonstrate that stochastic gradients obtained by repeatedly sampling new systems at each gradient step converge to global optima with appropriate hyperparameter choices, and yield better final controllers with lower variability than approaches that do not resample. Sampling is often a quick and cheap operation, so computing policy gradients with newly sampled systems at each iteration is preferable to evaluating gradients on a fixed set of systems.
On the Stability of Undesirable Equilibria in the Quadratic Program Framework for Safety-Critical Control
Control Lyapunov functions (CLFs) and Control Barrier Functions (CBFs) have been used to develop provably safe controllers by means of quadratic programs (QPs). This framework guarantees safety in the form of trajectory invariance with respect to a given set, but it can introduce undesirable equilibrium points to the closed loop system, which can be asymptotically stable. In this work, we present a detailed study of the formation and stability of equilibrium points with the CLF-CBF-QP framework with multiple CBFs. In particular, we prove that undesirable equilibrium points occur for most systems, and their stability is dependent on the CLF and CBF geometrical properties. We introduce the concept of CLF-CBF compatibility for a system, regarding a CLF-CBF pair inducing no stable equilibrium points other than the CLF global minimum on the corresponding closed-loop dynamics. Sufficient conditions for CLF-CBF compatibility for LTI and drift-less full-rank systems with quadratic CLF and CBFs are derived, and we propose a novel control strategy to induce smooth changes in the CLF geometry at certain regions of the state space in order to satisfy the CLF-CBF compatibility conditions, aiming to achieve safety with respect to multiple safety objectives and quasi-global convergence of the trajectories towards the CLF minimum. Numerical simulations illustrate the applicability of the proposed method.
comment: Accepted for publication at IFAC Automatica
Input Convex Lipschitz Recurrent Neural Networks for Robust and Efficient Process Modeling and Optimization
Computational efficiency and robustness are essential in process modeling, optimization, and control for real-world engineering applications. While neural network-based approaches have gained significant attention in recent years, conventional neural networks often fail to address these two critical aspects simultaneously or even independently. Inspired by natural physical systems and established literature, input convex architectures are known to enhance computational efficiency in optimization tasks, whereas Lipschitz-constrained architectures improve robustness. However, combining these properties within a single model requires careful review, as inappropriate methods for enforcing one property can undermine the other. To overcome this, we introduce a novel network architecture, termed Input Convex Lipschitz Recurrent Neural Networks (ICL-RNNs). This architecture seamlessly integrates the benefits of convexity and Lipschitz continuity, enabling fast and robust neural network-based modeling and optimization. The ICL-RNN outperforms existing recurrent units in both computational efficiency and robustness. Additionally, it has been successfully applied to practical engineering scenarios, such as chemical process modeling and the modeling and control of Organic Rankine Cycle-based waste heat recovery systems. Source code is available at https://github.com/killingbear999/ICLRNN.
A Modular Architecture Design for Autonomous Driving Racing in Controlled Environments
This paper presents a modular autonomous driving architecture for Formula Student Driverless competition vehicles operating in closed-circuit environments. The perception module employs YOLOv11 for real-time traffic cone detection, achieving 0.93 mAP@0.5 on the FSOCO dataset, combined with neural stereo depth estimation from a ZED 2i camera for 3D cone localization with sub-0.5 m median error at distances up to 7 m. State estimation fuses RTK-GNSS positioning and IMU measurements through an Extended Kalman Filter (EKF) based on a kinematic bicycle model, achieving centimeter-level localization accuracy with a 12 cm improvement over raw GNSS. Path planning computes the racing line via cubic spline interpolation on ordered track boundaries and assigns speed profiles constrained by curvature and vehicle dynamics. A regulated pure pursuit controller tracks the planned trajectory with a dynamic lookahead parameterized by speed error. The complete pipeline is implemented as a modular ROS 2 architecture on an NVIDIA Jetson Orin NX platform, with each subsystem deployed as independent nodes communicating through a dual-computer configuration. Experimental validation combines real-world sensor evaluation with simulation-based end-to-end testing, where realistic sensor error distributions are injected to assess system-level performance under representative conditions.
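The tracking step can be sketched with the textbook pure pursuit law for a kinematic bicycle model: the steering angle is atan(L * kappa), with curvature kappa = 2 sin(alpha) / d, where alpha is the bearing of the lookahead point relative to the vehicle heading and d is its distance. The abstract does not give the rule that sets the lookahead from the speed error, so this sketch simply takes the lookahead distance as an input. The function names, the wheelbase value, and the simplified point-selection rule are assumptions.

```python
import math


def pick_lookahead_point(path, x, y, lookahead):
    """First path point at least `lookahead` away from (x, y), else the last point.

    Simplification: assumes `path` is ordered and starts near the vehicle.
    """
    for px, py in path:
        if math.hypot(px - x, py - y) >= lookahead:
            return (px, py)
    return path[-1]


def pure_pursuit_steering(x, y, yaw, target, wheelbase):
    """Steering angle (rad) that drives a kinematic bicycle through `target`."""
    dx, dy = target[0] - x, target[1] - y
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return 0.0  # already at the target; no steering needed
    alpha = math.atan2(dy, dx) - yaw  # bearing of target in the vehicle frame
    curvature = 2.0 * math.sin(alpha) / dist
    return math.atan(wheelbase * curvature)


path = [(0.5, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
target = pick_lookahead_point(path, 0.0, 0.0, lookahead=1.5)
delta = pure_pursuit_steering(0.0, 0.0, 0.0, target, wheelbase=1.5)
```

A larger lookahead smooths the steering command at the cost of cutting corners, which is why such controllers commonly adapt it to the driving conditions.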
Spiking neurons as predictive controllers of linear systems
Neurons communicate with downstream systems via sparse and incredibly brief electrical pulses, or spikes. Using these events, they control various targets such as neuromuscular units, neurosecretory systems, and other neurons in connected circuits. This gave rise to the idea of spiking neurons as controllers, in which spikes are the control signal. Using instantaneous events directly as the control inputs, also called 'impulse control', is challenging as it does not scale well to larger networks and has low analytical tractability. Therefore, current spiking control usually relies on filtering the spike signal to approximate analog control. This ultimately means spiking neural networks (SNNs) have to output a continuous control signal, necessitating continuous energy input into downstream systems. Here, we circumvent the need for rate-based representations, providing a scalable method for task-specific spiking control with sparse neural activity. In doing so, we take inspiration from both optimal control and neuroscience theory, and define a spiking rule where spikes are only emitted if they bring a dynamical system closer to a target. From this principle, we derive the required connectivity for an SNN, and show that it can successfully control linear systems. We show that for physically constrained systems, predictive control is required, and the control signal ends up exploiting the passive dynamics of the downstream system to reach a target. Finally, we show that the control method scales to both high-dimensional networks and systems. Importantly, in all cases, we maintain a closed-form mathematical derivation of the network connectivity, the network dynamics and the control objective. This work advances the understanding of SNNs as biologically-inspired controllers, providing insight into how real neurons could exert control, and enabling applications in neuromorphic hardware design.
Nonlinear Bayesian Filtering with Natural Gradient Gaussian Approximation
Practical Bayes filters often assume the state distribution of each time step to be Gaussian for computational tractability, resulting in the so-called Gaussian filters. When facing nonlinear systems, Gaussian filters such as the extended Kalman filter (EKF) or unscented Kalman filter (UKF) typically rely on certain linearization techniques, which can introduce large estimation errors. To address this issue, this paper reconstructs the prediction and update steps of Gaussian filtering as solutions to two distinct optimization problems, whose optimal conditions are found to have analytical forms from Stein's lemma. It is observed that the stationary point for the prediction step requires calculating the first two moments of the prior distribution, which is equivalent to that step in existing moment-matching filters. In the update step, instead of linearizing the model to approximate the stationary points, we propose an iterative approach to directly minimize the update step's objective to avoid linearization errors. For the purpose of performing the steepest descent on the Gaussian manifold, we derive its natural gradient that leverages the Fisher information matrix to adjust the gradient direction, accounting for the curvature of the parameter space. Combining this update step with moment matching in the prediction step, we introduce a new iterative filter for nonlinear systems called the Natural Gradient Gaussian Approximation filter, or NANO filter for short. We prove that the NANO filter locally converges to the optimal Gaussian approximation at each time step. Furthermore, the estimation error is proven exponentially bounded for nearly linear measurement equations and low noise levels through constructing a supermartingale-like property across consecutive time steps.
Enhancing Sample Efficiency in Multi-Agent RL with Uncertainty Quantification and Selective Exploration
Multi-agent reinforcement learning (MARL) methods have achieved state-of-the-art results on a range of multi-agent tasks. Yet, MARL algorithms typically require significantly more environment interactions than their single-agent counterparts to converge, a problem exacerbated by the difficulty in exploring over a large joint action space and the high variance intrinsic to MARL environments. To tackle these issues, we propose a novel algorithm that combines a decomposed centralized critic with decentralized ensemble learning, incorporating several key contributions. The main component in our scheme is a selective exploration method that leverages ensemble kurtosis. We extend the global decomposed critic with a diversity-regularized ensemble of individual critics and utilize its excess kurtosis to guide exploration toward high-uncertainty states and actions. To improve sample efficiency, we train the centralized critic with a novel truncated variation of the TD($λ$) algorithm, enabling efficient off-policy learning with reduced variance. On the actor side, our suggested algorithm adapts the mixed samples approach to MARL, mixing on-policy and off-policy loss functions for training the actors. This approach balances between stability and efficiency and outperforms purely off-policy learning. The evaluation shows our method outperforms state-of-the-art baselines on standard MARL benchmarks, including a variety of SMAC II maps.
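The statistic behind the selective exploration idea can be illustrated with a small sketch that computes the excess kurtosis (fourth standardized moment minus 3) of an ensemble's value estimates and picks the action whose estimates have the heaviest tails. The abstract does not say how the kurtosis enters the exploration rule, so `most_uncertain_action`, the dictionary layout, and the zero-variance convention are assumptions for illustration only.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0.0:
        return 0.0  # assumed convention for a degenerate (zero-variance) ensemble
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def most_uncertain_action(q_ensemble):
    """Action whose ensemble estimates have the largest excess kurtosis.

    `q_ensemble` maps each action to the list of critic estimates for it.
    """
    return max(q_ensemble, key=lambda a: excess_kurtosis(q_ensemble[a]))


# Hypothetical estimates from an ensemble of critics for two actions.
q_ensemble = {
    "a": [-1.0, 1.0, -1.0, 1.0],       # evenly split estimates, light tails
    "b": [0.0, 0.0, 0.0, 0.0, 10.0],   # one outlier critic, heavy tail
}
choice = most_uncertain_action(q_ensemble)
```

Here action "b" is selected because a single dissenting critic produces a heavier tail than the evenly split estimates for "a".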
Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty
This paper studies continuous-time risk-sensitive reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation with the exponential-form objective. The risk-sensitive objective arises either as the agent's risk attitude or as a distributionally robust approach against the model uncertainty. Owing to the martingale perspective in Jia and Zhou (J Mach Learn Res 24(161): 1--61, 2023), the risk-sensitive RL problem is shown to be equivalent to ensuring the martingale property of a process involving both the value function and the q-function, augmented by an additional penalty term: the quadratic variation of the value process, capturing the variability of the value-to-go along the trajectory. This characterization allows for the straightforward adaptation of existing RL algorithms developed for non-risk-sensitive scenarios to incorporate risk sensitivity by adding the realized variance of the value process. Additionally, I highlight that the conventional policy gradient representation is inadequate for risk-sensitive problems due to the nonlinear nature of quadratic variation; however, q-learning offers a solution and extends to infinite horizon settings. Finally, I prove the convergence of the proposed algorithm for Merton's investment problem and quantify the impact of the temperature parameter on the behavior of the learning procedure. I also conduct simulation experiments to demonstrate how risk-sensitive RL improves the finite-sample performance in the linear-quadratic control problem.
comment: 54 pages, 2 figures, 1 table
Quaternionic Pole Placement via Companion Forms and the Ackermann Formula
We present an extension of state-feedback pole placement for quaternionic systems, based on companion forms and the Ackermann formula. For controllable single-input quaternionic LTI models, we define a companion polynomial that annihilates its companion matrix, characterize spectra via right-eigenvalue similarity classes, and prove coefficient-matching design in controllable coordinates. We then derive a coordinate-free Ackermann gain expression valid for real target polynomials, and state its scope and limitations. Short examples demonstrate correctness, practical use, and numerical simplicity.
comment: 8 pages. Revised version resubmitted to IEEE Transactions on Automatic Control; proofs clarified, notation streamlined, and examples corrected. Co-funded by the European Union under the project ROBOPROX (reg. no. CZ.02.01.01/00/22_008/0004590)
Opinion Clustering under the Friedkin-Johnsen Model: Agreement in Disagreement
The convergence of opinions in the Friedkin-Johnsen (FJ) framework is well studied, but the topological conditions leading to opinion clustering remain less explored. To bridge this gap, we examine the role of topology in the emergence of opinion clusters within the network. The key contribution of the paper lies in the introduction of the notion of topologically prominent agents, referred to as Locally Topologically Persuasive (LTP) agents. Interestingly, each LTP agent is associated with a unique set of (non-influential) agents in its vicinity. Using them, we present conditions to obtain opinion clusters in the FJ framework in any arbitrarily connected digraph. A key advantage of the proposed result is that the resulting opinion clusters are independent of the edge weights and the stubbornness of the agents. Finally, we demonstrate using simulation results that, by suitably placing LTP agents, one can design networks that achieve any desired opinion clustering.
comment: Accepted for Presentation in the American Control Conference 2026
NDKF: A Neural-Enhanced Distributed Kalman Filter for Nonlinear Multi-Sensor Estimation
We propose a Neural-Enhanced Distributed Kalman Filter (NDKF) for multi-sensor state estimation in nonlinear systems. Unlike traditional Kalman filters that rely on explicit analytical models and assume centralized fusion, NDKF leverages neural networks to replace analytical process and measurement models with learned mappings while each node performs local prediction and update steps and exchanges only compact posterior summaries with its neighbors. This distributed design reduces communication overhead and avoids a central fusion bottleneck. We provide sufficient mean-square stability conditions under bounded Jacobians and well-conditioned innovations, together with practically checkable proxies such as Jacobian norm control and innovation monitoring. We also discuss consistency under learned-model mismatch, including covariance inflation and covariance-intersection fusion when cross-correlations are uncertain. Simulations on a 2D nonlinear system with four partially observing nodes show that NDKF outperforms a distributed EKF baseline under model mismatch and yields improved estimation accuracy with modest communication requirements.
comment: Accepted for publication in the Proceedings of the 2026 American Control Conference (ACC). This arXiv version includes a supplementary appendix that does not appear in the IEEE conference proceedings. An implementation of the NDKF is available in the GitHub repository accompanying this paper: https://github.com/sfarzan/NDKF
Resilient Chaotic Cross-Layer Routing for Smart Grid IoT Networks
This paper presents the Distributed Adaptive Multi-Radio Cross-Layer Routing (DAMCR) protocol, designed to enhance reliability, adaptability, and energy efficiency in smart grid and industrial Internet of Things (IoT) communication networks. DAMCR integrates Chaotic Frequency-Hopping Spread Spectrum (C-FHSS) to improve physical-layer security and jamming resilience with Link-Adaptive Quality Power Control (LAQPC) to dynamically regulate transmission power based on instantaneous link quality and residual node energy. To meet heterogeneous traffic requirements, the protocol incorporates priority-aware message classification that differentiates between periodic monitoring data and time-critical fault and protection messages. The proposed framework is implemented and evaluated in MATLAB using a heterogeneous network composed of LoRa, Wi-Fi, and dual-radio nodes operating under AWGN, Rayleigh, and Rician fading environments. Extensive simulation results demonstrate that DAMCR consistently achieves a Packet Delivery Ratio (PDR) exceeding 95% across all evaluated scenarios, while maintaining end-to-end latency between 17 and 23 ms, even in the presence of controlled jamming attacks. These results confirm that the tight integration of chaos-based spectrum agility, cross-technology routing, and energy-aware cross-layer adaptation significantly improves communication reliability, latency stability, and resilience compared to conventional single-radio and static-routing protocols.
Geometric Control Theory Over Networks: Minimal Node Cardinality Disturbance Decoupling Problems
In this paper we show how to formulate and solve disturbance decoupling problems over networks while choosing a minimal number of input and output nodes. Feedback laws that isolate and eliminate the impact of disturbance nodes on specific target nodes to be protected are provided using state, output, and dynamical feedback. For that, we leverage the fact that when reformulated in terms of sets of nodes rather than subspaces, the controlled and conditional invariance properties admit a simple graphical interpretation. For state and dynamical feedback, the minimal input and output cardinality solutions can be computed exactly in polynomial time, via min-cut/max-flow algorithms.
Frequency-Separable Hamiltonian Neural Network for Multi-Timescale Dynamics
While Hamiltonian mechanics provides a powerful inductive bias for neural networks modeling dynamical systems, Hamiltonian Neural Networks and their variants often fail to capture complex temporal dynamics spanning multiple timescales. This limitation is commonly linked to the spectral bias of deep neural networks, which favors learning low-frequency, slow-varying dynamics. Prior approaches have sought to address this issue through symplectic integration schemes that enforce energy conservation or by incorporating geometric constraints to impose structure on the configuration space. However, such methods either remain limited in their ability to fully capture multiscale dynamics or require substantial domain-specific assumptions. In this work, we exploit the observation that Hamiltonian functions admit decompositions into explicit fast and slow modes and can be reconstructed from these components. We introduce the Frequency-Separable Hamiltonian Neural Network (FS-HNN), which parameterizes the system Hamiltonian using multiple networks, each governed by Hamiltonian dynamics and trained on data sampled at distinct timescales. We further extend this framework to partial differential equations by learning state- and boundary-conditioned symplectic operators. Empirically, we show that FS-HNN improves long-horizon extrapolation performance on challenging dynamical systems and generalizes across a broad range of ODE and PDE problems.
Robotics
H-RINS: Hierarchical Tightly-coupled Radar-Inertial Navigation via Smoothing and Mapping
Millimeter-wave radar provides robust perception in visually degraded environments. However, radar-inertial state estimation is inherently susceptible to drift. Because radar yields only sparse, body-frame velocity measurements, it provides weak constraints on absolute orientation. Consequently, IMU biases remain poorly observable over the short time horizons typical of sliding-window filters. To address this fundamental observability challenge, we propose a tightly coupled, hierarchical radar-inertial factor graph framework. Our architecture decouples the estimation problem into a high-rate resetting graph and a persistent global graph. The resetting graph fuses IMU preintegration, radar velocities, and adaptive Zero-Velocity Updates (ZUPT) to generate the smooth, low-latency odometry required for real-time control. Concurrently, the persistent graph is a full-state factor graph maintaining the complete information of poses, velocities, and biases by fusing inertial data with keyframe-based geometric mapping and loop closures. Leveraging Incremental Smoothing and Mapping, the persistent graph can operate without explicit marginalization of variables, preserving their information while ensuring long-term bias observability. The cornerstone of our approach is a probabilistic tight-coupling mechanism: fully observable, optimized biases and their exact covariances are continuously injected from the persistent graph into the resetting graph's prior, effectively anchoring the high-rate estimator against integration drift. Extensive evaluations demonstrate that our system achieves high accuracy with drift-reduced estimation at 27x real-time execution speeds. We will release the implementation code and datasets upon acceptance of the paper.
comment: 8 pages, 5 figures, Submitted to conference
GelSphere: An Omnidirectional Rolling Vision-Based Tactile Sensor for Online 3D Reconstruction and Normal Force Estimation
We present GelSphere, a spherical vision-based tactile sensor designed for real-time continuous surface scanning. Unlike traditional vision-based tactile sensors that can only sense locally and are damaged when slid across surfaces, and cylindrical tactile sensors that can only roll along a fixed direction, our design enables omnidirectional rolling on surfaces. We accomplish this through our novel sensing system design, which has steel balls inside the sensor, forming a bearing layer between the gel and the rigid housing that allows rolling motion in all axes. The sensor streams tactile images through Wi-Fi, with online large-surface reconstruction capabilities. We present quantitative results for both reconstruction accuracy and image fusion performance. The results show that our sensor maintains geometric fidelity and high reconstruction accuracy even under multi-directional rolling, enabling uninterrupted surface scanning.
Stiffness Copilot: An Impedance Policy for Contact-Rich Teleoperation
In teleoperation of contact-rich manipulation tasks, selecting robot impedance is critical but difficult. The robot must be compliant to avoid damaging the environment, but stiff to remain responsive and to apply force when needed. In this paper, we present Stiffness Copilot, a vision-based policy for shared-control teleoperation in which the operator commands robot pose and the policy adjusts robot impedance online. To train Stiffness Copilot, we first infer direction-dependent stiffness matrices in simulation using privileged contact information. We then use these matrices to supervise a lightweight vision policy that predicts robot stiffness from wrist-camera images and transfers zero-shot to real images at runtime. In a human-subject study, Stiffness Copilot achieved safety comparable to using a constant low stiffness while matching the efficiency of using a constant high stiffness.
comment: Project website: https://stiffness-copilot.github.io
Amortizing Trajectory Diffusion with Keyed Drift Fields
Diffusion-based trajectory planners can synthesize rich, multimodal action sequences for offline reinforcement learning, but their iterative denoising incurs substantial inference-time cost, making closed-loop planning slow under tight compute budgets. We study the problem of achieving diffusion-like trajectory planning behavior with one-step inference, while retaining the ability to sample diverse candidate plans and condition on the current state in a receding-horizon control loop. Our key observation is that conditional trajectory generation fails under naïve distribution-matching objectives when the similarity measure used to align generated trajectories with the dataset is dominated by unconstrained future dimensions. In practice, this causes attraction toward average trajectories, collapses action diversity, and yields near-static behavior. Our key insight is that conditional generative planning requires a conditioning-aware notion of neighborhood: trajectory updates should be computed using distances in a compact key space that reflects the condition, while still applying updates in the full trajectory space. Building on this, we introduce Keyed Drifting Policies (KDP), a one-step trajectory generator trained with a drift-field objective that attracts generated trajectories toward condition-matched dataset windows and repels them from nearby generated samples, using a stop-gradient drifted target to amortize iterative refinement into training. At inference, the resulting policy produces a full trajectory window in a single forward pass. Across standard RL benchmarks and real-time hardware deployments, KDP achieves strong performance with one-step inference and substantially lower planning latency than diffusion sampling. Project website, code and videos: https://keyed-drifting.github.io/
Distributional Uncertainty and Adaptive Decision-Making in System
Complex engineered systems require coordinated design choices across heterogeneous components under multiple conflicting objectives and uncertain specifications. Monotone co-design provides a compositional framework for such problems by modeling each subsystem as a design problem: a feasible relation between provided functionalities and required resources in partially ordered sets. Existing uncertain co-design models rely on interval bounds, which support worst-case reasoning but cannot represent probabilistic risk or multi-stage adaptive decisions. We develop a distributional extension of co-design that models uncertain design outcomes as distributions over design problems and supports adaptive decision processes through Markov-kernel re-parameterizations. Using quasi-measurable and quasi-universal spaces, we show that the standard co-design interconnection operations remain compositional under this richer notion of uncertainty. We further introduce queries and observations that extract probabilistic design trade-offs, including feasibility probabilities, confidence bounds, and distributions of minimal required resources. A task-driven unmanned aerial vehicle case study illustrates how the framework captures risk-sensitive and information-dependent design choices that interval-based models cannot express.
URDF-Anything+: Autoregressive Articulated 3D Models Generation for Physical Simulation
Articulated objects are fundamental for robotics, physics simulation, and interactive virtual environments. However, reconstructing them from visual input remains challenging, as it requires jointly inferring both part geometry and kinematic structure. We present URDF-Anything+, an end-to-end autoregressive framework that directly generates executable articulated object models from visual observations. Given image and object-level 3D cues, our method sequentially produces part geometries and their associated joint parameters, resulting in complete URDF models without reliance on multi-stage pipelines. The generation proceeds until the model determines that all parts have been produced, automatically inferring complete geometry and kinematics. Building on this capability, we enable a new Real-Follow-Sim paradigm, where high-fidelity digital twins constructed from visual observations allow policies trained and tested purely in simulation to transfer to real robots without online adaptation. Experiments on large-scale articulated object benchmarks and real-world robotic tasks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter accuracy, and physical executability.
Vision-guided Autonomous Dual-arm Extraction Robot for Bell Pepper Harvesting
Agricultural robotics has emerged as a critical solution to the labor shortages and rising costs associated with manual crop harvesting. Bell pepper harvesting, in particular, is a labor-intensive task, accounting for up to 50% of total production costs. While automated solutions have shown promise in controlled greenhouse environments, harvesting in unstructured outdoor farms remains an open challenge due to environmental variability and occlusion. This paper presents VADER (Vision-guided Autonomous Dual-arm Extraction Robot), a dual-arm mobile manipulation system designed specifically for the autonomous harvesting of bell peppers in outdoor environments. The system integrates a robust perception pipeline coupled with a dual-arm planning framework that coordinates a gripping arm and a cutting arm for extraction. We validate the system through trials in various realistic conditions, demonstrating a harvest success rate exceeding 60% with a cycle time of under 100 seconds per fruit, while also featuring a teleoperation fail-safe based on the GELLO teleoperation framework to ensure robustness. To support robust perception, we contribute a hierarchically structured dataset of over 3,200 images spanning indoor and outdoor domains, pairing wide-field scene images with close-up pepper images to enable a coarse-to-fine training strategy from fruit detection to high-precision pose estimation. The code and dataset will be made publicly available upon acceptance.
comment: 9 pages; first four authors have equal contribution
ToMPC: Task-oriented Model Predictive Control via ADMM for Safe Robotic Manipulation
This paper proposes a task-oriented model predictive control (ToMPC) framework for safe and efficient robotic manipulation in open workspaces. The framework unifies collision-free motion and robot-environment interaction to address diverse scenarios. Additionally, it introduces task-oriented obstacle avoidance that leverages kinematic redundancy to enhance manipulation efficiency in obstructed environments. This complex optimization problem is solved by the alternating direction method of multipliers (ADMM), which decomposes the problem into two subproblems tackled by differential dynamic programming (DDP) and quadratic programming (QP), respectively. The effectiveness of this approach is validated in simulation and hardware experiments on a Franka Panda robotic manipulator. Results demonstrate that the framework can plan motion and/or force trajectories in real time, maximize the manipulation range while avoiding obstacles, and strictly adhere to safety-related hard constraints.
comment: 8 pages, 10 figures, accepted by IEEE Robotics and Automation Letters (RAL)
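The ADMM splitting mentioned in the abstract above can be illustrated on a toy problem. The sketch below is not the paper's DDP/QP decomposition; it assumes a much simpler objective (projecting a target vector onto a box) purely to show the alternating x-update, z-update, and dual update. The function name and the parameters `rho` and `iterations` are hypothetical.

```python
from typing import List, Sequence


def admm_box_projection(target: Sequence[float], lower: float, upper: float,
                        rho: float = 1.0, iterations: int = 100) -> List[float]:
    """Solve min 0.5*||x - target||^2 subject to lower <= x <= upper with ADMM.

    The problem is split as f(x) = 0.5*||x - target||^2 and g(z) = box
    indicator, coupled by the consensus constraint x = z.
    """
    n = len(target)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n  # scaled dual variable
    for _ in range(iterations):
        # x-update: closed-form minimizer of the smooth quadratic subproblem
        x = [(a + rho * (zi - ui)) / (1.0 + rho) for a, zi, ui in zip(target, z, u)]
        # z-update: Euclidean projection of x + u onto the box
        z = [min(max(xi + ui, lower), upper) for xi, ui in zip(x, u)]
        # dual update: accumulate the consensus residual x - z
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
    return z
```

In this toy case the answer is simply the clipped target, which makes it easy to check that the iteration converges before trying the same pattern on a harder split.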
SmoothVLA: Aligning Vision-Language-Action Models with Physical Constraints via Intrinsic Smoothness Optimization
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. However, existing post-training methods face a dilemma between stability and exploration: Supervised Fine-Tuning (SFT) is constrained by demonstration quality and lacks generalization, whereas Reinforcement Learning (RL) improves exploration but often induces erratic, jittery trajectories that violate physical constraints. To bridge this gap, we propose SmoothVLA, a novel reinforcement learning fine-tuning framework that synergistically optimizes task performance and motion smoothness. The technical core is a physics-informed hybrid reward function that integrates binary sparse task rewards with a continuous dense term derived from trajectory jerk. Crucially, this reward is intrinsic: it is computed directly from policy rollouts, without requiring extrinsic environment feedback or laborious reward engineering. Leveraging Group Relative Policy Optimization (GRPO), SmoothVLA establishes trajectory smoothness as an explicit optimization prior, guiding the model toward physically feasible and stable control. Extensive experiments on the LIBERO benchmark demonstrate that SmoothVLA outperforms standard RL by 13.8\% in smoothness and significantly surpasses SFT in generalization across diverse tasks. Our work offers a scalable approach to aligning VLA models with physical-world constraints through intrinsic reward optimization.
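As a rough illustration of the hybrid reward described above, the sketch below combines a binary task reward with a dense penalty on finite-difference jerk. The function names, the weight `smooth_weight`, the time step `dt`, and the mean-squared-jerk form are assumptions made for illustration; the abstract does not give the exact formula.

```python
from typing import List, Sequence


def finite_difference(values: Sequence[float], dt: float) -> List[float]:
    """First-order finite difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def mean_squared_jerk(positions: Sequence[float], dt: float) -> float:
    """Mean squared jerk (third time derivative) of a 1-D trajectory."""
    jerk = list(positions)
    for _ in range(3):  # differentiate three times: position -> jerk
        jerk = finite_difference(jerk, dt)
    if not jerk:
        return 0.0  # fewer than four samples: no jerk estimate, treat as zero
    return sum(j * j for j in jerk) / len(jerk)


def hybrid_reward(success: bool, positions: Sequence[float], dt: float = 0.1,
                  smooth_weight: float = 0.01) -> float:
    """Binary sparse task reward minus a weighted jerk penalty (hypothetical form)."""
    task_reward = 1.0 if success else 0.0
    return task_reward - smooth_weight * mean_squared_jerk(positions, dt)
```

Because the penalty depends only on the rolled-out trajectory, it can be evaluated without any extra signal from the environment, which is the sense in which the abstract calls the reward intrinsic.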
Data-Driven Autoregressive Power Prediction for GTernal Robots in the Robotarium
Energy-aware algorithms for multi-robot systems require accurate power consumption models, yet existing approaches rely on kinematic approximations that fail to capture the complex dynamics of real hardware. We present a lightweight autoregressive predictor for the GTernal mobile robot platform deployed in the Georgia Tech Robotarium. Through analysis of 48,000 samples collected across six motion trials, we discover that power consumption exhibits strong temporal autocorrelation ($ρ_1 = 0.95$) that dominates kinematic effects. A 7,041-parameter multi-layer perceptron (MLP) achieves $R^2 = 0.90$ on held-out motion patterns by conditioning on recent power history, reaching the theoretical prediction ceiling imposed by measurement noise. Physical validation across seven robots in a collision avoidance scenario yields mean $R^2 = 0.87$, demonstrating zero-shot transfer to unseen robots and behaviors. The predictor runs in 224 $μ$s per inference, enabling real-time deployment at 150$\times$ the platform's 30 Hz control rate. We release the trained model and dataset to support energy-aware multi-robot algorithm development.
comment: 8 pages, 5 figures
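Two quantities in the abstract above are easy to reproduce in a few lines: the lag-1 autocorrelation of a power trace and the real-time margin implied by the reported inference time. The sketch below uses the standard sample autocorrelation estimator; the function names are hypothetical and no data from the paper is used.

```python
from typing import Sequence


def lag1_autocorrelation(samples: Sequence[float]) -> float:
    """Sample lag-1 autocorrelation of a series."""
    n = len(samples)
    mean = sum(samples) / n
    denom = sum((x - mean) ** 2 for x in samples)
    if denom == 0.0:
        raise ValueError("constant series has undefined autocorrelation")
    num = sum((samples[t] - mean) * (samples[t + 1] - mean) for t in range(n - 1))
    return num / denom


def realtime_margin(inference_seconds: float, control_rate_hz: float) -> float:
    """How many inferences fit into one control period."""
    return (1.0 / inference_seconds) / control_rate_hz
```

With the reported 224 microseconds per inference and a 30 Hz control loop, `realtime_margin(224e-6, 30.0)` evaluates to about 149, consistent with the roughly 150x figure quoted in the abstract.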
LineMaster Pro: A Low-Cost Intelligent Line Following Robot with PID Control and Ultrasonic Obstacle Avoidance for Educational Robotics
Line following robots are fundamental platforms in robotics education, yet commercially available solutions remain prohibitively expensive (\$150-\$300) while lacking integrated obstacle detection capabilities essential for real-world applications. This paper presents LineMaster Pro, an intelligent low-cost line following robot implemented on an Arduino Nano platform that integrates dual TCRT5000 infrared sensors for precision line tracking, an HC-SR04 ultrasonic sensor for real-time obstacle detection, a digitally tuned PID controller with Ziegler-Nichols optimization, and a hierarchical finite state machine for robust obstacle avoidance. A systematic four-phase sensor calibration methodology ensures reliable operation across varying lighting and surface conditions. Experimental validation through 200 controlled trials and 72-hour continuous operation demonstrates mean tracking accuracy of 1.18 cm at 0.4 m/s (95\% CI [1.06, 1.30]), obstacle detection reliability of 96.7\% within 10-40 cm range with 0.7\% false positive rate, and 94\% successful recovery from path deviations. The PID implementation achieves 43\% improvement over conventional on-off control ($p<0.001$). At a total hardware cost of \$28.50 based on verified Bangladesh market prices, LineMaster Pro achieves a 94\% cost reduction compared to commercial alternatives, establishing a practical benchmark for accessible robotics education in resource-constrained environments.
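A minimal sketch of the kind of PID loop the abstract above describes, written in Python for readability rather than as Arduino firmware. The gain formulas are the classic Ziegler-Nichols PID rules; the class and function names, the sign convention for steering, and the 0-255 PWM range are assumptions, not details taken from the paper.

```python
from typing import Tuple


def ziegler_nichols_pid(ku: float, tu: float) -> Tuple[float, float, float]:
    """Classic Ziegler-Nichols PID gains from ultimate gain ku and period tu."""
    kp = 0.6 * ku
    ki = 1.2 * ku / tu      # kp / (tu / 2)
    kd = 0.075 * ku * tu    # kp * (tu / 8)
    return kp, ki, kd


class PID:
    """Discrete PID controller that keeps the integral and the last error."""

    def __init__(self, kp: float, ki: float, kd: float) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def mix_motor_speeds(base: int, correction: float, max_pwm: int = 255) -> Tuple[int, int]:
    """Turn a steering correction into clamped (left, right) PWM values."""
    def clamp(value: int) -> int:
        return max(0, min(max_pwm, value))

    left = int(round(base + correction))
    right = int(round(base - correction))
    return clamp(left), clamp(right)
```

In a line follower the `error` would typically be the difference between the two infrared sensor readings, and the PID output would be fed to `mix_motor_speeds` once per control cycle.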
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.
comment: Preprint
Path-conditioned Reinforcement Learning-based Local Planning for Long-Range Navigation
Long-range navigation is commonly addressed through hierarchical pipelines in which a global planner generates a path that is decomposed into waypoints and followed sequentially by a local planner. These systems are sensitive to global path quality, as inaccurate remote sensing data can result in locally infeasible waypoints, which degrade local execution. At the same time, the limited global context available to the local planner hinders long-range efficiency. To address this issue, we propose a reinforcement learning-based local navigation policy that leverages path information as contextual guidance. The policy is conditioned on reference path observations and trained with a reward function mainly based on goal-reaching objectives, without any explicit path-following reward. Through this implicit conditioning, the policy learns to opportunistically exploit path information while remaining robust to misleading or degraded guidance. Experimental results show that the proposed approach significantly improves navigation efficiency when high-quality paths are available and maintains baseline-level performance when path observations are severely degraded or even non-existent. These properties make the method particularly well-suited for long-range navigation scenarios in which high-level plans are approximate and local execution must remain adaptive to uncertainty.
Benchmarking the Energy Cost of Assurance in Neuromorphic Edge Robotics
Deploying trustworthy artificial intelligence on edge robotics imposes a difficult trade-off between high-assurance robustness and energy sustainability. Traditional defense mechanisms against adversarial attacks typically incur significant computational overhead, threatening the viability of power-constrained platforms in environments such as cislunar space. This paper quantifies the energy cost of assurance in event-driven neuromorphic systems. We benchmark the Hierarchical Temporal Defense (HTD) framework on the BrainChip Akida AKD1000 processor against a suite of adversarial temporal attacks. We demonstrate that unlike traditional deep learning defenses which often degrade efficiency significantly with increased robustness, the event-driven nature of the proposed architecture achieves a superior trade-off. The system reduces gradient-based adversarial success rates from 82.1% to 18.7% and temporal jitter success rates from 75.8% to 25.1%, while maintaining an energy consumption of approximately 45 microjoules per inference. We report a counter-intuitive reduction in dynamic power consumption in the fully defended configuration, attributed to volatility-gated plasticity mechanisms that induce higher network sparsity. These results provide empirical evidence that neuromorphic sparsity enables sustainable and high-assurance edge autonomy.
comment: 6 pages, 4 figures. Accepted and presented at the STEAR 2026 Workshop on Sustainable and Trustworthy Edge AI for Robotics, HiPEAC 2026, Krakow, Poland
TransDex: Pre-training Visuo-Tactile Policy with Point Cloud Reconstruction for Dexterous Manipulation of Transparent Objects
Dexterous manipulation enables complex tasks but suffers from self-occlusion, severe depth noise, and depth information loss when manipulating transparent objects. To solve this problem, this paper proposes TransDex, a 3D visuo-tactile fusion motor policy based on point cloud reconstruction pre-training. Specifically, we first propose a self-supervised point cloud reconstruction pre-training approach based on a Transformer. This method accurately recovers the 3D structure of objects from interactive point clouds of dexterous hands, even when random noise and large-scale masking are added. Building on this, we construct TransDex, in which perceptual encoding adopts a fine-grained hierarchical scheme and multi-round attention mechanisms adaptively fuse features of the robotic arm and dexterous hand to enable differentiated motion prediction. Results from transparent object manipulation experiments conducted on a real robotic system demonstrate that TransDex outperforms existing baseline methods. Further analysis validates the generalization capabilities of TransDex and the effectiveness of its individual components.
comment: Project page: https://transdex.github.io/
LDHP: Library-Driven Hierarchical Planning for Non-prehensile Dexterous Manipulation
Non-prehensile manipulation is essential for handling thin, large, or otherwise ungraspable objects in unstructured settings. Prior planning and search-based methods often rely on ad-hoc manual designs or generate physically unrealizable motions by ignoring critical gripper properties, while training-based approaches are data-intensive and struggle to generalize to novel, out-of-distribution tasks. We propose a library-driven hierarchical planner (LDHP) that makes executability a first-class design goal: a top-tier contact-state planner proposes object-pose paths using MoveObject primitives, and a bottom-tier grasp planner synthesizes feasible grasp sequences with AdjustGrasp primitives; feasibility is certified by collision checks and quasi-static mechanics, and contact-sensitive segments are recovered via a bounded dichotomy refinement. This gripper-aware decomposition decouples object motion from grasp realizability, yields a task-agnostic pipeline that transfers across manipulation tasks and geometric variations without re-design, and exposes clean hooks for optional learned priors. Real-robot studies on zero-mobility lifting and slot insertion demonstrate consistent execution and robustness to shape and environment changes.
comment: 9 pages
Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) through sequential fine-tuning. However, such a paradigm remains suboptimal: sequential RL fine-tuning can introduce policy drift and often leads to a performance ceiling due to its dependence on the pretrained IL policy. To address these issues, we propose PaIR-Drive, a general Parallel framework for collaborative Imitation and Reinforcement learning in end-to-end autonomous driving. During training, PaIR-Drive separates IL and RL into two parallel branches with conflict-free training objectives, enabling fully collaborative optimization. This design eliminates the need to retrain RL when applying a new IL policy. During inference, RL leverages the IL policy to further optimize the final plan, allowing performance beyond prior knowledge of IL. Furthermore, we introduce a tree-structured trajectory neural sampler to group relative policy optimization (GRPO) in the RL branch, which enhances exploration capability. Extensive analysis on the NAVSIMv1 and v2 benchmarks demonstrates that PaIR-Drive achieves competitive performance of 91.2 PDMS and 87.9 EPDMS, building upon Transfuser and DiffusionDrive IL baselines. PaIR-Drive consistently outperforms existing RL fine-tuning methods, and could even correct human experts' suboptimal behaviors. Qualitative results further confirm that PaIR-Drive can effectively explore and generate high-quality trajectories.
comment: 8 pages, 7 figures, 6 tables
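The abstract above relies on group relative policy optimization (GRPO), which scores each sampled trajectory relative to the other samples drawn for the same scene. The sketch below shows only that normalization step, in its commonly used mean-and-standard-deviation form; it does not reproduce the paper's tree-structured sampler, and the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and spread of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A trajectory that scores above its group average receives a positive advantage and one below receives a negative advantage, so no separate value network is needed to form the baseline.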
ImagiNav: Scalable Embodied Navigation via Generative Visual Prediction and Inverse Dynamics
Enabling robots to navigate open-world environments via natural language is critical for general-purpose autonomy. Yet, Vision-Language Navigation has relied on end-to-end policies trained on expensive, embodiment-specific robot data. While recent foundation models trained on vast simulation data show promise, the challenge of scaling and generalizing due to the limited scene diversity and visual fidelity in simulation persists. To address this gap, we propose ImagiNav, a novel modular paradigm that decouples visual planning from robot actuation, enabling the direct utilization of diverse in-the-wild navigation videos. Our framework operates as a hierarchy: a Vision-Language Model first decomposes instructions into textual subgoals; a finetuned generative video model then imagines the future video trajectory towards that subgoal; finally, an inverse dynamics model extracts the trajectory from the imagined video, which can then be tracked via a low-level controller. We additionally develop a scalable data pipeline of in-the-wild navigation videos auto-labeled via inverse dynamics and a pretrained Vision-Language Model. ImagiNav demonstrates strong zero-shot transfer to robot navigation without requiring robot demonstrations, paving the way for generalist robots that learn navigation directly from unlabeled, open-world data.
GraspADMM: Improving Dexterous Grasp Synthesis via ADMM Optimization
Synthesizing high-quality dexterous grasps is a fundamental challenge in robot manipulation, requiring adherence to diversity, kinematic feasibility (valid hand-object contact without penetration), and dynamic stability (secure multi-contact forces). The recent framework Dexonomy successfully ensures broad grasp diversity through dense sampling and improves kinematic feasibility via a simulator-based refinement method that excels at resolving exact collisions. However, its reliance on fixed contact points restricts the hand's reachability and prevents the optimization of grasp metrics for dynamic stability. Conversely, purely gradient-based optimizers can maximize dynamic stability but rely on simplified contact approximations that inevitably cause physical penetrations. To bridge this gap, we propose GraspADMM, a novel grasp synthesis framework that preserves sampling-based diversity while improving kinematic feasibility and dynamic stability. By formulating the refinement stage using the Alternating Direction Method of Multipliers (ADMM), we decouple the target contact points on the object from the actual contact locations on the hand. This decomposition allows the pipeline to alternate between updating the target object points to directly maximize dynamic grasp metrics, and adjusting the hand pose to physically reach these targets while strictly respecting collision boundaries. Extensive experiments demonstrate that GraspADMM significantly outperforms state-of-the-art baselines, achieving a nearly 15\% absolute improvement in grasp success rate for type-unaware synthesis and roughly a 100\% relative improvement in type-aware synthesis. Furthermore, our approach maintains robust, physically plausible grasp generation even under extreme low-friction conditions.
ArrayTac: A tactile display for simultaneous rendering of shape, stiffness and friction
Human-computer interaction in the visual and auditory domains has achieved considerable maturity, yet machine-to-human tactile feedback remains underdeveloped. Existing tactile displays struggle to simultaneously render multiple tactile dimensions, such as shape, stiffness, and friction, which limits the realism of haptic simulation. Here, we present ArrayTac, a piezoelectric-driven tactile display capable of simultaneously rendering shape, stiffness, and friction to reproduce realistic haptic signals. The system comprises a 4x4 array of 16 actuator units, each employing a three-stage micro-lever mechanism to amplify the micrometer-scale displacement of the piezoelectric element, with Hall sensor-based closed-loop control at the end effector to enhance response speed and precision. We further implement two end-to-end pipelines: 1) a vision-to-touch framework that converts visual inputs into tactile signals using multimodal foundation models, and 2) a real-time tele-palpation system operating over distances of several thousand kilometers. In user studies, first-time participants accurately identify object shapes and physical properties with high success rates. In a tele-palpation experiment over 1,000 km, untrained volunteers correctly identified both the number and type of tumors in a breast phantom with 100% accuracy and precisely localized their positions. The system pioneers a new pathway for high-fidelity haptic feedback by introducing the unprecedented capability to simultaneously render an object's shape, stiffness, and friction, delivering a holistic tactile experience that was previously unattainable.
Building Explicit World Model for Zero-Shot Open-World Object Manipulation
Open-world object manipulation remains a fundamental challenge in robotics. While Vision-Language-Action (VLA) models have demonstrated promising results, they rely heavily on large-scale robot action demonstrations, which are costly to collect and can hinder out-of-distribution generalization. In this paper, we propose an explicit-world-model-based framework for open-world manipulation that achieves zero-shot generalization by constructing a physically grounded digital twin of the environment. The framework integrates open-set perception, digital-twin reconstruction, and sampling and evaluation of interaction strategies. By constructing a digital twin of the environment, our approach efficiently explores and evaluates manipulation strategies in a physics-enabled simulator and reliably deploys the chosen strategy to the real world. Experimentally, the proposed framework is able to perform multiple open-set manipulation tasks without any task-specific action demonstrations, demonstrating strong zero-shot generalization at both the task and object levels. Project Page: https://bojack-bj.github.io/projects/thesis/
ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation
Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA converts 2D guidance into 3D trajectories and generates smooth spatial masks that capture 4D spatio-temporal context, providing a stable interface between semantic reasoning and continuous control. To enable effective learning of such representations, we introduce ST-Human, a large-scale human manipulation dataset with 14 tasks and 300k episodes, annotated with 2D, 3D, and 4D supervision via a semi-automated pipeline. Using ST-Human, we train ST-VLM, a spatio-temporal vision-language model that generates spatially grounded and temporally coherent 3D representations to guide policy execution. The smooth spatial masks focus on task-relevant geometry and stabilize latent representations, enabling online replanning and long-horizon reasoning. Experiments on RLBench and real-world manipulation tasks show that ST-VLA significantly outperforms state-of-the-art baselines, improving zero-shot success rates by 44.6% and 30.3%. These results demonstrate that offloading spatio-temporal reasoning to VLMs with unified 3D-4D representations substantially improves robustness and generalization for open-world robotic manipulation. Project website: https://oucx117.github.io/ST-VLA/.
comment: 25 pages, under review
Robust Sim-to-Real Cloth Untangling through Reduced-Resolution Observations via Adaptive Force-Difference Quantization
Robotic cloth untangling requires progressively disentangling fabric by adapting pulling actions to changing contact and tension conditions. Because large-scale real-world training is impractical due to cloth damage and hardware wear, sim-to-real policy transfer is a promising solution. However, cloth manipulation is highly sensitive to interaction dynamics, and policies that depend on precise force magnitudes often fail after transfer because similar force responses cannot be reproduced due to the reality gap. We observe that untangling is largely characterized by qualitative tension transitions rather than exact force values. This indicates that directly minimizing the sim-to-real gap in raw force measurements does not necessarily align with the task structure. We therefore hypothesize that emphasizing coarse force-change patterns while suppressing fine environment-dependent variations can improve robustness of sim-to-real transfer. Based on this insight, we propose Adaptive Force-Difference Quantization (ADQ), which reduces observation resolution by representing force inputs as discretized temporal differences and learning state-dependent quantization thresholds adaptively. This representation mitigates overfitting to environment-specific force characteristics and facilitates direct sim-to-real transfer. Experiments in both simulation and real-world cloth untangling demonstrate that ADQ achieves higher success rates and exhibits greater robustness in sim-to-real transfer than policies using raw force inputs. Supplementary video is available at https://youtu.be/ZeoBs-t0AWc
comment: under review
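The core idea in the abstract above, representing force inputs as discretized temporal differences, can be sketched as follows. This simplified version assumes a single fixed threshold and three output symbols, whereas the paper learns state-dependent thresholds adaptively; the function name and the symbol set are hypothetical.

```python
from typing import List, Sequence


def quantize_force_differences(forces: Sequence[float], threshold: float) -> List[int]:
    """Map each consecutive force change to -1 (drop), 0 (steady), or +1 (rise)."""
    symbols = []
    for previous, current in zip(forces, forces[1:]):
        delta = current - previous
        if delta > threshold:
            symbols.append(1)
        elif delta < -threshold:
            symbols.append(-1)
        else:
            symbols.append(0)
    return symbols


# Two traces with the same qualitative tension pattern but different magnitudes,
# standing in for a simulated and a real force sensor (made-up values).
simulated = [0.0, 0.1, 2.0, 1.9, 0.0]
real = [0.3, 0.45, 2.6, 2.4, 0.2]
```

Both traces quantize to the same symbol sequence with a threshold of 0.5, which illustrates why a coarse force-change representation can be less sensitive to the reality gap than raw force magnitudes.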
Your Vision-Language-Action Model Already Has Attention Heads For Path Deviation Detection
Vision-Language-Action (VLA) models have demonstrated strong potential for predicting semantic actions in navigation tasks, showing the ability to reason over complex linguistic instructions and visual contexts. However, they are fundamentally hindered by visual-reasoning hallucinations that lead to trajectory deviations. Addressing this issue has conventionally required training external critic modules or relying on complex uncertainty heuristics. In this work, we discover that monitoring a few attention heads within a frozen VLA model can accurately detect path deviations without incurring additional computational overhead. We refer to these heads, which inherently capture the spatiotemporal causality between historical visual sequences and linguistic instructions, as Navigation Heads. Using these heads, we propose an intuitive, training-free anomaly-detection framework that monitors their signals to detect hallucinations in real time. Surprisingly, among over a thousand attention heads, a combination of just three is sufficient to achieve a 44.6% deviation detection rate with a low false-positive rate of 11.7%. Furthermore, upon detecting a deviation, we bypass the heavy VLA model and trigger a lightweight Reinforcement Learning (RL) policy to safely execute a shortest-path rollback. By integrating this entire detection-to-recovery pipeline onto a physical robot, we demonstrate its practical robustness. All source code will be publicly available.
comment: Keywords: Vision-Language Action (VLA), Reinforcement Learning (RL), Navigation Path Recovery, Robot Operating System (ROS)
KoopmanFlow: Spectrally Decoupled Generative Control Policy via Koopman Structural Bias
Generative Control Policies (GCPs) show immense promise in robotic manipulation but struggle to simultaneously model stable global motions and high-frequency local corrections. While modern architectures extract multi-scale spatial features, their underlying Probability Flow ODEs apply a uniform temporal integration schedule. Compressed to a single step for real-time Receding Horizon Control (RHC), uniform ODE solvers mathematically smooth over sparse, high-frequency transients entangled within low-frequency steady states. To decouple these dynamics without accumulating pipelined errors, we introduce KoopmanFlow, a parameter-efficient generative policy guided by a Koopman-inspired structural inductive bias. Operating in a unified multimodal latent space with visual context, KoopmanFlow bifurcates generation at the terminal stage. Because visual conditioning occurs before spectral decomposition, both branches are visually guided yet temporally specialized. A macroscopic branch anchors slow-varying trajectories via single-step Consistency Training, while a transient branch uses Flow Matching to isolate high-frequency residuals stimulated by sudden visual cues (e.g., contacts or occlusions). Guided by an explicit spectral prior and optimized via a novel asymmetric consistency objective, KoopmanFlow establishes a fused co-training mechanism. This allows the variant branch to absorb localized dynamics without multi-stage error accumulation. Extensive experiments show KoopmanFlow significantly outperforms state-of-the-art baselines in contact-rich tasks requiring agile disturbance rejection. By trading a surplus latency buffer for a richer structural prior, KoopmanFlow achieves superior control fidelity and parameter efficiency within real-time deployment limits.
Exploration-assisted Bottleneck Transition Toward Robust and Data-efficient Deformable Object Manipulation
Imitation learning has demonstrated impressive results in robotic manipulation but fails under out-of-distribution (OOD) states. This limitation is particularly critical in Deformable Object Manipulation (DOM), where the near-infinite possible configurations render comprehensive data collection infeasible. Although several methods address OOD states, they typically require exhaustive data or highly precise perception. Such requirements are often impractical for DOM owing to its inherent complexities, including self-occlusion. To address the OOD problem in DOM, we propose a novel framework, Exploration-assisted Bottleneck Transition for Deformable Object Manipulation (ExBot), which addresses the OOD challenge through two key advantages. First, we introduce bottleneck states, standardized configurations that serve as starting points for task execution. This enables the reconceptualization of OOD challenges as the problem of transitioning diverse initial states to these bottleneck states, significantly reducing demonstration requirements. Second, to account for imperfect perception, we partition the OOD state space based on recognizability and employ dual action primitives. This approach enables ExBot to manipulate even unrecognizable states without requiring accurate perception. By concentrating demonstrations around bottleneck states and leveraging exploration to alter perceptual conditions, ExBot achieves both data efficiency and robustness to severe OOD scenarios. Real-world experiments on rope and cloth manipulation demonstrate successful task completion from diverse OOD states, including severe self-occlusions.
Multi-Robot Coordination for Planning under Context Uncertainty
Real-world robots often operate in settings where objective priorities depend on the underlying context of operation. When the underlying context is unknown a priori, multiple robots may have to coordinate to gather informative observations to infer the context, since acting based on an incorrect context can lead to misaligned and unsafe behavior. Once the underlying true context is inferred, the robots optimize their task-specific objectives in the preference order induced by the context. We formalize this problem as a Multi-Robot Context-Uncertain Stochastic Shortest Path (MR-CUSSP), which captures context-relevant information at landmark states through joint observations. Our two-stage solution approach is composed of: (1) CIMOP (Coordinated Inference for Multi-Objective Planning) to compute plans that guide robots toward informative landmarks to efficiently infer the true context, and (2) LCBS (Lexicographic Conflict-Based Search) for collision-free multi-robot path planning with lexicographic objective preferences, induced by the context. We evaluate the algorithms using three simulated domains and demonstrate their practical applicability using five mobile robots in the salp domain setup.
comment: 8 pages, 6 figures
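The lexicographic objective preferences mentioned in the abstract above mean that a higher-priority objective is never traded off against a lower-priority one. The sketch below shows only that comparison rule, using Python's built-in tuple ordering; it is not the LCBS algorithm, and the plan structure, objective names, and cost values are made up for illustration.

```python
from typing import Dict, List, Sequence


def lexicographic_best(plans: List[Dict], preference_order: Sequence[str]) -> Dict:
    """Pick the plan whose cost vector is smallest in lexicographic order."""
    def key(plan: Dict) -> tuple:
        # Tuples compare element by element, so earlier objectives dominate.
        return tuple(plan["costs"][name] for name in preference_order)

    return min(plans, key=key)


# Hypothetical candidate plans with two objectives.
plans = [
    {"name": "A", "costs": {"risk": 0.0, "time": 12.0}},
    {"name": "B", "costs": {"risk": 1.0, "time": 8.0}},
]
```

With the order `["risk", "time"]` plan A is selected, and with `["time", "risk"]` plan B is selected, which is how a change of inferred context can change the preferred plan without changing the candidate set.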
Implicit Maximum Likelihood Estimation for Real-time Generative Model Predictive Control
Diffusion-based models have recently shown strong performance in trajectory planning, as they are capable of capturing diverse, multimodal distributions of complex behaviors. A key limitation of these models is their slow inference speed, which results from the iterative denoising process. This makes them less suitable for real-time applications such as closed-loop model predictive control (MPC), where plans must be generated quickly and adapted continuously to a changing environment. In this paper, we investigate Implicit Maximum Likelihood Estimation (IMLE) as an alternative generative modeling approach for planning. IMLE offers strong mode coverage while enabling inference that is two orders of magnitude faster, making it particularly well suited for real-time MPC tasks. Our results demonstrate that IMLE achieves competitive performance on standard offline reinforcement learning benchmarks compared to the standard diffusion-based planner, while substantially improving planning speed in both open-loop and closed-loop settings. We further validate IMLE in a closed-loop human navigation scenario, operating in real-time, demonstrating how it enables rapid and adaptive plan generation in dynamic environments.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. Project page: https://kir-.github.io/GMPC-IMLE/
LPV-MPC for Lateral Control in Full-Scale Autonomous Racing
Autonomous racing has attracted significant attention recently, presenting challenges in selecting an optimal controller that operates within the onboard system's computational limits and meets operational constraints such as limited track time and high costs. This paper introduces a Linear Parameter-Varying Model Predictive Controller (LPV-MPC) for lateral control. Implemented on an IAC AV-24, the controller achieved stable performance at speeds exceeding 160 mph (71.5 m/s). We detail the controller design, the methodology for extracting model parameters, and key system-level and implementation considerations. Additionally, we report results from our final race run, providing a comprehensive analysis of both vehicle dynamics and controller performance. A Python implementation of the framework is available at: https://tinyurl.com/LPV-MPC-acados
REFINE-DP: Diffusion Policy Fine-tuning for Humanoid Loco-manipulation via Reinforcement Learning
Humanoid loco-manipulation requires coordinated high-level motion plans with stable, low-level whole-body execution under complex robot-environment dynamics and long-horizon tasks. While diffusion policies (DPs) show promise for learning from demonstrations, deploying them on humanoids poses critical challenges: the motion planner trained offline is decoupled from the low-level controller, leading to poor command tracking, compounding distribution shift, and task failures. The common approach of scaling demonstration data is prohibitively expensive for high-dimensional humanoid systems. To address this challenge, we present REFINE-DP (REinforcement learning FINE-tuning of Diffusion Policy), a hierarchical framework that jointly optimizes a DP high-level planner and an RL-based low-level loco-manipulation controller. The DP is fine-tuned via a PPO-based diffusion policy gradient to improve task success rate, while the controller is simultaneously updated to accurately track the planner's evolving command distribution, reducing the distributional mismatch that degrades motion quality. We validate REFINE-DP on a humanoid robot performing loco-manipulation tasks, including door traversal and long-horizon object transport. REFINE-DP achieves a success rate of over $90\%$ in simulation, even in out-of-distribution cases not seen in the pre-trained data, and enables smooth autonomous task execution in real-world dynamic environments. Our proposed method substantially outperforms pre-trained DP baselines and demonstrates that RL fine-tuning is key to reliable humanoid loco-manipulation. https://refine-dp.github.io/REFINE-DP/
D-Compress: Detail-Preserving LiDAR Range Image Compression for Real-Time Streaming on Resource-Constrained Robots ICRA 2026
Efficient 3D LiDAR point cloud compression (LPCC) and streaming are critical for edge server-assisted robotic systems, enabling real-time communication with compact data representations. A widely adopted approach represents LiDAR point clouds as range images, enabling the direct use of mature image and video compression codecs. However, because these codecs are designed with human visual perception in mind, they often compromise geometric details, which downgrades the performance of downstream robotic tasks such as mapping and object detection. Furthermore, rate-distortion optimization (RDO)-based rate control remains largely underexplored for range image compression (RIC) under dynamic bandwidth conditions. To address these limitations, we propose D-Compress, a new detail-preserving and fast RIC framework tailored for real-time streaming. D-Compress integrates both intra- and inter-frame prediction with an adaptive discrete wavelet transform approach for precise residual compression. Additionally, we introduce a new RDO-based rate control algorithm for RIC through new rate-distortion modeling. Extensive evaluations on various datasets demonstrate the superiority of D-Compress, which outperforms state-of-the-art (SOTA) compression methods in both geometric accuracy and downstream task performance, particularly at compression ratios exceeding 100x, while maintaining real-time execution on resource-constrained hardware. Moreover, evaluations under dynamic bandwidth conditions validate the robustness of its rate control mechanism.
comment: To appear in IEEE ICRA 2026
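To make the "inter-frame prediction plus wavelet-coded residual" idea concrete, here is a minimal sketch under loud assumptions: it uses a plain one-level Haar transform rather than the adaptive discrete wavelet transform the abstract describes, it predicts each range value from the same pixel in the previous frame, and it operates on a single short row. All function names and sample values are hypothetical.

```python
import math
from typing import List, Tuple

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def inter_frame_residual(current: List[float], previous: List[float]) -> List[float]:
    """Residual of one range-image row, predicted from the previous frame."""
    return [c - p for c, p in zip(current, previous)]


def haar_forward(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One-level orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) * _S for a, b in pairs]
    detail = [(a - b) * _S for a, b in pairs]
    return approx, detail


def haar_inverse(approx: List[float], detail: List[float]) -> List[float]:
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.append((a + d) * _S)
        out.append((a - d) * _S)
    return out


# Hypothetical range values in meters; the scene shifted by a uniform 0.2 m.
previous_row = [10.0, 10.5, 11.0, 11.5]
current_row = [10.2, 10.7, 11.2, 11.7]

residual = inter_frame_residual(current_row, previous_row)
approx, detail = haar_forward(residual)
reconstructed = haar_inverse(approx, detail)
```

When the residual is smooth, the detail coefficients are close to zero, so an entropy coder would need few bits for them; a real codec would also quantize the coefficients, which this sketch omits.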
SAATT Nav: a Socially Aware Autonomous Transparent Transportation Navigation Framework for Wheelchairs IROS 2026
While powered wheelchairs reduce physical fatigue compared with manual wheelchairs for individuals with mobility impairment, they demand high cognitive workload due to information processing, decision making and motor coordination. Current autonomous systems lack social awareness in navigation and transparency in decision-making, leading to decreased perceived safety and trust from the user and others in context. This work proposes the Socially Aware Autonomous Transparent Transportation (SAATT) Navigation framework for wheelchairs as a potential solution. By implementing a Large Language Model (LLM) informed of user intent and capable of predicting other people's intent as a decision-maker for its local controller, it is able to detect and navigate social situations, such as passing pedestrians or a pair conversing. Furthermore, the LLM textually communicates its reasoning at each waypoint for transparency. In this experiment, it is compared against a standard global planner, a representative competing social navigation model, and an ablated variant in three simulated environments varied by social levels in eight metrics categorized under Safety, Social Compliance, Efficiency, and Comfort. Overall, SAATT Nav outperforms the alternatives in most social situations and performs equivalently or only slightly worse in the remaining metrics, demonstrating the potential of a socially aware and transparent autonomous navigation system to assist wheelchair users.
comment: 8 pages, 4 figures, 2 tables, 1 algorithm. Submitted to IROS 2026
From Fold to Function: Simulation-Driven Design of Origami Mechanisms
Origami-inspired mechanisms can transform flat sheets into functional three-dimensional dynamic structures that are lightweight, compact, and capable of complex motion. These properties make origami increasingly valuable in robotic and deployable systems. However, accurately simulating their folding behavior and interactions with the environment remains challenging. To address this, we present a design framework for origami mechanism simulation that utilizes MuJoCo's deformable-body capabilities. In our approach, origami sheets are represented as graphs of interconnected deformable elements with user-specified constraints such as creases and actuation, defined through an intuitive graphical user interface (GUI). This framework allows users to generate physically consistent simulations that capture both the geometric structure of origami mechanisms and their interactions with external objects and surfaces. We demonstrate our method's utility through a case study on an origami catapult, where design parameters are optimized in simulation using the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and validated experimentally on physical prototypes. The optimized structure achieves improved throwing performance, illustrating how our system enables rapid, simulation-driven origami design, optimization, and analysis.
comment: 8 Pages, 9 Figures, Submitted to IEEE RoboSoft
Multi-Robot Navigation in Social Mini-Games: Definitions, Taxonomy, and Algorithms
The "Last Mile Challenge" has long been considered an important, yet unsolved, challenge for autonomous vehicles, public service robots, and delivery robots. A central issue in this challenge is the ability of robots to navigate constrained and cluttered environments that have high agency (e.g., doorways, hallways, corridor intersections), often while competing for space with other robots and humans. We refer to these environments as "Social Mini-Games" (SMGs). Traditional navigation approaches designed for multi-robot navigation (MRN) do not perform well in SMGs, which has led to focused research on dedicated SMG solvers. However, publications on SMG navigation research make different assumptions, and have different objective functions (safety versus liveness). These assumptions and objectives are sometimes left implicit or described informally. This makes it difficult to establish appropriate baselines for comparison in research papers, and difficult for practitioners to find the papers relevant to their concrete application. Such ad-hoc representation of the field also presents a barrier to new researchers wanting to start research in this area. SMG navigation research requires its own taxonomy, definitions, and evaluation protocols to guide effective research moving forward. This survey is the first to catalog SMG solvers using a well-defined and unified taxonomy and to classify existing methods accordingly. It also discusses the essential properties of SMG solvers, defines what SMGs are and how they appear in practice, outlines how to evaluate SMG solvers, and highlights the differences between SMG solvers and general navigation systems. The survey concludes with an overview of future directions and open challenges in the field. Our project is open-sourced at https://socialminigames.github.io/.
comment: Accepted for publication in Autonomous Robots 2026
SERFN: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SERFN, a sample-efficient off-policy fine-tuning framework with normalizing flow (NF) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SERFN on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp -- both of which require precise, dexterous control over long horizons. On these tasks, SERFN achieves stable, sample-efficient adaptation where standard methods struggle.
comment: https://srl-ethz.github.io/SERNF/
ComFree-Sim: A GPU-Parallelized Analytical Contact Physics Engine for Scalable Contact-Rich Robotics Simulation and Control
Physics simulation for contact-rich robotics is often bottlenecked by contact resolution: mainstream engines enforce non-penetration and Coulomb friction via complementarity constraints or constrained optimization, requiring per-step iterative solves whose cost grows superlinearly with contact density. We present ComFree-Sim, a GPU-parallelized analytical contact physics engine built on complementarity-free contact modeling. ComFree-Sim computes contact impulses in closed form via an impedance-style prediction--correction update in the dual cone of Coulomb friction. Contact computation decouples across contact pairs and becomes separable across cone facets, mapping naturally to GPU kernels and yielding near-linear runtime scaling with the number of contacts. We further extend the formulation to a unified 6D contact model capturing tangential, torsional, and rolling friction, and introduce a practical dual-cone impedance heuristic. ComFree-Sim is implemented in Warp and exposed through a MuJoCo-compatible interface as a drop-in backend alternative to MuJoCo Warp (MJWarp). Experiments benchmark penetration, friction behaviors, stability, and simulation runtime scaling against MJWarp, demonstrating near-linear scaling and 2--3 times higher throughput in dense contact scenes with comparable physical fidelity. We deploy ComFree-Sim in real-time MPC for in-hand dexterous manipulation on a real-world multi-fingered LEAP hand and in dynamics-aware motion retargeting, demonstrating that low-latency simulation yields higher closed-loop success rates and enables practical high-frequency control in contact-rich tasks.
comment: 9 pages
UniPrototype: Human-Robot Skill Learning with Uniform Prototypes
Data scarcity remains a fundamental challenge in robot learning. While human demonstrations benefit from abundant motion capture data and vast internet resources, robotic manipulation suffers from limited training examples. To bridge this gap between human and robot manipulation capabilities, we propose UniPrototype, a novel framework that enables effective knowledge transfer from human to robot domains via shared motion primitives. Our approach makes three key contributions: (1) We introduce a compositional prototype discovery mechanism with soft assignments, enabling multiple primitives to co-activate and thus capture blended and hierarchical skills; (2) We propose an adaptive prototype selection strategy that automatically adjusts the number of prototypes to match task complexity, ensuring scalable and efficient representation; (3) We demonstrate the effectiveness of our method through extensive experiments in both simulation environments and real-world robotic systems. Our results show that UniPrototype successfully transfers human manipulation knowledge to robots, significantly improving learning efficiency and task performance compared to existing approaches. The code and dataset will be released upon acceptance at an anonymous repository.
comment: This submission was uploaded in error and has been withdrawn. A substantial revision will need to be completed
Social Robots for People Living with Dementia: A Scoping Review on Deception from Design to Perception
As social robots are increasingly introduced into dementia care, their embodied and interactive design may blur the boundary between artificial and lifelike entities, raising ethical concerns about robotic deception. However, it remains unclear which specific design cues of social robots might lead to social robotic deception (SRD) in people living with dementia (PLwD), and which perceptions and responses of PLwD might indicate that SRD is taking place. To address these questions, we conducted a scoping review of 26 empirical studies reporting PLwD interacting with social robots. We identified three key design cue categories that might contribute to SRD and one that might break the illusion. However, the available literature does not provide sufficient evidence to determine which specific design cues lead to SRD. Thematic analysis of user responses reveals six recurring patterns in how PLwD perceive and respond to social robots. However, conceptual limitations in existing definitions of robotic deception make it difficult to identify when and to what extent deception actually occurs. Building on the results, we propose a dual-process interpretation that clarifies the cognitive basis of false beliefs in human-robot interaction and distinguishes SRD from anthropomorphism or emotional engagement.
Using VLM Reasoning to Constrain Task and Motion Planning IROS 2026
In task and motion planning, high-level task planning is done over an abstraction of the world to enable efficient search in long-horizon robotics problems. However, the feasibility of these task-level plans relies on the downward refinability of the abstraction into continuous motion. When a domain's refinability is poor, task-level plans that appear valid may ultimately fail during motion planning, requiring replanning and resulting in slower overall performance. Prior works mitigate this by encoding refinement issues as constraints to prune infeasible task plans. However, these approaches only add constraints upon refinement failure, expending significant search effort on infeasible branches. We propose VIZ-COAST, a method of leveraging the common-sense spatial reasoning of large pretrained Vision-Language Models to identify issues with downward refinement a priori, bypassing the need to fix these failures during planning. Experiments on three challenging TAMP domains show that our approach is able to extract plausible constraints from images and domain descriptions, drastically reducing planning times and, in some cases, eliminating downward refinement failures altogether, generalizing to a diverse range of instances from the broader domain.
comment: 9 pages, 7 figures, 1 table. Submitted to IROS 2026
Dribble Master: Learning Agile Humanoid Dribbling through Legged Locomotion
Humanoid soccer dribbling is a highly challenging task that demands dexterous ball manipulation while maintaining dynamic balance. Traditional rule-based methods often struggle to achieve accurate ball control due to their reliance on fixed walking patterns and limited adaptability to real-time ball dynamics. To address these challenges, we propose a two-stage curriculum learning framework that enables a humanoid robot to acquire dribbling skills without explicit dynamics or predefined trajectories. In the first stage, the robot learns basic locomotion skills; in the second stage, we fine-tune the policy for agile dribbling maneuvers. We further introduce a virtual camera model in simulation that reproduces the field of view and perception constraints of the real robot, enabling realistic ball perception during training. We also design heuristic rewards to encourage active sensing, promoting a broader visual range for continuous ball perception. The policy is trained in simulation and successfully transferred to a physical humanoid robot. Experimental results demonstrate that our method enables effective ball manipulation, achieving flexible and visually appealing dribbling behaviors across multiple environments. This work highlights the potential of reinforcement learning in developing agile humanoid soccer robots. Additional details and videos are available at https://zhuoheng0910.github.io/dribble-master/.
DyQ-VLA: Temporal-Dynamic-Aware Quantization for Embodied Vision-Language-Action Models
Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by inference overheads. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remains unsolved. To address these challenges, we propose DyQ-VLA, a dynamic quantization framework for VLAs. Specifically, a sensitivity-aware switching strategy leverages real-time kinematic proxies to trigger the bit-width switch, while a kinematic-guided module dynamically allocates the optimal bit-width. Experiments show that DyQ-VLA requires only 30.9% of the original memory footprint while maintaining 99.5% of its original performance, achieving 1.49x simulation and up to 1.43x real-world speedups.
IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping
Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.
comment: This version is being withdrawn because it was submitted without the final review and formal approval of all co-authors. The authors plan to resubmit a revised version once all internal approvals are secured
Humanoid Goalkeeper: Learning from Position Conditioned Task-Motion Constraints
We present a reinforcement learning framework for autonomous goalkeeping with humanoid robots in real-world scenarios. While prior work has demonstrated similar capabilities on quadrupedal platforms, humanoid goalkeeping introduces two critical challenges: (1) generating natural, human-like whole-body motions, and (2) covering a wider guarding range with an equivalent response time. Unlike existing approaches that rely on separate teleoperation or fixed motion tracking for whole-body control, our method learns a single end-to-end RL policy, enabling fully autonomous, highly dynamic, and human-like robot-object interactions. To achieve this, we integrate multiple human motion priors conditioned on perceptual inputs into the RL training via an adversarial scheme. We demonstrate the effectiveness of our method through real-world experiments, where the humanoid robot successfully performs agile, autonomous, and naturalistic interceptions of fast-moving balls. In addition to goalkeeping, we demonstrate the generalization of our approach through tasks such as ball escaping and grabbing. Our work presents a practical and scalable solution for enabling highly dynamic interactions between robots and moving objects, advancing the field toward more adaptive and lifelike robotic behaviors.
VLD: Visual Language Goal Distance for Reinforcement Learning Navigation
Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor. At deployment, the policy consumes VLD predictions, inheriting semantic goal information ("where to go") from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches, such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation with strong sim-to-real transfer, providing an alternative and, most importantly, scalable path toward reliable, multimodal navigation policies.
Balancing Safety and Optimality in Robot Path Planning: Algorithm and Metric
Path planning for autonomous robots faces a fundamental trade-off between path length and obstacle clearance. While existing algorithms typically prioritize a single objective, we introduce the Unified Path Planner (UPP), a graph-search algorithm that dynamically balances safety and optimality via adaptive heuristic weighting. UPP employs a local inverse-distance safety field and auto-tunes its parameters based on real-time search progress, achieving provable suboptimality bounds while maintaining superior clearance. To enable rigorous evaluation, we introduce the OptiSafe index, a normalized metric that quantifies the trade-off between safety and optimality. Extensive evaluation across 10 environments shows that UPP achieves a 0.94 OptiSafe score in cluttered environments, compared with 0.22-0.85 for existing methods, with only 0.5-1% path-length overhead in simulation and a 100% success rate. Hardware validation on TurtleBot confirms practical advantages despite sim-to-real gaps.
comment: 26 pages
GM3: A General Physical Model for Micro-Mobility Vehicles
Modeling the dynamics of micro-mobility vehicles (MMV) is becoming increasingly important for training autonomous vehicle systems and building urban traffic simulations. However, mainstream tools rely on variants of the Kinematic Bicycle Model (KBM) or mode-specific physics that miss tire slip, load transfer, and rider/vehicle lean. To our knowledge, no unified, physics-based model captures these dynamics across the full range of common MMVs and wheel layouts. We propose the "Generalized Micro-mobility Model" (GM3), a tire-level formulation based on the tire brush representation that supports arbitrary wheel configurations, including single/double track and multi-wheel platforms. We introduce an interactive, model-agnostic simulation framework that decouples vehicle/layout specification from dynamics to compare the GM3 with the KBM and other models. The framework provides fixed-step RK4 integration, human-in-the-loop and scripted control, real-time trajectory traces, and logging for analysis. We also empirically validate the GM3 on the Stanford Drone Dataset's deathCircle (roundabout) scene for biker, skater, and cart classes.
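For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a Kinematic Bicycle Model advanced with fixed-step RK4. It is not the GM3 or the authors' framework. It assumes the common rear-axle-referenced form of the KBM with speed and steering angle held constant over each step; the function names and parameter values are hypothetical.

```python
import math
from typing import Tuple

State = Tuple[float, float, float]  # (x [m], y [m], heading [rad])


def kbm_derivative(state: State, speed: float, steer: float, wheelbase: float) -> State:
    """Rear-axle kinematic bicycle model: no tire slip, load transfer, or lean."""
    _, _, heading = state
    return (
        speed * math.cos(heading),
        speed * math.sin(heading),
        speed * math.tan(steer) / wheelbase,
    )


def _advance(state: State, deriv: State, h: float) -> State:
    """Return state + h * deriv, element-wise."""
    return tuple(s + h * d for s, d in zip(state, deriv))


def rk4_step(state: State, speed: float, steer: float, wheelbase: float, dt: float) -> State:
    """One classical fourth-order Runge-Kutta step with fixed step size dt."""
    k1 = kbm_derivative(state, speed, steer, wheelbase)
    k2 = kbm_derivative(_advance(state, k1, dt / 2.0), speed, steer, wheelbase)
    k3 = kbm_derivative(_advance(state, k2, dt / 2.0), speed, steer, wheelbase)
    k4 = kbm_derivative(_advance(state, k3, dt), speed, steer, wheelbase)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state: State, speed: float, steer: float, wheelbase: float,
             dt: float, steps: int) -> State:
    """Roll the model forward for a number of fixed-size steps."""
    for _ in range(steps):
        state = rk4_step(state, speed, steer, wheelbase, dt)
    return state


# Constant steering traces a circle of radius wheelbase / tan(steer) = 2 m here.
final_state = simulate((0.0, 0.0, 0.0), speed=2.0, steer=math.atan(0.5),
                       wheelbase=1.0, dt=2.0 * math.pi / 1000, steps=1000)
```

With these inputs the vehicle completes one full lap, so the final position should return close to the origin; because the model is purely kinematic, it reaches the same path at any speed, which is exactly the limitation (no slip or lean) that the abstract attributes to KBM-based tools.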
Decoupled Action Expert: Confining Task Knowledge to the Conditioning Pathway
Many recent Vision-Language-Action models employ diffusion or flow-matching backbones with hundreds of millions of parameters for action generation. However, unlike image synthesis where the output spans millions of diverse pixels, a manipulation policy generates only short sequences of low-dimensional, physically correlated action values, a far simpler target that should not demand such capacity. We confirm this intuition and show that task-specific knowledge in these policies can be fully confined to the conditioning pathway, leaving the action backbone task-agnostic. To establish this, we introduce a decoupled training recipe: a general-purpose action head is first pretrained on observation-free forward-kinematics data, then frozen while only the conditioning pathway is trained for downstream tasks. Using Diffusion Policy as a testbed, we show that on both MimicGen and LIBERO, a single frozen backbone shared across all tasks matches normally trained counterparts. This confirms that the action expert encodes little task-specific knowledge. Ablations show that the specific pretraining signal (joint positions, end-effector poses, or no conditioning at all) has no effect on downstream performance, indicating that the backbone learns only general trajectory structure. Pushing this finding further, we replace the 244M U-Net in Diffusion Policy with a 5M-parameter MLP backbone that matches or exceeds its performance, calling into question the large capacity budgets allocated to action generation in current VLA designs.
Hierarchical Diffusion Motion Planning with Task-Conditioned Uncertainty-Aware Priors
We propose a novel hierarchical diffusion planner that embeds task and motion structure directly into the noise model. Unlike standard diffusion-based planners that rely on zero-mean, isotropic Gaussian corruption, we introduce task-conditioned structured Gaussians whose means and covariances are derived from Gaussian Process Motion Planning (GPMP), explicitly encoding trajectory smoothness and task semantics in the prior. We first generalize the standard diffusion process to biased, non-isotropic corruption with closed-form forward and posterior expressions. Building on this formulation, our hierarchical design separates prior instantiation from trajectory denoising. At the upper level, the model predicts sparse, task-centric key states and their associated timings, which instantiate a structured Gaussian prior (mean and covariance). At the lower level, the full trajectory is denoised under this fixed prior, treating the upper-level outputs as noisy observations. Experiments on Maze2D goal-reaching and KUKA block stacking show consistently higher success rates and smoother trajectories than isotropic baselines, achieving dataset-level smoothness substantially earlier during training. Ablation studies further show that explicitly structuring the corruption process provides benefits beyond neural conditioning of the denoising network alone. Overall, our approach concentrates the prior's probability mass near feasible and semantically meaningful trajectories. Our project page is available at https://hta-diffusion.github.io.
Graphite: A GPU-Accelerated Mixed-Precision Graph Optimization Framework ICRA 2026
We present Graphite, a GPU-accelerated nonlinear least squares graph optimization framework. It provides a CUDA C++ interface to enable the sharing of code between a real-time application, such as a SLAM system, and its optimization tasks. The framework supports techniques to reduce memory usage, including in-place optimization, support for multiple floating point types and mixed-precision modes, and dynamically computed Jacobians. We evaluate Graphite on well-known bundle adjustment problems and find that it achieves similar performance to MegBA, a solver specialized for bundle adjustment, while maintaining generality and using less memory. We also apply Graphite to global visual-inertial bundle adjustment on maps generated from stereo-inertial SLAM datasets, and observe speed-ups of up to 59x compared to a CPU baseline. Our results indicate that our framework enables faster large-scale optimization on both desktop and resource-constrained devices.
comment: Accepted to ICRA 2026
Optimal Modified Feedback Strategies in LQ Games under Control Imperfections
Game-theoretic approaches and Nash equilibrium have been widely applied across various engineering domains. However, practical challenges such as disturbances, delays, and actuator limitations can hinder the precise execution of Nash equilibrium strategies. This work investigates the impact of such implementation imperfections on game trajectories and players' costs in the context of a two-player finite-horizon linear quadratic (LQ) nonzero-sum game. Specifically, we analyze how small deviations by one player, measured or estimated at each stage, affect the state trajectory and the other player's cost. To mitigate these effects, we construct a compensation law for the influenced player by augmenting the nominal game with the measurable deviation dynamics. The resulting policy is shown to be optimal within a causal affine policy class, and, for sufficiently small deviations, it locally outperforms the uncompensated equilibrium-derived feedback. Rigorous analysis and proofs are provided, and the effectiveness of the proposed approach is demonstrated through a representative numerical example.
comment: 8 pages, 2 figures, Manuscript accepted to ACC 2026
Multiagent Systems
Chance-Constrained Correlated Equilibria for Robust Noncooperative Coordination
Correlated equilibria enable a coordinator to influence the self-interested agents by recommending actions that no player has an incentive to deviate from. However, the effectiveness of this mechanism relies on accurate knowledge of the agents' cost structures. When cost parameters are uncertain, the recommended actions may no longer be incentive compatible, allowing agents to benefit from deviating from them. We study a chance-constrained correlated equilibrium problem formulation that accounts for uncertainty in agents' costs and guarantees incentive compatibility with a prescribed confidence level. We derive sensitivity results that quantify how uncertainty in individual incentive constraints affects the expected coordination outcome. In particular, the analysis characterizes the value of information by relating the marginal benefit of reducing uncertainty to the dual sensitivities of the incentive constraints, providing guidance on which sources of uncertainty should be prioritized for information acquisition. The results further reveal that increasing the confidence level is not always beneficial and can introduce a tradeoff between robustness and system efficiency. Numerical experiments demonstrate that the proposed framework maintains coordination performance in uncertain environments and are consistent with the theoretical insights developed in the analysis.
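For orientation, the incentive constraints behind the abstract above can be written out in their textbook form. This is a generic sketch and not the paper's exact formulation: the symbols $\pi$, $c_i$, $\xi$, and $\varepsilon$ are assumed notation, and whether the chance constraint is imposed per constraint or jointly is not stated in the abstract.

```latex
% Assumed notation: \pi is the coordinator's distribution over joint actions
% a = (a_i, a_{-i}); c_i is the cost of agent i (agents minimize cost).
% Deterministic correlated-equilibrium constraint: following the
% recommendation a_i is no worse than deviating to any a_i'.
\sum_{a_{-i}} \pi(a_i, a_{-i}) \bigl[ c_i(a_i', a_{-i}) - c_i(a_i, a_{-i}) \bigr] \ge 0
\qquad \forall i,\ \forall a_i, a_i'

% Chance-constrained variant (individual form, an assumption): costs depend on
% an uncertain parameter \xi, and each constraint must hold with
% probability at least 1 - \varepsilon.
\mathbb{P}_{\xi}\!\Bigl( \sum_{a_{-i}} \pi(a_i, a_{-i})
  \bigl[ c_i(a_i', a_{-i}; \xi) - c_i(a_i, a_{-i}; \xi) \bigr] \ge 0 \Bigr)
\ge 1 - \varepsilon
```

Raising the confidence level $1 - \varepsilon$ shrinks the set of feasible $\pi$, which is one way to read the robustness-versus-efficiency tradeoff the abstract mentions.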
A Benchmark for Multi-Party Negotiation Games from Real Negotiation Data
Many real-world multi-party negotiations unfold as sequences of binding, action-level commitments rather than a single final outcome. We introduce a benchmark for this under-studied regime featuring a configurable game generator that sweeps key structural properties such as incentive alignment, goal complexity, and payoff distribution. To evaluate decision-making, we test three value-function approximations - myopic reward, an optimistic upper bound, and a pessimistic lower bound - that act as biased lenses on deal evaluation. Through exact evaluation on small games and comparative evaluation on large, document-grounded instances derived from the Harvard Negotiation Challenge, we map the strategic regimes where each approximation succeeds or fails. We observe that different game structures demand different valuation strategies, motivating agents that learn robust state values and plan effectively over long horizons under binding commitments and terminal-only rewards.
A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning CVPR2026
This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answering (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest-performing agent) and re-staging (e.g., new-clue and matching-block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 10 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
comment: Accepted by CVPR2026
Beyond Self-Interest: Modeling Social-Oriented Motivation for Human-like Multi-Agent Interactions AAMAS 2026
Large Language Models (LLMs) demonstrate significant potential for generating complex behaviors, yet most approaches lack mechanisms for modeling social motivation in human-like multi-agent interaction. We introduce Autonomous Social Value-Oriented agents (ASVO), where LLM-based agents integrate desire-driven autonomy with Social Value Orientation (SVO) theory. At each step, agents first update their beliefs by perceiving environmental changes and others' actions. These observations inform the value update process, where each agent updates multi-dimensional desire values through reflective reasoning and infers others' motivational states. By contrasting self-satisfaction derived from fulfilled desires against estimated others' satisfaction, agents dynamically compute their SVO along a spectrum from altruistic to competitive, which in turn guides activity selection to balance desire fulfillment with social alignment. Experiments across School, Workplace, and Family contexts demonstrate substantial improvements over baselines in behavioral naturalness and human-likeness. These findings show that structured desire systems and adaptive SVO drift enable realistic multi-agent social simulations.
comment: 9 pages, 6 figures. Accepted to AAMAS 2026 (Oral)
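As background for the SVO spectrum mentioned in the abstract above, here is a minimal sketch of the standard SVO angle from the social-psychology literature. It is not the ASVO update rule, which the abstract does not specify. The function names are hypothetical, and the category thresholds are the ones commonly used with the SVO Slider Measure.

```python
import math

def svo_angle_deg(self_payoff, other_payoff):
    """Angle of the (self, other) payoff vector, in degrees.

    0 deg means purely self-regarding; positive angles weight others'
    outcomes positively, negative angles weight them negatively.
    """
    return math.degrees(math.atan2(other_payoff, self_payoff))

def svo_category(angle_deg):
    """Map an SVO angle to a category (SVO Slider Measure boundaries)."""
    if angle_deg > 57.15:
        return "altruistic"
    if angle_deg > 22.45:
        return "prosocial"
    if angle_deg > -12.04:
        return "individualistic"
    return "competitive"

# Hypothetical usage: equal weight on own and others' satisfaction.
angle = svo_angle_deg(1.0, 1.0)
label = svo_category(angle)
```

An agent that recomputes this angle each step from its own and its estimate of others' satisfaction would drift along the spectrum over time, which is the kind of behavior the abstract describes.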
How do Role Models Shape Collective Morality? Exemplar-Driven Moral Learning in Multi-Agent Simulation
Do We Need Role Models? How do Role Models Shape Collective Morality? To explore these questions, we build a multi-agent simulation powered by a Large Language Model, where agents with diverse intrinsic drives, ranging from cooperative to competitive, interact and adapt through a four-stage cognitive loop (plan-act-observe-reflect). We design four experimental games (Alignment, Collapse, Conflict, and Construction) and conduct motivational ablation studies to identify the key drivers of imitation. The results indicate that identity-driven conformity can powerfully override initial dispositions. Agents consistently adapt their values to align with a perceived successful exemplar, leading to rapid value convergence.
ClimateAgents: A Multi-Agent Research Assistant for Social-Climate Dynamics Analysis
The complex interaction between social behaviors and climate change requires more than traditional data-driven prediction; it demands interpretable and adaptive analytical frameworks capable of integrating heterogeneous sources of knowledge. This study introduces ClimateAgents, a multi-agent research assistant designed to support social-climate analysis through coordinated AI agents. Rather than focusing solely on predictive modeling, the framework assists researchers in exploring socio-environmental dynamics by integrating multimodal data retrieval, statistical modeling, textual analysis, and automated reasoning. Traditional approaches to climate analysis often address narrowly defined indicators and lack the flexibility to incorporate cross-domain socio-economic knowledge or adapt to evolving research questions. To address these limitations, ClimateAgents employs a set of collaborative, domain-specialized agents that collectively perform key stages of the research workflow, including hypothesis generation, data analysis, evidence retrieval, and structured reporting. The framework supports exploratory analysis and scenario investigation using datasets from sources such as the United Nations and the World Bank. By combining agent-based reasoning with quantitative analysis of socio-economic behavioral dynamics, ClimateAgents enables adaptive and interpretable exploration of relationships between climate indicators, social variables, and environmental outcomes. The results illustrate how multi-agent AI systems can augment analytical reasoning and facilitate interdisciplinary, data-driven investigation of complex socio-environmental systems.
Non-trivial consensus on directed signed matrix-weighted networks with compound measurement noises and time-varying topologies
This paper studies non-trivial consensus--a relatively novel and unexplored convergence behavior--on directed signed matrix-weighted networks subject to both additive and multiplicative measurement noises under time-varying topologies. Building upon grounded matrix-weighted Laplacian properties, a stochastic dynamic model is established that simultaneously captures inter-dimensional cooperative and antagonistic interactions, compound measurement noises, and time-varying network structures. Based on stochastic differential equation theory, protocols that guarantee mean-square and almost-sure non-trivial consensus are proposed. Specifically, for any predetermined non-trivial consensus state, all agents are proven to converge toward this non-zero value in the mean-square and almost-sure senses. The design of the control gain function in our protocols balances the cumulative effect over time, the asymptotic decay property, and the finite energy corresponding to measurement noises. Notably, the conditions on time-varying topologies in our protocols only require boundedness of the elements of the edge weight matrices, which makes the concept of "time-varying topology" more practical in matrix-weighted network consensus algorithms. Furthermore, the proposed protocols operate under milder connectivity conditions and impose no requirements on structural (un)balance properties. The work in this paper demonstrates that groups with both cooperative and antagonistic inter-dimensional interactions can achieve consensus even in the presence of compound measurement noises and time-varying topologies, challenging the conventional belief that consensus is attainable only in fully cooperative settings.
Multi-Robot Coordination for Planning under Context Uncertainty
Real-world robots often operate in settings where objective priorities depend on the underlying context of operation. When the underlying context is unknown a priori, multiple robots may have to coordinate to gather informative observations to infer the context, since acting based on an incorrect context can lead to misaligned and unsafe behavior. Once the underlying true context is inferred, the robots optimize their task-specific objectives in the preference order induced by the context. We formalize this problem as a Multi-Robot Context-Uncertain Stochastic Shortest Path (MR-CUSSP), which captures context-relevant information at landmark states through joint observations. Our two-stage solution approach is composed of: (1) CIMOP (Coordinated Inference for Multi-Objective Planning) to compute plans that guide robots toward informative landmarks to efficiently infer the true context, and (2) LCBS (Lexicographic Conflict-Based Search) for collision-free multi-robot path planning with lexicographic objective preferences induced by the context. We evaluate the algorithms using three simulated domains and demonstrate their practical applicability using five mobile robots in the salp domain setup.
comment: 8 pages, 6 figures
Grassroots Bonds: A Grassroots Foundation for Market Liquidity
Global cryptocurrencies are unbacked and have high transaction cost incurred by global consensus. In contrast, grassroots cryptocurrencies are backed by the goods and services of their issuers -- any person, natural or legal -- and have no transaction cost beyond operating a smartphone. Liquidity in grassroots cryptocurrencies arises from mutual credit via coin exchange among issuers. However, as grassroots coins are redeemable 1-for-1 against any other grassroots coin, the credit-forming exchange must also be 1-for-1, lest prompt redemption after exchange would leave the parties with undue profit or loss. Thus, grassroots coins are incongruent with liquidity through interest-bearing credit. Here we introduce grassroots bonds, which extend grassroots coins with a maturity date, reframing grassroots coins -- cash -- as mature grassroots bonds. Bond redemption generalises coin redemption, allowing the lending of liquid coins in exchange for interest-bearing future-maturity bonds. We show that digital social contracts -- voluntary agreements among persons, specified, fulfilled, and enforced digitally -- can express the full gamut of financial instruments as the voluntary swap of grassroots bonds, including credit lines, loans, sale of debt, forward contracts, options, and escrow-based instruments, and that classical liquidity ratios are applicable just as well to grassroots bonds. The formal specification presented here was used by AI to derive a working implementation of grassroots bonds in GLP, a concurrent logic programming language implemented in Dart for smartphone deployment. The implementation is illustrated by a running multiagent village market scenario, also implemented in GLP by AI.
Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis? EACL 2026
Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
comment: Accepted as Oral at the EACL 2026 Workshop on Healthcare and Language Learning (HeaLing)
Multi-Robot Navigation in Social Mini-Games: Definitions, Taxonomy, and Algorithms
The "Last Mile Challenge" has long been considered an important, yet unsolved, challenge for autonomous vehicles, public service robots, and delivery robots. A central issue in this challenge is the ability of robots to navigate constrained and cluttered environments that have high agency (e.g., doorways, hallways, corridor intersections), often while competing for space with other robots and humans. We refer to these environments as "Social Mini-Games" (SMGs). Traditional navigation approaches designed for multi-robot navigation (MRN) do not perform well in SMGs, which has led to focused research on dedicated SMG solvers. However, publications on SMG navigation research make different assumptions and have different objective functions (safety versus liveness). These assumptions and objectives are sometimes left implicit or described informally. This makes it difficult to establish appropriate baselines for comparison in research papers, and difficult for practitioners to find the papers relevant to their concrete application. Such ad-hoc representation of the field also presents a barrier to new researchers wanting to start research in this area. SMG navigation research requires its own taxonomy, definitions, and evaluation protocols to guide effective research moving forward. This survey is the first to catalog SMG solvers using a well-defined and unified taxonomy and to classify existing methods accordingly. It also discusses the essential properties of SMG solvers, defines what SMGs are and how they appear in practice, outlines how to evaluate SMG solvers, and highlights the differences between SMG solvers and general navigation systems. The survey concludes with an overview of future directions and open challenges in the field. Our project is open-sourced at https://socialminigames.github.io/.
comment: Accepted for publication in Autonomous Robots 2026
Optimal Modified Feedback Strategies in LQ Games under Control Imperfections
Game-theoretic approaches and Nash equilibrium have been widely applied across various engineering domains. However, practical challenges such as disturbances, delays, and actuator limitations can hinder the precise execution of Nash equilibrium strategies. This work investigates the impact of such implementation imperfections on game trajectories and players' costs in the context of a two-player finite-horizon linear quadratic (LQ) nonzero-sum game. Specifically, we analyze how small deviations by one player, measured or estimated at each stage, affect the state trajectory and the other player's cost. To mitigate these effects, we construct a compensation law for the influenced player by augmenting the nominal game with the measurable deviation dynamics. The resulting policy is shown to be optimal within a causal affine policy class, and, for sufficiently small deviations, it locally outperforms the uncompensated equilibrium-derived feedback. Rigorous analysis and proofs are provided, and the effectiveness of the proposed approach is demonstrated through a representative numerical example.
comment: 8 pages, 2 figures, Manuscript accepted to ACC 2026
Systems and Control (EESS)
Chaos-Free Networks are Stable Recurrent Neural Networks
Gated Recurrent Neural Networks (RNNs) are widely used for nonlinear system identification due to their high accuracy, although they often exhibit complex, chaotic dynamics that are difficult to analyze. This paper investigates the system-theoretic properties of the Chaos-Free Network (CFN), an architecture originally proposed to eliminate the chaotic behavior found in standard gated RNNs. First, we formally prove that the CFN satisfies Input-to-State Stability (ISS) by design. However, we demonstrate that ensuring Incremental ISS (delta-ISS) still requires specific parametric constraints on the CFN architecture. Then, to address this, we introduce the Decoupled-Gate Network (DGN), a novel structural variant of the CFN that removes internal state connections in the gating mechanisms. Finally, we prove that the DGN unconditionally satisfies the delta-ISS property, providing an incrementally stable architecture for identifying nonlinear dynamical systems without requiring complex network training modifications. Numerical results confirm that the DGN maintains the modeling capabilities of standard architectures while adhering to these rigorous stability guarantees.
comment: Preprint submitted to IEEE Control Systems Letters (L-CSS) and IEEE Conference on Decision and Control (CDC) 2026. 6 pages, 2 figures
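To make the two architectures in the abstract above concrete, the sketch below implements the CFN update as usually stated in the original chaos-free network literature, h_t = theta_t * tanh(h_{t-1}) + eta_t * tanh(W x_t), with sigmoid gates. The `decoupled=True` branch is only our reading of "removes internal state connections in the gating mechanisms" (gates that see the input but not the state); the DGN's exact definition is not given in the abstract. Parameter names and the random initialization are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def cfn_step(h, x, p, decoupled=False):
    """One step of h_t = theta * tanh(h_{t-1}) + eta * tanh(W x_t).

    With decoupled=True the gates ignore h_{t-1} (assumed DGN variant).
    """
    n = len(h)
    zero = [0.0] * n
    rec_theta = zero if decoupled else matvec(p["U_theta"], h)
    rec_eta = zero if decoupled else matvec(p["U_eta"], h)
    in_theta = matvec(p["V_theta"], x)
    in_eta = matvec(p["V_eta"], x)
    cand = matvec(p["W"], x)
    out = []
    for i in range(n):
        theta = sigmoid(rec_theta[i] + in_theta[i] + p["b_theta"][i])
        eta = sigmoid(rec_eta[i] + in_eta[i] + p["b_eta"][i])
        out.append(theta * math.tanh(h[i]) + eta * math.tanh(cand[i]))
    return out

def random_params(n, m, rng):
    """Hypothetical initialization: uniform weights in [-1, 1], zero biases."""
    def mat(r, c):
        return [[rng.uniform(-1.0, 1.0) for _ in range(c)] for _ in range(r)]
    return {"U_theta": mat(n, n), "U_eta": mat(n, n),
            "V_theta": mat(n, m), "V_eta": mat(n, m), "W": mat(n, m),
            "b_theta": [0.0] * n, "b_eta": [0.0] * n}

def rollout(h0, xs, p, decoupled=False):
    """Apply cfn_step along an input sequence and return the final state."""
    h = list(h0)
    for x in xs:
        h = cfn_step(h, x, p, decoupled)
    return h
```

Because both gates lie in (0, 1) and tanh is bounded by 1, every state component stays below 2 in magnitude for any input. With input-only gates, two trajectories driven by the same input differ by theta * (tanh(h) - tanh(h')) elementwise, and tanh is 1-Lipschitz, so the gap cannot grow. That is consistent with the incremental stability claim, although it is not the paper's proof.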
Energy-Aware Integrated Proactive Maintenance Planning and Production Scheduling
Demand-side energy management, such as the real-time pricing (RTP) program, offers manufacturers opportunities to reduce energy costs by shifting production to low-price hours. However, this strategy is challenging to implement when machine degradation is considered, as degraded machines have decreased processing capacity and increased energy consumption. Proactive maintenance (PM) can restore machine health but requires production downtime, creating a challenging trade-off: scheduling maintenance during low-price periods sacrifices energy savings opportunities, while deferring maintenance leads to capacity losses and higher energy consumption. To address this challenge, we propose a hierarchical bi-level control framework that jointly optimizes PM planning and runtime production scheduling, considering the machine degradation. A higher-level optimization, with the lower-level model predictive control (MPC) embedded as a sub-problem, determines PM plans that minimize total operational costs under day-ahead RTP. At runtime, the lower-level MPC executes closed-loop production scheduling to minimize energy costs under realized RTP, meeting delivery targets. Simulation results from a lithium-ion battery pack assembly line case study demonstrate that the framework strategically shifts PM away from bottlenecks and high-price hours, meeting daily production targets while reducing energy costs.
Amortizing Trajectory Diffusion with Keyed Drift Fields
Diffusion-based trajectory planners can synthesize rich, multimodal action sequences for offline reinforcement learning, but their iterative denoising incurs substantial inference-time cost, making closed-loop planning slow under tight compute budgets. We study the problem of achieving diffusion-like trajectory planning behavior with one-step inference, while retaining the ability to sample diverse candidate plans and condition on the current state in a receding-horizon control loop. Our key observation is that conditional trajectory generation fails under naïve distribution-matching objectives when the similarity measure used to align generated trajectories with the dataset is dominated by unconstrained future dimensions. In practice, this causes attraction toward average trajectories, collapses action diversity, and yields near-static behavior. Our key insight is that conditional generative planning requires a conditioning-aware notion of neighborhood: trajectory updates should be computed using distances in a compact key space that reflects the condition, while still applying updates in the full trajectory space. Building on this, we introduce Keyed Drifting Policies (KDP), a one-step trajectory generator trained with a drift-field objective that attracts generated trajectories toward condition-matched dataset windows and repels them from nearby generated samples, using a stop-gradient drifted target to amortize iterative refinement into training. At inference, the resulting policy produces a full trajectory window in a single forward pass. Across standard RL benchmarks and real-time hardware deployments, KDP achieves strong performance with one-step inference and substantially lower planning latency than diffusion sampling. Project website, code and videos: https://keyed-drifting.github.io/
Schrödinger Bridge Over A Compact Connected Lie Group
This work studies the Schrödinger bridge problem for the kinematic equation on a compact connected Lie group. The objective is to steer a controlled diffusion between given initial and terminal densities supported over the Lie group while minimizing the control effort. We develop a coordinate-free formulation of this stochastic optimal control problem that respects the underlying geometric structure of the Lie group, thereby avoiding limitations associated with local parameterizations or embeddings in Euclidean spaces. We establish the existence and uniqueness of the solution to the corresponding Schrödinger system. Our results are constructive in that they derive a geometric controller that optimally interpolates probability densities supported over the Lie group. To illustrate the results, we provide numerical examples on $\mathsf{SO}(2)$ and $\mathsf{SO}(3)$.
Distributional Uncertainty and Adaptive Decision-Making in System
Complex engineered systems require coordinated design choices across heterogeneous components under multiple conflicting objectives and uncertain specifications. Monotone co-design provides a compositional framework for such problems by modeling each subsystem as a design problem: a feasible relation between provided functionalities and required resources in partially ordered sets. Existing uncertain co-design models rely on interval bounds, which support worst-case reasoning but cannot represent probabilistic risk or multi-stage adaptive decisions. We develop a distributional extension of co-design that models uncertain design outcomes as distributions over design problems and supports adaptive decision processes through Markov-kernel re-parameterizations. Using quasi-measurable and quasi-universal spaces, we show that the standard co-design interconnection operations remain compositional under this richer notion of uncertainty. We further introduce queries and observations that extract probabilistic design trade-offs, including feasibility probabilities, confidence bounds, and distributions of minimal required resources. A task-driven unmanned aerial vehicle case study illustrates how the framework captures risk-sensitive and information-dependent design choices that interval-based models cannot express.
LLM-Guided Safe Reinforcement Learning for Energy System Topology Reconfiguration
The increasing penetration of renewable generation and the growing variability of electrified demand introduce substantial operational uncertainty to modern power systems. Topology reconfiguration is widely recognized as an effective and economical means to enhance grid resilience. Due to the coexistence of AC power-flow constraints and discrete switching decisions, topology reconfiguration in large-scale systems leads to a highly nonlinear and nonconvex optimization problem, making traditional methods computationally prohibitive. Consequently, several studies have explored reinforcement learning-based approaches to improve scalability and operational efficiency. However, their practical implementation is challenged by the high-dimensional combinatorial action space and the need to ensure safety during learning-based decision-making. To address these challenges, this paper presents a safe and intelligent topology control framework that integrates Large Language Models (LLMs) with a Safety Soft Actor-Critic (Safety-SAC) architecture. Operational voltage and thermal limits are reformulated into smooth safety-cost signals, enabling risk-aware policy optimization within a constrained Markov decision process. A knowledge-based Safety-LLM module is further introduced to refine unsafe or suboptimal transitions through domain knowledge and state-informed reasoning, thus guiding the learning agent toward safer and more effective switching actions. Experiments on the IEEE 36-bus and 118-bus Grid2Op benchmarks show that the proposed method consistently achieves higher reward, longer survival time, and lower safety cost than SAC, ACE, and their safety-enhanced variants. These results demonstrate the potential of combining LLM-based reasoning with safe reinforcement learning to achieve scalable and reliable grid topology control.
Discrete-time linear quadratic stochastic control with equality-constrained inputs: Application to energy demand response
We investigate the discrete-time stochastic linear quadratic control problem for a population of cooperative agents under a hard equality constraint on total control inputs, motivated by demand response in renewable energy systems. We establish the optimal solution that respects hard equality constraints for systems with additive noise in the dynamics. The optimal control law is derived using dynamic programming and Karush-Kuhn-Tucker (KKT) conditions, and the resulting control solution depends on a discrete-time Riccati-like recursive equation. Application examples of coordinating the charging of a network of residential batteries to absorb excess solar power generation are demonstrated, and the proposed control is shown to achieve exact power tracking while considering individual State-of-Charge (SoC) objectives.
comment: 7 pages, Accepted for publication in American Control Conference
Safety in Admittance Control using Reference Trajectory Shaping
This paper presents a switched model reference admittance control framework to achieve safe and compliant human-robot collaboration through reference trajectory shaping. The proposed method generates variable admittance parameters according to task compliance and task-space safety requirements. Additionally, a disturbance bound is incorporated to enhance robustness against disturbances. Safety guarantees are explicitly established by integrating invariance control, ensuring that the reference trajectory remains within the admissible region. Stability of the switched system is analyzed using a common quadratic Lyapunov function, which confirms asymptotic convergence of the tracking error. The effectiveness of the approach is demonstrated through simulations on a two-link manipulator, and comparisons with existing methods are also presented. Furthermore, real-time implementation on a single-link manipulator validates the practical feasibility of the controller, highlighting its ability to achieve both compliance and safety in physical interaction scenarios.
On the Impact of Operating Points on Small-Signal Stability: Decentralized Stability Sets via Scaled Relative Graphs SC
This paper presents a decentralized frequency-domain framework to characterize the influence of the operating point on the small-signal stability of converter-dominated power systems. The approach builds on Scaled Relative Graph (SRG) analysis, extended here to address Linear Parameter-Varying (LPV) systems. By exploiting the affine dependence of converter admittances on their steady-state operating points, the centralized small-signal stability assessment of the grid is decomposed into decentralized, frequency-wise geometric tests. Each converter can independently evaluate its feasible stability region, expressed as a set of linear inequalities in its parameter space. The framework provides closed-form geometric characterizations applicable to both grid-following (GFL) and grid-forming (GFM) converters, and validation results confirm its effectiveness.
comment: To be presented at PSCC 2026
Fully Distributed Adaptive Consensus Approach for Economic Dispatch Problem
This research presents a novel approach to solving the economic load dispatch (ELD) problem in smart grid systems by leveraging a multi-agent distributed consensus strategy. The core idea revolves around achieving agreement among generators on their incremental cost values, thereby enabling an optimal allocation of power generation. To enhance convergence and robustness, the study introduces an adaptive coupling weight mechanism within a fully decentralized consensus framework, carefully designed with appropriate initial settings for incremental costs. The proposed distributed control protocol is versatile: it functions effectively in both constrained and unconstrained generator capacity scenarios. Importantly, the methodology ensures that total power generation continuously matches dynamic load demands throughout the dispatch process, maintaining system-wide balance. To accommodate fluctuating and time-varying load profiles, a dummy node is incorporated into the network architecture, acting as a flexible proxy for real-time demand changes. The resilience of the method is further evaluated under communication disruptions, specifically by analyzing generator link failures through a switching network topology. Stability of the system is rigorously established using a Lyapunov-based analysis, assuming an undirected and connected communication graph among agents. To validate the practical efficacy of the proposed technique, comprehensive simulations are conducted on the IEEE 30-bus test system within the MATLAB environment, confirming its accuracy, adaptability, and computational efficiency in realistic smart grid conditions.
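The agreement on incremental cost that the abstract above describes has a simple closed form in the unconstrained case, which can serve as a reference value when checking a distributed protocol. The sketch below is a centralized computation, not the paper's adaptive consensus law. It assumes quadratic generator costs C_i(P) = a_i P^2 + b_i P + c_i, and the coefficients in the usage lines are made up.

```python
def equal_incremental_cost_dispatch(a, b, demand):
    """Unconstrained economic dispatch for costs a_i*P^2 + b_i*P + c_i.

    Returns (lam, P) where every generator has the same incremental cost
    2*a_i*P_i + b_i = lam and the outputs sum to the demand.
    """
    inv = [1.0 / (2.0 * ai) for ai in a]
    # Summing P_i = (lam - b_i) / (2 a_i) over i and setting it to demand
    # gives lam in closed form.
    lam = (demand + sum(bi * w for bi, w in zip(b, inv))) / sum(inv)
    P = [(lam - bi) * w for bi, w in zip(b, inv)]
    return lam, P

# Hypothetical three-generator example.
a = [0.01, 0.02, 0.04]
b = [2.0, 1.5, 1.0]
lam, P = equal_incremental_cost_dispatch(a, b, 300.0)
```

A distributed consensus scheme in the unconstrained setting should drive every generator's local incremental cost estimate to this common `lam`. Capacity limits, which the abstract also covers, are not handled here.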
Fully distributed consensus control for stochastic multi-agent systems under undirected and directed topologies
This work addresses the design of fully distributed control protocols for stochastic consensus and, for the first time, establishes the existence and uniqueness of solutions for the path-dependent and highly nonlinear closed-loop systems under both undirected and directed topologies, bridging a critical gap in the literature. For the case of directed graphs, a unified fully distributed control protocol is designed for the first time to guarantee mean-square and almost-sure consensus of stochastic multi-agent systems. Moreover, an enhanced fully distributed protocol with additional tunable parameters is proposed for undirected graphs, which guarantees stochastic consensus while achieving superior convergence speed. Additionally, our work provides explicit exponential estimates for the corresponding convergence rates of stochastic consensus, elucidating the relationship between the exponential convergence rate and the system parameters. Simulations validate the theoretical results.
comment: 13 pages, 7 figures
Non-trivial consensus on directed signed matrix-weighted networks with compound measurement noises and time-varying topologies
This paper studies non-trivial consensus--a relatively novel and unexplored convergence behavior--on directed signed matrix-weighted networks subject to both additive and multiplicative measurement noises under time-varying topologies. Building upon grounded matrix-weighted Laplacian properties, a stochastic dynamic model is established that simultaneously captures inter-dimensional cooperative and antagonistic interactions, compound measurement noises and time-varying network structures. Based on stochastic differential equations theory, protocols that guarantee mean square and almost sure non-trivial consensus are proposed. Specifically, for any predetermined non-trivial consensus state, all agents are proven to converge toward this non-zero value in the mean-square and almost-sure senses. The design of the control gain function in our protocols highlights a balanced consideration of the cumulative effect over time, the asymptotic decay property and the finite energy corresponding to measurement noises. Notably, the conditions on time-varying topologies in our protocols require only boundedness of the elements of the edge weight matrices, which makes the concept of "time-varying topology" more practical in matrix-weighted network consensus algorithms. Furthermore, the proposed protocols operate under milder connectivity conditions and impose no requirements on structural (un)balance properties. The work in this paper demonstrates that groups with both cooperative and antagonistic inter-dimensional interactions can achieve consensus even in the presence of compound measurement noises and time-varying topologies, challenging the conventional belief that consensus is attainable only in fully cooperative settings.
Peak-Load Pricing and Investment Cost Recovery with Duration-Limited Storage
Energy storage shifts energy from off-peak periods to on-peak periods. Unlike conventional generation, storage is duration-limited: the stored energy capacity constrains the duration over which it can supply power. To understand how these constraints affect optimal pricing and investment decisions, we extend the classic two-period peak-load pricing model to include duration-limited storage. By adopting assumptions typical of solar-dominated systems, we link on- and off-peak prices to storage investment costs, round-trip efficiency, and the duration of the peak period. The bulk of the scarcity premium from on-peak prices is associated with the fixed costs of storage as opposed to variable costs stemming from round-trip efficiency losses. Unlike conventional generators, the binding duration constraints lead storage to recover energy capacity costs on a per-peak-event basis instead of amortizing these costs over total peak hours. A numerical example illustrates the implications for equilibrium prices and capacity investment.
comment: 5 pages, 1 figure. Accepted to the 2026 IEEE Power & Energy Society General Meeting (PESGM)
Physics-Informed Deep B-Spline Networks
Physics-informed machine learning offers a promising framework for solving complex partial differential equations (PDEs) by integrating observational data with governing physical laws. However, learning PDEs with varying parameters and changing initial conditions and boundary conditions (ICBCs) with theoretical guarantees remains an open challenge. In this paper, we propose physics-informed deep B-spline networks, a novel technique that approximates a family of PDEs with different parameters and ICBCs by learning B-spline control points through neural networks. The proposed B-spline representation reduces the learning task from predicting solution values over the entire domain to learning a compact set of control points, enforces strict compliance to initial and Dirichlet boundary conditions by construction, and enables analytical computation of derivatives for incorporating PDE residual losses. While existing approximation and generalization theories are not applicable in this setting - where solutions of parametrized PDE families are represented via B-spline bases - we fill this gap by showing that B-spline networks are universal approximators for such families under mild conditions. We also derive generalization error bounds for physics-informed learning in both elliptic and parabolic PDE settings, establishing new theoretical guarantees. Finally, we demonstrate in experiments that the proposed technique has improved efficiency-accuracy tradeoffs compared to existing techniques in a dynamical system problem with discontinuous ICBCs and can handle nonhomogeneous ICBCs and non-rectangular domains.
A Scalable Design Approach to Resilient Architectures for Interconnected Cyber-Physical Systems: Safety Guarantees under Multiple Attacks
Complex, interconnected cyber-physical systems (CPS) are increasingly prevalent in domains such as power systems. Cyber-resilient architectures have been proposed to recover compromised cyber components of CPS. Recent works have studied tuning the recovery times of such architectures to guarantee safety in single-system settings. Extending these designs to interconnected CPS is more challenging, since solutions must account for attacks on multiple subsystems that can occur in any order and potentially infinite possible temporal overlap. This paper aims to address the aforementioned challenge by developing a scalable framework to assign resilient architectures and to inform the tuning of their recovery times. Our approach introduces a scalar index that quantifies the impact of each subsystem on safety under compromised input. These indices aggregate linearly across subsystems, enabling scalable analysis under arbitrary attack orderings and temporal overlaps. We establish a linear inequality relating each subsystem's index and recovery time that guarantees safety and guides resilient architecture assignment. We also propose a segmentation-based approach to strengthen the previously derived conditions. We then present algorithms to compute the proposed indices and to find a cost-optimal architecture assignment with a safety guarantee. We validate the framework through a case study on temperature regulation in interconnected rooms under different attack scenarios.
On Erlang mixture approximations for differential equations with distributed time delays
In this paper, we propose a general approach for approximate simulation and analysis of delay differential equations (DDEs) with distributed time delays based on methods for ordinary differential equations (ODEs). The key innovation is that we 1) propose an Erlang mixture approximation of the kernel in the DDEs and 2) use the linear chain trick to transform the resulting approximate DDEs to ODEs. Furthermore, we prove that the approximation converges for continuous and bounded kernels and for specific choices of the coefficients if the number of terms increases sufficiently fast. We show that the approximate ODEs can be used to assess the stability of the steady states of the original DDEs and that the solution to the ODEs converges if the kernel is also exponentially bounded. Additionally, we propose an approach based on bisection and least-squares estimation for determining optimal parameter values in the approximation. Finally, we present numerical examples that demonstrate the accuracy and convergence rate obtained with the optimal parameters and the efficacy of the proposed approach for bifurcation analysis and Monte Carlo simulation. The numerical examples involve a modified logistic equation, chemotherapy-induced myelosuppression, and a point reactor kinetics model of a molten salt nuclear fission reactor.
comment: The theoretical results have been generalized and the paper has been heavily revised in response to reviewers' comments
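As a worked illustration of the linear chain trick mentioned in the abstract above, the following sketch shows how an Erlang mixture kernel turns a distributed delay into a chain of ODEs. The notation (state x, kernel \alpha, common rate a, weights c_m, order M) is assumed here and may differ from the paper's.

```latex
% Assumed notation: x(t) is the state, \alpha the delay kernel, a > 0 a common rate,
% c_m \ge 0 mixture weights with \sum_m c_m = 1.
\begin{align}
  z(t) &= \int_{-\infty}^{t} \alpha(t - s)\, x(s)\, \mathrm{d}s, &
  \alpha(s) &\approx \sum_{m=0}^{M} c_m\, \ell_m(s), &
  \ell_m(s) &= \frac{a^{m+1} s^{m}}{m!}\, e^{-a s}, \\
  z_m(t) &= \int_{-\infty}^{t} \ell_m(t - s)\, x(s)\, \mathrm{d}s, &
  \dot z_0 &= a\,(x - z_0), &
  \dot z_m &= a\,(z_{m-1} - z_m), \quad m = 1, \dots, M, \\
  z(t) &\approx \sum_{m=0}^{M} c_m\, z_m(t).
\end{align}
```

The chain follows from differentiating z_m under the integral sign, using \ell_0(0) = a, \ell_m(0) = 0 for m \ge 1, and \ell_m'(s) = a(\ell_{m-1}(s) - \ell_m(s)) with \ell_{-1} \equiv 0.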
Universal Transient Stability Analysis: A Large Language Model-Enabled Dynamics Prediction Framework
Existing dynamics prediction frameworks for transient stability analysis (TSA) fail to achieve multi-scenario "universality"--the inherent ability of a single, pre-trained architecture to generalize across diverse operating conditions, unseen faults, and heterogeneous systems. To address this, this paper proposes TSA-LLM, a large language model (LLM)-based universal framework that models multi-variate transient dynamics prediction as a univariate generative task with three key innovations: First, a novel data processing pipeline featuring channel independence decomposition to resolve dimensional heterogeneity, sample-wise normalization to eliminate separate stable or unstable pipelines, and temporal patching for efficient long-sequence modeling; Second, a parameter-efficient freeze-and-finetune strategy that augments the LLM's architecture with dedicated input embedding and output projection layers while freezing core transformer blocks to preserve generic feature extraction capabilities; Third, a two-stage fine-tuning scheme that combines teacher forcing, which feeds the model ground-truth data during initial training, with scheduled sampling, which gradually shifts to leveraging model-generated predictions, to mitigate cumulative errors in long-horizon iterative prediction. Comprehensive testing demonstrates the framework's universality, as TSA-LLM trained solely on the New England 39-bus system achieves zero-shot generalization to mixed stability conditions and unseen faults, and matches expert performance on the larger Iceland 189-bus system with only 5% fine-tuning data. This multi-scenario versatility validates a universal framework that eliminates scenario-specific retraining and achieves scalability via large-scale parameters and cross-scenario training data.
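The three data-processing steps named in the abstract above (channel independence, sample-wise normalization, temporal patching) can be sketched as follows. This is an assumption-laden illustration, not the TSA-LLM pipeline itself: the function name, the per-channel z-score, and the patch length and stride are all hypothetical choices.

```python
import numpy as np

def preprocess(series, patch_len=4, stride=4, eps=1e-8):
    """series: array of shape (T, C).
    Returns the normalized series of shape (C, T) and patches of shape (C, N, patch_len)."""
    x = np.asarray(series, dtype=float).T            # channel independence: one univariate row per variable
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    x = (x - mean) / (std + eps)                     # normalize each channel of this sample on its own
    starts = range(0, x.shape[1] - patch_len + 1, stride)
    patches = np.stack([x[:, s:s + patch_len] for s in starts], axis=1)  # temporal patching
    return x, patches

# Hypothetical sample: 12 time steps, 2 variables.
series = np.column_stack([np.arange(12.0), np.sin(np.arange(12.0))])
norm, patches = preprocess(series)
```

Each channel is then fed to the model as its own sequence of patches, which is what lets systems with different numbers of variables share one input format.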
Scalable Distributed Nonlinear Control Under Flatness-Preserving Coupling
We study distributed control for a network of nonlinear, differentially flat subsystems subject to dynamic coupling. Although differential flatness simplifies planning and control for isolated subsystems, the presence of coupling can destroy this property for the overall joint system. Focusing on subsystems in pure-feedback form, we identify a class of compatible lower-triangular dynamic couplings that preserve flatness and guarantee that the flat outputs of the subsystems remain the flat outputs of the coupled system. Further, we show that the joint flatness diffeomorphism can be constructed from those of the individual subsystems and, crucially, its sparsity structure reflects that of the coupling. Exploiting this structure, we synthesize a distributed tracking controller that computes control actions from local information only, thereby ensuring scalability. We validate our proposed framework on a simulated example of planar quadrotors dynamically coupled via aerodynamic downwash, and show that the distributed controller achieves accurate trajectory tracking.
Identifying Best Candidates for Busbar Splitting
Rising electricity demand and the growing integration of renewables are intensifying congestion in transmission grids. Grid topology optimization through busbar splitting (BuS) and optimal transmission switching can alleviate grid congestion and reduce the generation costs in a power system. However, BuS optimization requires a large number of binary variables, and analyzing all the substations for potential new topological actions is computationally intractable, particularly in large grids. To tackle this issue, we propose a set of metrics to identify and rank promising candidates for BuS, focusing on finding buses where topology optimization can reduce generation costs. To assess the effect of BuS on the identified buses, we use a combined mixed-integer convex-quadratic BuS model to compute the optimal topology and test it with the non-linear non-convex AC optimal power flow (OPF) simulation to show its AC feasibility. By testing and validating the proposed metrics on test cases of different sizes, we show that they are able to identify busbars that reduce the total generation costs when their topology is optimized. Thus, the metrics enable effective selection of busbars for BuS, with no need to test every busbar in the grid, one at a time.
Machine Learning Detection of Lithium Plating in Lithium-ion Cells: A Gaussian Process Approach
Lithium plating during fast charging is a critical degradation mechanism that accelerates capacity fade and can trigger catastrophic safety failures. Recent work has shown that plating onset can manifest in incremental-capacity analysis as an additional high-voltage feature above 4.0 V, often appearing as a secondary peak or shoulder distinct from the main intercalation peak complex; however, conventional methods for computing dQ/dV rely on finite differencing with filtering, which amplifies sensor noise and introduces bias in feature location. In this paper, we propose a Gaussian Process (GP) framework for lithium plating detection by directly modeling the charge-voltage relationship Q(V) as a stochastic process with calibrated uncertainty. Leveraging the property that derivatives of GPs remain GPs, we infer dQ/dV analytically and probabilistically from the posterior, enabling robust detection without ad hoc smoothing. The framework provides three key benefits: (i) noise-aware inference with hyperparameters learned from data, (ii) closed-form derivatives with credible intervals for uncertainty quantification, and (iii) scalability to online variants suitable for embedded BMS. Experimental validation on Li-ion coin cells across a range of C-rates (0.2C-1C) and temperatures (0-40 °C) demonstrates that the GP-based method reliably resolves distinct high-voltage secondary peak features under low-temperature, high-rate charging, while correctly reporting no features in non-plating cases. The concurrence of GP-identified differential features, reduced charge throughput, capacity fade measured via reference performance tests, and post-mortem microscopy confirmation supports the interpretation of these signatures as plating-related, establishing a practical pathway for real-time lithium plating detection.
comment: Accepted for presentation at American Control Conference 2026 - ACC 2026 to be held in New Orleans, Louisiana
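The property used in the abstract above, that the derivative of a GP is again a GP, means dQ/dV can be read off the posterior by differentiating the kernel. The sketch below computes only the posterior mean of the derivative for a zero-mean GP with a squared-exponential kernel and fixed hyperparameters. The kernel choice, hyperparameter values, and function names are assumptions; the paper learns hyperparameters from data and also reports credible intervals.

```python
import numpy as np

def rbf(x1, x2, ell, sf):
    """Squared-exponential kernel matrix between two 1-D input arrays."""
    d = x1[:, None] - x2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_derivative_mean(v_train, q_train, v_query, ell=1.0, sf=1.0, noise=1e-2):
    """Posterior mean of dQ/dV at v_query, given noisy samples q_train of Q(v_train)."""
    K = rbf(v_train, v_train, ell, sf) + noise**2 * np.eye(len(v_train))
    alpha = np.linalg.solve(K, q_train)
    d = v_query[:, None] - v_train[None, :]
    dK = -(d / ell**2) * rbf(v_query, v_train, ell, sf)   # derivative of the kernel w.r.t. the query input
    return dK @ alpha

# Synthetic check on a known function (stand-in for a Q(V) curve): d/dv sin(v) = cos(v).
v = np.linspace(0.0, 2.0 * np.pi, 40)
dq = gp_derivative_mean(v, np.sin(v), np.array([np.pi / 2, np.pi]))
```

No finite differencing appears anywhere: the smoothing comes from the kernel and the noise term rather than from a hand-tuned filter.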
Privacy-Preserving Uncertainty Disclosure for Facilitating Enhanced Energy Storage Dispatch
This paper proposes a novel privacy-preserving uncertainty disclosure framework, enabling system operators to release marginal value function bounds to reduce the conservativeness of interval forecast and mitigate excessive withholding, thereby enhancing storage dispatch and social welfare. We develop a risk-averse storage arbitrage model based on stochastic dynamic programming, explicitly accounting for uncertainty intervals in value function training. Real-time marginal value function bounds are derived using a rolling-horizon chance-constrained economic dispatch formulation. We rigorously prove that the bounds reliably cap the true opportunity cost and dynamically converge to the hindsight value. We verify that both the marginal value function and its bounds monotonically decrease with the state of charge (SoC) and increase with uncertainty, providing a theoretical basis for risk-averse strategic behaviors and SoC-dependent designs. An adjusted storage dispatch algorithm is further designed using these bounds. We validate the effectiveness of the proposed framework via an agent-based simulation on the ISO-NE test system. Under 50% renewable capacity and 35% storage capacity, the proposed bounds enhance storage response by 38.91% and reduce the optimality gap to 3.91% through improved interval predictions. Additionally, by mitigating excessive withholding, the bounds yield an average system cost reduction of 0.23% and an average storage profit increase of 13.22%. These benefits further scale with higher prediction conservativeness, storage capacity, and system uncertainty.
comment: The authors have a conflict of interest regarding this paper and have to withdraw it
Risk Aware Safe Control with Multi-Modal Sensing for Dynamic Obstacle Avoidance
Safe control in dynamic traffic environments remains a major challenge for autonomous vehicles (AVs), as ego vehicle and obstacle states are inherently affected by sensing noise and estimation uncertainty. However, existing studies have not sufficiently addressed how uncertain multi-modal sensing information can be systematically incorporated into tail-risk-aware safety-critical control. To address this gap, this paper proposes a risk-aware safe control framework that integrates probabilistic state estimation with a conditional value-at-risk (CVaR) control barrier function (CBF) safety filter. Obstacle detections from cameras, LiDAR, and vehicle-to-everything (V2X) communication are combined using a Wasserstein barycenter (WB) to obtain a probabilistic state estimate. A model predictive controller generates the nominal control, which is then filtered through a CVaR-CBF quadratic program to enforce risk-aware safety constraints. The approach is evaluated through numerical studies and further validated on a full-scale AV. Results demonstrate improved safety and robustness over a baseline MPC-CBF design, with an average improvement of 12.7% in success rate across the evaluated scenarios.
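For readers unfamiliar with the tail-risk measure named in the abstract above, the sketch below estimates CVaR from samples: it averages the worst (1 - alpha) fraction of losses, with the smallest loss in that tail serving as a value-at-risk estimate. This is a generic sample-based estimator with an assumed function name, not the paper's CVaR-CBF constraint.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.9):
    """Return (VaR, CVaR) estimates at level alpha from a 1-D array of sampled losses."""
    losses = np.sort(np.asarray(losses, dtype=float))
    k = max(1, int(np.ceil((1.0 - alpha) * losses.size)))   # number of samples in the tail
    tail = losses[-k:]
    return tail[0], tail.mean()

# Hypothetical losses 1..10: the worst 20% are 9 and 10.
var_est, cvar_est = empirical_cvar(np.arange(1.0, 11.0), alpha=0.8)
```

CVaR is never smaller than VaR at the same level, which is why a CVaR-based constraint is the more conservative of the two.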
Optimal Modified Feedback Strategies in LQ Games under Control Imperfections
Game-theoretic approaches and Nash equilibrium have been widely applied across various engineering domains. However, practical challenges such as disturbances, delays, and actuator limitations can hinder the precise execution of Nash equilibrium strategies. This work investigates the impact of such implementation imperfections on game trajectories and players' costs in the context of a two-player finite-horizon linear quadratic (LQ) nonzero-sum game. Specifically, we analyze how small deviations by one player, measured or estimated at each stage, affect the state trajectory and the other player's cost. To mitigate these effects, we construct a compensation law for the influenced player by augmenting the nominal game with the measurable deviation dynamics. The resulting policy is shown to be optimal within a causal affine policy class, and, for sufficiently small deviations, it locally outperforms the uncompensated equilibrium-derived feedback. Rigorous analysis and proofs are provided, and the effectiveness of the proposed approach is demonstrated through a representative numerical example.
comment: 8 pages, 2 figures, Manuscript accepted to ACC 2026
Risk-Budgeted Control Framework for Balanced Performance and Safety in Autonomous Vehicles
This paper presents a hybrid control framework with a risk-budgeted monitor for safety-certified autonomous driving. A sliding-window monitor tracks insufficient barrier residuals and triggers switching from a relaxed control barrier function (R-CBF) to a more conservative conditional value-at-risk CBF (CVaR-CBF) when the safety margin deteriorates. Two real-time triggers are considered: feasibility-triggered (FT), which activates CVaR-CBF when the R-CBF problem is reported infeasible, and quality-triggered (QT), which switches when the residual falls below a prescribed safety margin. The framework is evaluated with model predictive control (MPC) under vehicle localization noise and obstacle position uncertainty across multiple AV-pedestrian interaction scenarios with 1,500 Monte Carlo runs. In the most challenging case with 5 m pedestrian detection uncertainty, the proposed method achieves a 94-96% collision-free success rate over 300 trials while maintaining the lowest mean cross-track error (CTE = 3.2-3.6 m), indicating faster trajectory recovery after obstacle avoidance and a favorable balance between safety and performance.
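One way the switching logic in the abstract above could be organized is sketched below. It is a hypothetical reading, not the authors' implementation: the class name, the window length, the margin, and the rule "switch when more than `budget` residuals in the window fall below the margin" are all assumptions, and the feasibility and quality triggers are combined in one monitor for brevity.

```python
from collections import deque

class RiskBudgetMonitor:
    """Sliding-window monitor that selects between two safety filters."""

    def __init__(self, window=5, margin=0.1, budget=2):
        self.residuals = deque(maxlen=window)   # most recent barrier residuals
        self.margin = margin                    # prescribed safety margin
        self.budget = budget                    # tolerated number of low residuals per window

    def update(self, residual, feasible=True):
        """Record one residual and return the filter to use at the next step."""
        self.residuals.append(residual)
        low = sum(r < self.margin for r in self.residuals)
        if not feasible or low > self.budget:   # feasibility trigger or exhausted budget
            return "CVaR-CBF"
        return "R-CBF"

# Hypothetical residual sequence.
monitor = RiskBudgetMonitor(window=3, margin=0.1, budget=1)
modes = [monitor.update(r) for r in (0.5, 0.05, 0.02, 0.5, 0.5)]
```

Once the low residuals leave the window, the monitor returns to the relaxed filter.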
Robotics
DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation CVPR2026
Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding errors problem. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce an adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level corrective finetuning strategy. By leveraging geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs in the trusted region while filtering out the polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
comment: 16 pages, 8 figures, CVPR2026
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots
Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at https://github.com/SXDR/PanoMMOcc, along with the calibration tools released at https://github.com/losehu/CameraLiDAR-Calib.
comment: The dataset and code will be publicly released at https://github.com/SXDR/PanoMMOcc
A Feasibility-Enhanced Control Barrier Function Method for Multi-UAV Collision Avoidance
This paper presents a feasibility-enhanced control barrier function (FECBF) framework for multi-UAV collision avoidance. In dense multi-UAV scenarios, the feasibility of the CBF quadratic program (CBF-QP) can be compromised due to internal incompatibility among multiple CBF constraints. To address this issue, we analyze the internal compatibility of CBF constraints and derive a sufficient condition for internal compatibility. Based on this condition, a sign-consistency constraint is introduced to mitigate internal incompatibility. The proposed constraint is incorporated into a decentralized CBF-QP formulation using worst-case estimates and slack variables. Simulation results demonstrate that the proposed method significantly reduces infeasibility and improves collision avoidance performance compared with existing baselines in dense scenarios. Additional simulations under varying time delays demonstrate the robustness of the proposed method. Real-world experiments validate the practical applicability of the proposed method.
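As background for the CBF quadratic program discussed in the abstract above, the sketch below shows the basic safety filter for a single constraint, where the QP has a closed-form solution. It assumes single-integrator dynamics (velocity command u), one circular obstacle, and the barrier h(x) = ||x - x_obs||^2 - r^2. It does not include the sign-consistency constraint, slack variables, or multi-agent coupling of the proposed FECBF method, and the function name and parameters are hypothetical.

```python
import numpy as np

def cbf_filter(x, u_nom, x_obs, radius, gamma=1.0):
    """Minimally modify u_nom so that h_dot >= -gamma * h holds (assumes x != x_obs)."""
    x, u_nom, x_obs = (np.asarray(v, dtype=float) for v in (x, u_nom, x_obs))
    diff = x - x_obs
    h = diff @ diff - radius**2      # barrier value: positive outside the obstacle
    a = 2.0 * diff                   # gradient of h, so h_dot = a @ u
    b = -gamma * h                   # CBF condition: a @ u >= b
    violation = b - a @ u_nom
    if violation <= 0.0:             # nominal command already satisfies the condition
        return u_nom
    return u_nom + (violation / (a @ a)) * a   # projection onto the constraint boundary

# Hypothetical case: heading straight at an obstacle of radius 1 at the origin.
u_safe = cbf_filter([2.0, 0.0], [-1.0, 0.0], [0.0, 0.0], 1.0)
u_free = cbf_filter([2.0, 0.0], [1.0, 0.0], [0.0, 0.0], 1.0)
```

With several such constraints active at once, the projections can conflict and the QP may have no solution, which is the infeasibility issue the abstract addresses.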
Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences ICLR 2026
Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it remains unclear to what degree they possess the spatial reasoning required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, while GPT-4o performs worse. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.
comment: Accepted to the First Workshop on Efficient Spatial Reasoning at ICLR 2026
SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design ICRA 2026
We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. SldprtNet features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.
comment: Accepted by ICRA 2026
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing
Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at https://github.com/YNG916/InterEdit.
comment: The dataset and code will be released at https://github.com/YNG916/InterEdit
ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.
From Passive Monitoring to Active Defence: Resilient Control of Manipulators Under Cyberattacks
Cyber-physical robotic systems are vulnerable to false data injection attacks (FDIAs), in which an adversary corrupts sensor signals while evading residual-based passive anomaly detectors such as the chi-squared test. Such stealthy attacks can induce substantial end-effector deviations without triggering alarms. This paper studies the resilience of redundant manipulators to stealthy FDIAs and advances the architecture from passive monitoring to active defence. We formulate a closed-loop model comprising a feedback-linearized manipulator, a steady-state Kalman filter, and a chi-squared-based anomaly detector. Building on this passive monitoring layer, we propose an active control-level defence that attenuates the control input through a monotone function of an anomaly score generated by a novel actuation-projected, measurement-free state predictor. The proposed design provides probabilistic guarantees on nominal actuation loss and preserves closed-loop stability. From the attacker perspective, we derive a convex QCQP for computing one-step optimal stealthy attacks. Simulations on a 6-DOF planar manipulator show that the proposed defence significantly reduces attack-induced end-effector deviation while preserving nominal task performance in the absence of attacks.
Route Fragmentation Based on Resource-centric Prioritisation for Efficient Multi-Robot Path Planning in Agricultural Environments
Agricultural environments present high proportions of spatially dense navigation bottlenecks for long-term navigation and operational planning of agricultural mobile robots. The existing agent-centric multi-robot path planning (MRPP) approaches resolve conflicts from the perspective of agents, rather than from the resources under contention. Further, the density of such contentions limits the capabilities of spatial interleaving, a concept that many planners rely on to achieve high throughput. In this work, two variants of the priority-based Fragment Planner (FP) are presented as resource-centric MRPP algorithms that leverage route fragmentation to enable partial route progression and limit the impact of binary-based waiting. These approaches are evaluated in lifelong simulation over a 3.6km topological map representing a commercial polytunnel environment. Their performances are contrasted against 5 baseline algorithms with varying robotic fleet sizes. The Fragment Planners achieved significant gains in throughput compared with Prioritised Planning (PP) and Priority-Based Search (PBS) algorithms. They further demonstrated a task throughput of 95% of the optimal task throughput over the same time period. This work shows that, for long-term deployment of agricultural robots in corridor-dominant agricultural environments, resource-centric MRPP approaches are a necessity for high-efficacy operational planning.
comment: This work has been submitted to the IEEE for possible publication
Language-Grounded Decoupled Action Representation for Robotic Manipulation CVPR2026
The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives--translation, rotation, and gripper control--providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive weighting strategy, inspired by curriculum learning, dynamically balances contrastive and imitation objectives for stable and effective training. Extensive experiments on simulated benchmarks (LIBERO and MimicGen) and real-world demonstrations validate that LaDA achieves strong performance and generalizes effectively to unseen or related tasks.
comment: Accepted by CVPR2026
Efficient Real-World Autonomous Racing via Attenuated Residual Policy Optimization
Residual policy learning (RPL), in which a learned policy refines a static base policy using deep reinforcement learning (DRL), has shown strong performance across various robotic applications. Its effectiveness is particularly evident in autonomous racing, a domain that serves as a challenging benchmark for real-world DRL. However, deploying RPL-based controllers introduces system complexity and increases inference latency. We address this by introducing an extension of RPL named attenuated residual policy optimization ($α$-RPO). Unlike standard RPL, $α$-RPO yields a standalone neural policy by progressively attenuating the base policy, which initially serves to bootstrap learning. Furthermore, this mechanism enables a form of privileged learning, where the base policy is permitted to use sensor modalities not required for final deployment. We design $α$-RPO to integrate seamlessly with PPO, ensuring that the attenuated influence of the base controller is dynamically compensated during policy optimization. We evaluate $α$-RPO by building a framework for 1:10-scaled autonomous racing around it. In both simulation and zero-shot real-world transfer to Roboracer cars, $α$-RPO not only reduces system complexity but also improves driving performance compared to baselines, demonstrating its practicality for robotic deployment. Our code is available at: https://github.com/raphajaner/arpo_racing.
ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries
Vision-language-action (VLA) models for closed-loop robot control are typically cast under the Markov assumption, making them prone to errors on tasks requiring historical context. To incorporate memory, existing VLAs either retrieve from a memory bank, which can be misled by distractors, or extend the frame window, whose fixed horizon still limits long-term retention. In this paper, we introduce ReMem-VLA, a Recurrent Memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries for propagating information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries for carrying context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, implicitly guiding the model's decisions without additional training or inference cost. Furthermore, to enhance visual memory, we introduce Past Observation Prediction as an auxiliary training objective. Through extensive memory-centric simulation and real-world robot experiments, we demonstrate that ReMem-VLA exhibits strong memory capabilities across multiple dimensions, including spatial, sequential, episodic, temporal, and visual memory. ReMem-VLA significantly outperforms memory-free VLA baselines $π$0.5 and OpenVLA-OFT and surpasses MemoryVLA on memory-dependent tasks by a large margin.
comment: 14 pages, 6 figures
Coordinated Manipulation of Hybrid Deformable-Rigid Objects in Constrained Environments
Coordinated robotic manipulation of deformable linear objects (DLOs), such as ropes and cables, has been widely studied; however, handling hybrid assemblies composed of both deformable and rigid elements in constrained environments remains challenging. This work presents a quasi-static optimization-based manipulation planner that employs a strain-based Cosserat rod model, extending rigid-body formulations to hybrid deformable linear objects (hDLO). The proposed planner exploits the compliance of deformable links to maneuver through constraints while achieving task-space objectives for the object that are unreachable with rigid tools. By leveraging a differentiable model with analytically derived gradients, the method achieves up to a 33x speedup over finite-difference baselines for inverse kinetostatic (IKS) problems. Furthermore, the subsequent trajectory optimization problem, warm-started using the IKS solution, is only practically realizable via analytical derivatives. The proposed algorithm is validated in simulation on various hDLO systems and experimentally on a three-link hDLO manipulated in a constrained environment using a dual-arm robotic system. Experimental results confirm the planner's accuracy, yielding an average deformation error of approximately 3 cm (5% of the deformable link length) between the desired and measured marker positions. Finally, the proposed optimal planner is compared against a sampling-based feasibility planner adapted to the strain-based formulation. The results demonstrate the effectiveness and applicability of the proposed approach for robotic manipulation of hybrid assemblies in constrained environments.
comment: 15 pages, 10 figures
RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics
Enabling reliable long-horizon robotic manipulation is a crucial step toward open-world embodied intelligence. However, VLM-based planners treat each step as an isolated observation-to-action mapping, forcing them to reinfer scene geometry from raw pixels at every decision point while remaining unaware of how prior actions have reshaped the environment. Despite strong short-horizon performance, these systems lack the spatio-temporal reasoning required for persistent geometric anchoring and memory of action-triggered state transitions. Without persistent state tracking, perceptual errors accumulate across the execution horizon, temporarily occluded objects are catastrophically forgotten, and these compounding failures lead to precondition violations that cascade through subsequent steps. In contrast, humans maintain a persistent mental model that continuously tracks spatial relations and action consequences across interactions rather than reconstructing them at each instant. Inspired by this human capacity for causal spatio-temporal reasoning with persistent memory, we propose RoboStream, a training-free framework that achieves geometric anchoring through Spatio-Temporal Fusion Tokens (STF-Tokens), which bind visual evidence to 3D geometric attributes for persistent object grounding, and maintains causal continuity via a Causal Spatio-Temporal Graph (CSTG) that records action-triggered state transitions across steps. This design enables the planner to trace causal chains and preserve object permanence under occlusion without additional training or fine-tuning. RoboStream achieves 90.5% on long-horizon RLBench and 44.4% on challenging real-world block-building tasks, where both SoFar and VoxPoser score 11.1%, demonstrating that spatio-temporal reasoning and causal memory are critical missing components for reliable long-horizon manipulation.
MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins
Converting static 3D meshes into interactable articulated assets is crucial for embodied AI and robotic simulation. However, existing zero-shot pipelines struggle with complex assets due to a critical lack of physical grounding. Specifically, ungrounded Vision-Language Models (VLMs) frequently suffer from kinematic hallucinations, while unconstrained joint estimation inevitably leads to catastrophic mesh inter-penetration during physical simulation. To bridge this gap, we propose MotionAnymesh, an automated zero-shot framework that seamlessly transforms unstructured static meshes into simulation-ready digital twins. Our method features a kinematic-aware part segmentation module that grounds VLM reasoning with explicit SP4D physical priors, effectively eradicating kinematic hallucinations. Furthermore, we introduce a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to rigorously guarantee collision-free articulation. Extensive experiments demonstrate that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.
comment: 5 figures
GoalSwarm: Multi-UAV Semantic Coordination for Open-Vocabulary Object Navigation
Cooperative visual semantic navigation is a foundational capability for aerial robot teams operating in unknown environments. However, achieving robust open-vocabulary object-goal navigation remains challenging due to the computational constraints of deploying heavy perception models onboard and the complexity of decentralized multi-agent coordination. We present GoalSwarm, a fully decentralized multi-UAV framework for zero-shot semantic object-goal navigation. Each UAV collaboratively constructs a shared, lightweight 2D top-down semantic occupancy map by projecting depth observations from aerial vantage points, eliminating the computational burden of full 3D representations while preserving essential geometric and semantic structure. The core contributions of GoalSwarm are threefold: (1) integration of a zero-shot foundation model, SAM3, for open-vocabulary detection and pixel-level segmentation, enabling open-vocabulary target identification without task-specific training; (2) a Bayesian Value Map that fuses multi-viewpoint detection confidences into a per-pixel goal-relevance distribution, enabling informed frontier scoring via Upper Confidence Bound (UCB) exploration; and (3) a decentralized coordination strategy combining semantic frontier extraction, cost-utility bidding with geodesic path costs, and spatial separation penalties to minimize redundant exploration across the swarm.
comment: 6 pages, 2 figures
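To make the Upper Confidence Bound (UCB) frontier scoring mentioned in the GoalSwarm abstract concrete, here is a minimal sketch of the generic UCB rule: score each candidate by its estimated relevance plus a bonus proportional to its uncertainty, then pick the highest score. The frontier names, the (mean, standard deviation) values, and the `beta` exploration weight are hypothetical; the abstract does not specify how GoalSwarm computes these quantities.

```python
def ucb_score(mean, std, beta=1.0):
    """Generic UCB: estimated value plus an uncertainty bonus weighted by beta."""
    return mean + beta * std


def select_frontier(frontiers, beta=1.0):
    """Return the name of the frontier with the highest UCB score.

    `frontiers` maps a frontier name to a (mean, std) pair describing the
    estimated goal relevance and its uncertainty (illustrative inputs).
    """
    return max(frontiers, key=lambda name: ucb_score(*frontiers[name], beta=beta))


# Hypothetical candidates: A is well explored, B is uncertain but promising.
candidates = {"A": (0.6, 0.05), "B": (0.5, 0.3)}

print(select_frontier(candidates, beta=1.0))  # exploration bonus favours B
print(select_frontier(candidates, beta=0.0))  # pure exploitation favours A
```

Setting `beta` to zero reduces the rule to greedy selection on the mean, which shows how the bonus term trades exploitation against exploration.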
Consistent and Efficient MSCKF-based LiDAR-Inertial Odometry with Inferred Cluster-to-Plane Constraints for UAVs
Robust and accurate navigation is critical for Unmanned Aerial Vehicles (UAVs), especially for those with stringent Size, Weight, and Power (SWaP) constraints. However, most state-of-the-art (SOTA) LiDAR-Inertial Odometry (LIO) systems still suffer from estimation inconsistency and computational bottlenecks when deployed on such platforms. To address these issues, this paper proposes a consistent and efficient tightly-coupled LIO framework tailored for UAVs. Within the efficient Multi-State Constraint Kalman Filter (MSCKF) framework, we build coplanar constraints inferred from planar features observed across a sliding window. By applying null-space projection to sliding-window coplanar constraints, we eliminate the direct dependency on feature parameters in the state vector, thereby mitigating overconfidence and improving consistency. More importantly, to further boost efficiency, we introduce a parallel voxel-based data association and a novel compact cluster-to-plane measurement model. This compact measurement model losslessly reduces observation dimensionality and significantly accelerates the update process. Extensive evaluations demonstrate that our method outperforms most SOTA approaches by providing a superior balance of consistency and efficiency. It exhibits improved robustness in degenerate scenarios, achieves the lowest memory usage via its map-free nature, and runs in real-time on resource-constrained embedded platforms (e.g., NVIDIA Jetson TX2).
Beyond Imitation: Reinforcement Learning Fine-Tuning for Adaptive Diffusion Navigation Policies
Diffusion-based robot navigation policies trained on large-scale imitation learning datasets can generate multi-modal trajectories directly from the robot's visual observations, bypassing the traditional localization-mapping-planning pipeline and achieving strong zero-shot generalization. However, their performance remains constrained by the coverage of offline datasets, and when deployed in unseen settings, distribution shift often leads to accumulated trajectory errors and safety-critical failures. Adapting diffusion policies with reinforcement learning is challenging because their iterative denoising structure hinders effective gradient backpropagation, while also making the training of an additional value network computationally expensive and less stable. To address these issues, we propose a reinforcement learning fine-tuning framework tailored for diffusion-based navigation. The method leverages the inherent multi-trajectory sampling mechanism of diffusion models and adopts Group Relative Policy Optimization (GRPO), which estimates relative advantages across sampled trajectories without requiring a separate value network. To preserve pretrained representations while enabling adaptation, we freeze the visual encoder and selectively update the higher decoder layers and action head, enhancing safety-aware behaviors through online environmental feedback. On the PointGoal task in Isaac Sim, our approach improves the Success Rate from 52.0% to 58.7% and SPL from 0.49 to 0.54 on unseen scenes, while reducing collision frequency. Additional experiments show that the fine-tuned policy transfers zero-shot to a real quadruped platform and maintains stable performance in geometrically out-of-distribution environments, suggesting improved adaptability and safe generalization to new domains.
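The group-relative advantage estimate that lets GRPO avoid a separate value network can be sketched as follows: rewards of trajectories sampled for the same observation are normalized by the group mean and standard deviation. This is the commonly used form of the estimator, not necessarily the exact variant in the paper; the function name, the `eps` constant, and the reward values are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group.

    The group itself acts as the baseline, so no learned value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four trajectories sampled from one observation.
rewards = [1.0, 0.0, 0.5, -0.5]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:+.2f}  advantage={a:+.3f}")
```

Trajectories that score above the group average receive positive advantages and are reinforced; if all sampled trajectories earn the same reward, every advantage is zero and the group contributes no gradient signal.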
AoI-FusionNet: Age-Aware Tightly Coupled Fusion of UWB-IMU under Sparse Ranging Conditions
Accurate motion tracking of snow particles in avalanche events requires robust localization in global navigation satellite system (GNSS)-denied outdoor environments. This paper introduces AoI-FusionNet, a tightly coupled deep learning-based fusion framework that directly combines raw ultra-wideband (UWB) time-of-flight (ToF) measurements with inertial measurement unit (IMU) data for 3D trajectory estimation. Unlike loosely coupled pipelines based on intermediate trilateration, the proposed approach operates directly on heterogeneous sensor inputs, enabling localization even under insufficient ranging availability. The framework integrates an Age-of-Information (AoI)-aware decay module to reduce the influence of stale UWB ranging measurements and a learned attention gating mechanism that adaptively balances the contribution of UWB and IMU modalities based on measurement availability and temporal freshness. To evaluate robustness under limited data and measurement variability, we apply a diffusion-based residual augmentation strategy during training, producing an augmented variant termed AoI-FusionNet-DGAN. We assess the performance of the proposed model using offline post-processing of real-world measurement data collected in an alpine environment and benchmark it against UWB multilateration and loosely coupled fusion baselines. The results demonstrate that AoI-FusionNet substantially reduces mean and tail localization errors under intermittent and degraded sensing conditions.
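One simple way to picture an age-aware decay on stale ranging measurements is an exponential weight that shrinks as a measurement gets older. The sketch below is only an illustration of that idea: the exponential form, the time constant `tau`, and the function names are assumptions, and the abstract does not state how the AoI-aware decay module in AoI-FusionNet is actually parameterized.

```python
import math


def aoi_weight(age_s, tau_s=0.5):
    """Weight in (0, 1] that decays exponentially with measurement age (assumed form)."""
    return math.exp(-age_s / tau_s)


def weight_ranges(ranges_m, ages_s, tau_s=0.5):
    """Pair each UWB range with its freshness weight so stale values count less."""
    return [(r, aoi_weight(a, tau_s)) for r, a in zip(ranges_m, ages_s)]


# Hypothetical ranges from three anchors; the third measurement is stale.
ranges_m = [12.4, 8.9, 15.2]
ages_s = [0.0, 0.1, 2.0]

for rng, w in weight_ranges(ranges_m, ages_s):
    print(f"range={rng:5.1f} m  weight={w:.3f}")
```

A fresh measurement keeps a weight of 1, while one much older than `tau_s` contributes almost nothing, which is the qualitative behaviour the abstract attributes to the decay module.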
SmoothTurn: Learning to Turn Smoothly for Agile Navigation with Quadrupedal Robots
Quadrupedal robots show great potential for valuable real-world applications such as fire rescue and industrial inspection. Such applications are often urgent and require agile navigation, which in turn demands the capability to change directions smoothly while running at high speed. Existing approaches for agile navigation typically learn a single-goal reaching policy by encouraging the robot to stay at the target position after reaching it. As a result, when the policy is used to reach sequential goals that require changing directions, it cannot anticipate upcoming maneuvers or maintain momentum across the switch of goals, thereby preventing the robot from fully exploiting its agility potential. In this work, we formulate the task as sequential local navigation, extending the single-goal-conditioned local navigation formulation in prior work. We then introduce SmoothTurn, a learning-based control framework that learns to turn smoothly while running rapidly for agile sequential local navigation. The framework adopts a novel sequential goal-reaching reward, an expanded observation space with a lookahead window for future goals, and an automatic goal curriculum that progressively expands the difficulty of sampled goal sequences based on the goal-reaching performance. The trained policy can be directly deployed on real quadrupedal robots with onboard sensors and computation. Both simulation and real-world empirical results show that SmoothTurn learns an agile locomotion policy that performs smooth turning across goals, with emergent behaviors such as controlling momentum when switching goals, facing towards the future goal in advance, and planning efficient paths. We have provided video demos of the learned motions in the supplementary materials. The source code and trained policies will be made available upon acceptance.
Reinforcement Learning for Elliptical Cylinder Motion Control Tasks
Control of devices with limited input has long attracted research attention because of its difficulty and non-trivial solutions. For instance, the inverted pendulum is a benchmark problem in control theory and machine learning. In this work, we focus on the elliptical cylinder and its motion under limited torque. The problem is inspired by untethered magnetic devices, which, owing to distance, must operate with limited input torque. The main goal is to define the control problem of an elliptical cylinder with limited input torque and solve it with reinforcement learning. As a classical baseline, we evaluate a two-stage controller composed of an energy-shaping swing-up law and a local Linear Quadratic Regulator (LQR) stabilizer around the target equilibrium. The swing-up controller increases the system's mechanical energy to drive the state toward a neighborhood of the desired equilibrium; a linearization of the nonlinear model then yields an LQR that regulates the angle and angular-rate states to the target orientation with bounded input. This swing-up + LQR policy is a strong, interpretable reference for underactuated systems and serves as a point of comparison to the learned policy under identical limits and parameters. The results show that learning is possible; however, cases such as stabilization in the upward position or rotation by half a turn become very difficult as mass increases or for ellipses with a strongly unequal perimeter ratio.
FLUX: Accelerating Cross-Embodiment Generative Navigation Policies via Rectified Flow and Static-to-Dynamic Learning
Autonomous navigation requires a broad spectrum of skills, from static goal-reaching to dynamic social traversal, yet evaluation remains fragmented across disparate protocols. We introduce DynBench, a dynamic navigation benchmark featuring physically valid crowd simulation. Combined with existing static protocols, it supports comprehensive evaluation across six fundamental navigation tasks. Within this framework, we propose FLUX, the first flow-based unified navigation policy. By linearizing probability flow, FLUX replaces iterative denoising with straight-line trajectories, improving per-step inference efficiency by 47% over prior flow-based methods and 29% over diffusion-based ones. Following a static-to-dynamic curriculum, FLUX initially establishes geometric priors and is subsequently refined through reinforcement learning in dynamic social environments. This regime not only strengthens socially-aware navigation but also enhances static task robustness by capturing recovery behaviors through stochastic action distributions. FLUX achieves state-of-the-art performance across all tasks and demonstrates zero-shot sim-to-real transfer on wheeled, quadrupedal, and humanoid platforms without any fine-tuning.
comment: Project page: https://zeying-gong.github.io/projects/flux/
Motion-Specific Battery Health Assessment for Quadrotors Using High-Fidelity Battery Models ICRA
Quadrotor endurance is ultimately limited by battery behavior, yet most energy-aware planning treats the battery as a simple energy reservoir and overlooks how flight motions induce dynamic current loads that accelerate battery degradation. This work presents an end-to-end framework for motion-aware battery health assessment in quadrotors. We first design a wide-range current sensing module to capture motion-specific current profiles during real flights, preserving transient features. In parallel, a high-fidelity battery model is calibrated using reference performance tests and a metaheuristic based on a degradation-coupled electrochemical model. By simulating measured flight loads in the calibrated model, we systematically resolve how different flight motions translate into degradation modes (loss of lithium inventory and loss of active material) as well as internal side reactions. The results demonstrate that even when two flight profiles consume the same average energy, their transient load structures can drive different degradation pathways, emphasizing the need for motion-aware battery management that balances efficiency with battery degradation.
comment: 8 pages. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
PVI: Plug-in Visual Injection for Vision-Language-Action Models
VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.
Easy-IIL: Reducing Human Operational Burden in Interactive Imitation Learning via Assistant Experts
Interactive Imitation Learning (IIL) typically relies on extensive human involvement for both offline demonstration and online interaction. Prior work primarily focuses on reducing human effort in passive monitoring rather than active operation. Interestingly, structured model-based imitation approaches achieve comparable performance with significantly fewer demonstrations than end-to-end imitation learning policies in the low-data regime. However, these methods are typically surpassed by end-to-end policies as the data increases. Leveraging this insight, we propose Easy-IIL, a framework that utilizes off-the-shelf model-based imitation methods as an assistant expert to replace active human operation for the majority of data collection. The human expert only provides a single demonstration to initialize the assistant expert and intervenes in critical states where the task is approaching failure. Furthermore, Easy-IIL can maintain IIL performance by preserving both offline and online data quality. Extensive simulation and real-world experiments demonstrate that Easy-IIL significantly reduces human operational burden while maintaining performance comparable to mainstream IIL baselines. User studies further confirm that Easy-IIL reduces subjective workload on the human expert. Project page: https://sites.google.com/view/easy-iil
Show, Don't Tell: Detecting Novel Objects by Watching Human Videos
How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, "Show, Don't Tell," we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our "Show, Don't Tell" paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.
Conflict Mitigation in Shared Environments using Flow-Aware Multi-Agent Path Finding ICRA 2026
Deploying multi-robot systems in environments shared with dynamic and uncontrollable agents presents significant challenges, especially for large robot fleets. In such environments, individual robot operations can be delayed due to unforeseen conflicts with uncontrollable agents. While existing research primarily focuses on preserving the completeness of Multi-Agent Path Finding (MAPF) solutions considering delays, there is limited emphasis on utilizing additional environmental information to enhance solution quality in the presence of other dynamic agents. To this end, we propose Flow-Aware Multi-Agent Path Finding (FA-MAPF), a novel framework that integrates learned motion patterns of uncontrollable agents into centralized MAPF algorithms. Our evaluation, conducted on a diverse set of benchmark maps with simulated uncontrollable agents and on a real-world map with recorded human trajectories, demonstrates the effectiveness of FA-MAPF compared to state-of-the-art baselines. The experimental results show that FA-MAPF can consistently reduce conflicts with uncontrollable agents by up to 55% without compromising task efficiency.
comment: To be presented at ICRA 2026
AnchorVLA4D: an Anchor-Based Spatial-Temporal Vision-Language-Action Model for Robotic Manipulation
Since current Vision-Language-Action (VLA) systems suffer from limited spatial perception and the absence of memory throughout manipulation, we investigate visual anchors as a means to enhance spatial and temporal reasoning within VLA policies for robotic manipulation. Conventional VLAs generate actions by conditioning on a single current frame together with a language instruction. However, since the frame is encoded as a 2D image, it does not contain detailed spatial information, and the VLA similarly lacks any means to incorporate past context. As a result, it frequently forgets objects under occlusion and becomes spatially disoriented during the manipulation process. Thus, we propose AnchorVLA4D, a simple spatial-temporal VLA that augments the visual input with an anchor image to preserve the initial scene context throughout execution, and adds a lightweight spatial encoder that jointly processes the anchor and current frames to expose geometric relationships within an episode. Built on a Qwen2.5-VL backbone with a diffusion-based action head, AnchorVLA4D requires no additional sensing modalities (e.g., depth or point clouds) and introduces negligible inference overhead. Combining anchoring with a frozen pretrained spatial encoder yields further gains, realizing a 13.6% improvement on the Simpler WidowX benchmark and confirming the approach on real-world tasks, where it achieved an average success rate of 80%.
Altered Thoughts, Altered Actions: Probing Chain-of-Thought Vulnerabilities in VLA Robotic Manipulation
Recent Vision-Language-Action (VLA) models increasingly adopt chain-of-thought (CoT) reasoning, generating a natural-language plan before decoding motor commands. This internal text channel between the reasoning module and the action decoder has received no adversarial scrutiny. We ask: which properties of this intermediate plan does the action decoder actually rely on, and can targeted corruption of the reasoning trace alone -- with all inputs left intact -- degrade a robot's physical task performance? We design a taxonomy of seven text corruptions organized into three attacker tiers (blind noise, mechanical-semantic, and LLM-adaptive) and apply them to a state-of-the-art reasoning VLA across 40 LIBERO tabletop manipulation tasks. Our results reveal a striking asymmetry: substituting object names in the reasoning trace reduces overall success rate by 8.3 percentage points (pp) -- reaching $-$19.3 pp on goal-conditioned tasks and $-$45 pp on individual tasks -- whereas sentence reordering, spatial-direction reversal, token noise, and even a 70B-parameter LLM crafting plausible-but-wrong plans all have negligible impact (within $\pm$4 pp). This asymmetry indicates that the action decoder depends on entity-reference integrity rather than reasoning quality or sequential structure. Notably, a sophisticated LLM-based attacker underperforms simple mechanical object-name substitution, because preserving plausibility inadvertently retains the entity-grounding structure the decoder needs. A cross-architecture control using a non-reasoning VLA confirms the vulnerability is exclusive to reasoning-augmented models, while instruction-level attacks degrade both architectures -- establishing that the internal reasoning trace is a distinct and stealthy threat vector invisible to input-validation defenses.
HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation
Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.
Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data
Human athletes demonstrate versatile and highly-dynamic tennis skills to successfully conduct competitive rallies with a high-speed tennis ball. However, reproducing such behaviors on humanoid robots is difficult, partially due to the lack of perfect humanoid action data or human kinematic motion data in tennis scenarios as reference. In this work, we propose LATENT, a system that Learns Athletic humanoid TEnnis skills from imperfect human motioN daTa. The imperfect human motion data consist only of motion fragments that capture the primitive skills used when playing tennis rather than precise and complete human-tennis motion sequences from real-world tennis matches, thereby significantly reducing the difficulty of data collection. Our key insight is that, despite being imperfect, such quasi-realistic data still provide priors about human primitive skills in tennis scenarios. With further correction and composition, we learn a humanoid policy that can consistently strike incoming balls under a wide range of conditions and return them to target locations, while preserving natural motion styles. We also propose a series of designs for robust sim-to-real transfer and deploy our policy on the Unitree G1 humanoid robot. Our method achieves surprising results in the real world and can stably sustain multi-shot rallies with human players. Project page: https://zzk273.github.io/LATENT/
TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation
Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving the success rate by an average of 20% in disassembly and 60% in in-box picking, and achieving a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
comment: 9 pages, 7 figures
Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization
Semantic place categorization, which is one of the essential tasks for autonomous robots and vehicles, allows them to have capabilities of self-decision and navigation in unfamiliar environments. In particular, outdoor places are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method of categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as the inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO) comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor/outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach with the MPO dataset. Our results on the MPO dataset outperform traditional approaches and show the effectiveness of using both depth and reflectance modalities. To analyze our trained deep networks, we visualize the learned features.
comment: Published in Advanced Robotics on 31 Jul 2018
Autonomous Integration and Improvement of Robotic Assembly using Skill Graph Representations
Robotic assembly systems traditionally require substantial manual engineering effort to integrate new tasks, adapt to new environments, and improve performance over time. This paper presents a framework for autonomous integration and continuous improvement of robotic assembly systems based on Skill Graph representations. A Skill Graph organizes robot capabilities as verb-based skills, explicitly linking semantic descriptions (verbs and nouns) with executable policies, pre-conditions, post-conditions, and evaluators. We show how Skill Graphs enable rapid system integration by supporting semantic-level planning over skills, while simultaneously grounding execution through well-defined interfaces to robot controllers and perception modules. After initial deployment, the same Skill Graph structure supports systematic data collection and closed-loop performance improvement, enabling iterative refinement of skills and their composition. We demonstrate how this approach unifies system configuration, execution, evaluation, and learning within a single representation, providing a scalable pathway toward adaptive and reusable robotic assembly systems. The code is at https://github.com/intelligent-control-lab/AIDF.
CarPLAN: Context-Adaptive and Robust Planning with Dynamic Scene Awareness for Autonomous Driving
Imitation learning (IL) is widely used for motion planning in autonomous driving due to its data efficiency and access to real-world driving data. For safe and robust real-world driving, IL-based planning requires capturing the complex driving contexts inherent in real-world data and enabling context-adaptive decision-making, rather than relying solely on expert trajectory imitation. In this paper, we propose CarPLAN, a novel IL-based motion planning framework that explicitly enhances driving context understanding and enables adaptive planning across diverse traffic scenarios. Our contributions are twofold: We introduce Displacement-Aware Predictive Encoding (DPE) to improve the model's spatial awareness by predicting future displacement vectors between the Autonomous Vehicle (AV) and surrounding scene elements. This allows the planner to account for relational spacing when generating trajectories. In addition to the standard imitation loss, we incorporate an augmented loss term that captures displacement prediction errors, ensuring planning decisions consider relative distances from other agents. To improve the model's ability to handle diverse driving contexts, we propose Context-Adaptive Multi-Expert Decoder (CMD), which leverages the Mixture of Experts (MoE) framework. CMD dynamically selects the most suitable expert decoders based on scene structure at each Transformer layer, enabling adaptive and context-aware planning in dynamic environments. We evaluate CarPLAN on the nuPlan benchmark and demonstrate state-of-the-art performance across all closed-loop simulation metrics. In particular, CarPLAN exhibits robust performance on challenging scenarios such as Test14-Hard, validating its effectiveness in complex driving conditions. Additional experiments on the Waymax benchmark further demonstrate its generalization capability across different benchmark settings.
comment: 10 pages, 6 figures. Under review at IEEE Transactions on Intelligent Transportation Systems
Early Pruning for Public Transport Routing
Routing algorithms for public transport, particularly the widely used RAPTOR and its variants, often face performance bottlenecks during the transfer relaxation phase, especially on dense transfer graphs, when supporting unlimited transfers. This inefficiency arises from iterating over many potential inter-stop connections (walks, bikes, e-scooters, etc.). To maintain acceptable performance, practitioners often limit transfer distances or exclude certain transfer options, which can reduce path optimality and restrict the multimodal options presented to travellers. This paper introduces Early Pruning, a low-overhead technique that accelerates routing algorithms without compromising optimality. By pre-sorting transfer connections by duration and applying a pruning rule within the transfer loop, the method discards longer transfers at a stop once they cannot yield an earlier arrival than the current best solution. Early Pruning can be integrated with minimal changes to existing codebases and requires only a one-time preprocessing step. Across multiple state-of-the-art RAPTOR-based solutions, including RAPTOR, ULTRA-RAPTOR, McRAPTOR, BM-RAPTOR, ULTRA-McRAPTOR, and UBM-RAPTOR, tested on the Switzerland and London transit networks, we achieved query time reductions of up to 57%. This approach provides a generalizable improvement to the efficiency of transit pathfinding algorithms. Beyond algorithmic performance, Early Pruning has practical implications for transport planning. By reducing computational costs, it enables transit agencies to expand transfer radii and incorporate additional mobility modes into journey planners without requiring extra server infrastructure. This is particularly relevant for passengers in areas with sparse direct transit coverage, such as outer suburbs and smaller towns, where richer multimodal routing can reveal viable alternatives to private car use.
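The pruning rule described in the abstract above can be illustrated with a minimal sketch. It is one reading of the abstract, not the authors' implementation: the function names, the `(destination, duration)` tuple layout, and the use of the target stop's arrival time as the "current best solution" are all assumptions made for illustration.

```python
from math import inf

def presort_transfers(transfers):
    """One-time preprocessing: sort each stop's outgoing transfers by duration."""
    return {stop: sorted(edges, key=lambda e: e[1]) for stop, edges in transfers.items()}

def relax_transfers(arrival, marked_stops, sorted_transfers, target):
    """Relax transfers from the marked stops with early pruning.

    arrival maps stop -> best known arrival time (hypothetical layout).
    Returns (set of improved stops, number of transfers actually relaxed).
    """
    # Snapshot the start times so transfers are not chained within one phase.
    start = {stop: arrival[stop] for stop in marked_stops}
    improved = set()
    scanned = 0
    for stop in marked_stops:
        for dest, duration in sorted_transfers.get(stop, []):
            candidate = start[stop] + duration
            # Transfers are sorted by duration, so once one cannot beat the
            # best arrival at the target, no later (longer) one can either.
            if candidate >= arrival.get(target, inf):
                break
            scanned += 1
            if candidate < arrival.get(dest, inf):
                arrival[dest] = candidate
                improved.add(dest)
    return improved, scanned

# Toy example: stop A has three transfers; the target T is reachable in 120 s.
transfers = {"A": [("B", 300), ("T", 120), ("C", 900)]}
arrival = {"A": 1000, "T": 1500}
improved, scanned = relax_transfers(arrival, ["A"], presort_transfers(transfers), "T")
```

In the toy example, the shortest transfer improves the arrival at `T`, after which the two longer transfers are skipped without being relaxed. Skipping them is safe here because durations are non-negative, so a stop reached later than the target's best arrival cannot lead to an earlier arrival at the target.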
Skill-informed Data-driven Haptic Nudges for High-dimensional Human Motor Learning
In this work, we propose a data-driven skill-informed framework to design optimal haptic nudge feedback for high-dimensional novel motor learning tasks. We first model the stochastic dynamics of human motor learning using an Input-Output Hidden Markov Model (IOHMM), which explicitly decouples latent skill evolution from observable kinematic emissions. Leveraging this predictive model, we formulate the haptic nudge feedback design problem as a Partially Observable Markov Decision Process (POMDP). This allows us to derive an optimal nudging policy that minimizes long-term performance cost, implicitly guiding the learner toward robust regions of the skill space. We validated our approach through a human-subject study ($N=30$) using a high-dimensional hand-exoskeleton task. Results demonstrate that participants trained with the POMDP-derived policy exhibited significantly accelerated task performance compared to groups receiving heuristic-based feedback or no feedback. Furthermore, synergy analysis revealed that the POMDP group discovered efficient low-dimensional motor representations more rapidly.
From Woofs to Words: Towards Intelligent Robotic Guide Dogs with Verbal Communication AAAI 2026
Assistive robotics is an important subarea of robotics that focuses on the well-being of people with disabilities. A robotic guide dog is an assistive quadruped robot that helps visually impaired people in obstacle avoidance and navigation. Enabling language capabilities for robotic guide dogs goes beyond naively adding an existing dialog system onto a mobile robot. The novel challenges include grounding language in the dynamically changing environment and improving spatial awareness for the human handler. To address those challenges, we develop a novel dialog system for robotic guide dogs that uses LLMs to verbalize both navigational plans and scenes. The goal is to enable verbal communication for collaborative decision-making within the handler-robot team. In experiments, we conducted a human study to evaluate different verbalization strategies and a simulation study to assess the efficiency and accuracy in navigation tasks.
comment: 10 pages, 6 figures, AAAI 2026
Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation
Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.
PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization
Recent progress in text-conditioned human motion generation has been largely driven by diffusion models trained on large-scale human motion data. Building on this progress, recent methods attempt to transfer such models for character animation and real robot control by applying a Whole-Body Controller (WBC) that converts diffusion-generated motions into executable trajectories. While WBC trajectories become compliant with physics, they may exhibit substantial deviations from the original motion. To address this issue, we propose PhysMoDPO, a Direct Preference Optimization framework. Unlike prior work that relies on hand-crafted physics-aware heuristics such as foot-sliding penalties, we integrate WBC into our training pipeline and optimize the diffusion model such that the output of WBC becomes compliant with both physics and the original text instructions. To train PhysMoDPO, we deploy physics-based and task-specific rewards and use them to assign preference to synthesized trajectories. Our extensive experiments on text-to-motion and spatial control tasks demonstrate consistent improvements of PhysMoDPO in both physical realism and task-related metrics on simulated robots. Moreover, we demonstrate that PhysMoDPO results in significant improvements when applied to zero-shot motion transfer in simulation and for real-world deployment on a G1 humanoid robot.
Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison
Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of informative metrics: from discrete partial credit task progress to continuous measures of episodic reward or trajectory smoothness, spanning both parametric and nonparametric comparison problems. Through extensive validation on simulated and real-world evaluation data, we demonstrate up to 70% reduction in evaluation burden compared to standard batch methods and up to 50% reduction compared to state-of-the-art sequential procedures designed for binary outcomes, with no loss of statistical rigor. Notably, our empirical results show that competing policies can be separated more quickly when using fine-grained task progress than binary success metrics.
comment: 12 + 9 pages, 2 + 5 figures
Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis
To serve as a scalable data source for embodied AI, world models should act as true simulators that infer interaction dynamics strictly from user actions, rather than mere conditional video generators relying on privileged future object states. In this context, egocentric Human-Object Interaction (HOI) world models are critical for predicting physically grounded first-person rollouts. However, building such models is profoundly challenging due to rapid head motions, severe occlusions, and high-DoF hand articulations that abruptly alter contact topologies. Consequently, existing approaches often circumvent these physics challenges by resorting to conditional video generation with access to known future object trajectories. We introduce EgoHOI, an egocentric HOI world model that breaks away from this shortcut to simulate photorealistic, contact-consistent interactions from action signals alone. To ensure physical accuracy without future-state inputs, EgoHOI distills geometric and kinematic priors from 3D estimates into physics-informed embeddings. These embeddings regularize the egocentric rollouts toward physically valid dynamics. Experiments on the HOT3D dataset demonstrate consistent gains over strong baselines, and ablations validate the effectiveness of our physics-informed design.
Sonar-MASt3R: Real-Time Opti-Acoustic Fusion in Turbid, Unstructured Environments ICRA 2026
Underwater intervention is an important capability in several marine domains, with numerous industrial, scientific, and defense applications. However, existing perception systems used during intervention operations rely on data from optical cameras, which limits capabilities in poor visibility or lighting conditions. Prior work has examined opti-acoustic fusion methods, which use sonar data to resolve the depth ambiguity of the camera data while using camera data to resolve the elevation angle ambiguity of the sonar data. However, existing methods cannot achieve dense 3D reconstructions in real-time, and few studies have reported results from applying these methods in a turbid environment. In this work, we propose the opti-acoustic fusion method Sonar-MASt3R, which uses MASt3R to extract dense correspondences from optical camera data in real-time and pairs it with geometric cues from an acoustic 3D reconstruction to ensure robustness in turbid conditions. Experimental results using data recorded from an opti-acoustic eye-in-hand configuration across turbidity values ranging from <0.5 to >12 NTU highlight this method's improved robustness to turbidity relative to baseline methods.
comment: This paper has been accepted for publication in ICRA 2026. Copyright IEEE
Creating manufacturable blueprints for coarse-grained virtual robots
Over the past three decades, countless embodied yet virtual agents have freely evolved inside computer simulations, but vanishingly few were realized as physical robots. This is because evolution was conducted at a level of abstraction that was convenient for freeform body generation (creation, mutation, recombination) but swept away almost all of the physical details of functional body parts. The resulting designs were crude and underdetermined, requiring considerable effort and expertise to convert into a manufacturable format. Here, we automate this mapping from simplified design spaces that are readily evolvable to complete blueprints that can be directly followed by a builder. The pipeline incrementally resolves manufacturing constraints by embedding the structural and functional semantics of motors, electronics, batteries, and wiring into the abstract virtual design. In lieu of evolution, a user-defined or AI-generated "sketch" of a body plan can also be fed as input to the pipeline, providing a versatile framework for accelerating the design of novel robots.
End-to-End O-RAN Testbed for Edge-AI-Enabled 5G/6G Connected Industrial Robotics
Connected robotics is one of the principal use cases driving the transition towards more intelligent and capable 6G mobile cellular networks. Replacing wired connections with highly reliable, high-throughput, and low-latency 5G/6G radio interfaces enables robotic system mobility and the offloading of compute-intensive artificial intelligence (AI) models for robotic perception and control to servers located at the network edge. The transition towards Edge AI as a Service (E-AIaaS) simplifies on-site maintenance of robotic systems and reduces operational costs in industrial environments, while supporting flexible AI model life-cycle management and seamless upgrades of robotic functionalities over time. In this paper, we present a 5G/6G O-RAN-based end-to-end testbed that integrates E-AIaaS for connected industrial robotic applications. The objective is to design and deploy a generic experimental platform based on open technologies and interfaces, demonstrated through an E-AIaaS-enabled autonomous welding scenario. Within this scenario, the testbed is used to investigate trade-offs among different data acquisition, edge processing, and real-time streaming approaches for robotic perception, while supporting emerging paradigms such as semantic and goal-oriented communications.
comment: Submitted to Global 6G Conference 2026
Fabric Pneumatic Artificial Muscle-Based Head-Neck Exosuit: Design, Modeling, and Evaluation
Wearable exosuits assist human movement in tasks ranging from rehabilitation to daily activities; specifically, head-neck support is necessary for patients with certain neurological disorders. Rigid-link exoskeletons have been shown to enable head-neck mobility compared to static braces, but their bulkiness and restrictive structure inspire designs using "soft" actuation methods. In this paper, we propose a fabric pneumatic artificial muscle-based exosuit design for head-neck support. We describe the design of our prototype and physics-based model, enabling us to derive actuator pressures required to compensate for gravitational load. Our modeled range of motion and workspace analysis indicate that the limited actuator lengths impose slight limitations (83% workspace coverage), and gravity compensation imposes a more significant limitation (43% workspace coverage). We introduce compression force along the neck as a novel, potentially comfort-related metric. We further apply our model to compare the torque output of various actuator placement configurations, allowing us to select a design with stability in lateral deviation and high axial rotation torques. The model correctly predicts trends in measured data where wrapping the actuators around the neck is not a significant factor. Our test dummy and human user demonstration confirm that the exosuit can provide functional head support and trajectory tracking, underscoring the potential of artificial muscle-based soft actuation for head-neck mobility assistance.
comment: Manuscript (8 pages, 5 tables, 7 figures) accepted to IEEE International Conference on Robotics and Automation 2026. Video attachment: https://youtu.be/iGuEbvCXgJ0?si=WqP2q-P_Mp1Brmfc
Learning Actionable Manipulation Recovery via Counterfactual Failure Synthesis
While recent foundation models have significantly advanced robotic manipulation, these systems still struggle to autonomously recover from execution errors. Current failure-learning paradigms rely on either costly and unsafe real-world data collection or simulator-based perturbations, which introduce a severe sim-to-real gap. Furthermore, existing visual analyzers predominantly output coarse, binary diagnoses rather than the executable, trajectory-level corrections required for actual recovery. To bridge the gap between failure diagnosis and actionable recovery, we introduce Dream2Fix, a framework that synthesizes photorealistic, counterfactual failure rollouts directly from successful real-world demonstrations. By perturbing actions within a generative world model, Dream2Fix creates paired failure-correction data without relying on simulators. To ensure the generated data is physically viable for robot learning, we implement a structured verification mechanism that strictly filters rollouts for task validity, visual coherence, and kinematic safety. This engine produces a high-fidelity dataset of over 120k paired samples. Using this dataset, we fine-tune a vision-language model to jointly predict failure types and precise recovery trajectories, mapping visual anomalies directly to corrective actions. Extensive real-world robotic experiments show our approach achieves state-of-the-art correction accuracy, improving from 19.7% to 81.3% over prior baselines, and successfully enables zero-shot closed-loop failure recovery in physical deployments.
Verification and Forward Invariance of Control Barrier Functions for Differential-Algebraic Systems
Differential-algebraic equations (DAEs) arise in power networks, chemical processes, and multibody systems, where algebraic constraints encode physical conservation laws. The safety of such systems is critical, yet safe control is challenging because algebraic constraints restrict allowable state trajectories. Control barrier functions (CBFs) provide computationally efficient safety filters for ordinary differential equation (ODE) systems. However, existing CBF methods are not directly applicable to DAEs due to potential conflicts between the CBF condition and the constraint manifold. This paper introduces DAE-aware CBFs that incorporate the differential-algebraic structure through projected vector fields. We derive conditions that ensure forward invariance of safe sets while preserving algebraic constraints and extend the framework to higher-index DAEs. A systematic verification framework is developed, establishing necessary and sufficient conditions for geometric correctness and feasibility of DAE-aware CBFs. For polynomial systems, sum-of-squares certificates are provided, while for nonpolynomial and neural network candidates, satisfiability modulo theories are used for falsification. The approach is validated on wind turbine and flexible-link manipulator systems.
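For orientation, the conflict between a CBF condition and the constraint manifold can be written out for the simplest case. The block below is a textbook-style sketch for a semi-explicit index-1 DAE, not the paper's projected-vector-field construction; the symbols $f$, $g$, $\phi$, $h$, and $\alpha$ are assumed names that do not appear in the abstract.

```latex
% Assumed semi-explicit DAE with differential state x, algebraic state z, input u:
\dot{x} = f(x,z) + g(x,z)\,u, \qquad 0 = \phi(x,z),
\qquad \frac{\partial \phi}{\partial z} \ \text{invertible (index 1)}.

% Differentiating the constraint along trajectories gives the induced motion of z:
\dot{z} = -\left(\frac{\partial \phi}{\partial z}\right)^{-1}
           \frac{\partial \phi}{\partial x}\,\dot{x}.

% For a candidate barrier h(x,z) with safe set \{h \ge 0\}, the derivative along
% constraint-consistent trajectories is
\dot{h} = \left[\frac{\partial h}{\partial x}
          - \frac{\partial h}{\partial z}
            \left(\frac{\partial \phi}{\partial z}\right)^{-1}
            \frac{\partial \phi}{\partial x}\right]
          \bigl(f(x,z) + g(x,z)\,u\bigr),

% and the usual CBF inequality, with \alpha an extended class-\mathcal{K} function,
% is imposed only on the constraint manifold \{\phi = 0\}:
\sup_{u}\ \dot{h}(x,z,u) \ \ge\ -\alpha\bigl(h(x,z)\bigr).
```

When $h$ depends only on $x$, the bracketed term reduces to $\partial h / \partial x$ and the inequality has the same form as the ODE condition, evaluated on the manifold. Higher-index DAEs, which the abstract also covers, need further differentiation of the constraint and are not captured by this sketch.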
Safety-guaranteed and Goal-oriented Semantic Sensing, Communication, and Control for Robotics
Wirelessly connected robotic systems empower robots with real-time intelligence by leveraging remote computing resources for decision-making. However, the data exchange between robots and base stations often overwhelms communication links, introducing latency that undermines real-time response. To tackle this, goal-oriented semantic communication (GSC) has been introduced into wirelessly connected robotic systems to extract and transmit only goal-relevant semantic representations, enhancing communication efficiency and task effectiveness. However, existing GSC approaches focus primarily on optimizing effectiveness metrics while overlooking safety requirements, which should be treated as the top priority in real-world robotic systems. To bridge this gap, we propose safety-guaranteed and goal-oriented semantic communication for wirelessly connected robotic systems, aiming to maximize the robotic task effectiveness subject to practical operational safety requirements. We first summarize the general safety requirements and effectiveness metrics across typical robotic tasks, including robot arm grasping, unmanned aerial vehicle (UAV)-assisted tasks, and multi-robot exploration. We then systematically analyze the unique safety and effectiveness challenges faced by wirelessly connected robotic systems in sensing, communication, and control. Based on these, we further present potential safety-guaranteed and goal-oriented sensing, communication, and control solutions. Finally, a UAV target tracking case study validates that our proposed GSC solutions can significantly improve safety rate and tracking success rate by more than 2 times and 4.5 times, respectively.
comment: 7 pages. This paper has been submitted to the IEEE Communications Magazine
Spatially Grounded Long-Horizon Task Planning in the Wild
Recent advances in robot manipulation increasingly leverage Vision-Language Models (VLMs) for high-level reasoning, such as decomposing task instructions into sequential action plans expressed in natural language that guide downstream low-level motor execution. However, current benchmarks do not assess whether these plans are spatially executable, particularly in specifying the exact spatial locations where the robot should interact to execute the plan, limiting evaluation of real-world manipulation capability. To bridge this gap, we define a novel task of grounded planning and introduce GroundedPlanBench, a newly curated benchmark for spatially grounded long-horizon action planning in the wild. GroundedPlanBench jointly evaluates hierarchical sub-action planning and spatial action grounding (where to act), enabling systematic assessment of whether generated sub-actions are spatially executable for robot manipulation. We further introduce Video-to-Spatially Grounded Planning (V2GP), an automated data generation framework that leverages real-world robot video demonstrations to improve spatially grounded long-horizon planning. Our evaluations reveal that spatially grounded long-horizon planning remains a major bottleneck for current VLMs. Our results demonstrate that V2GP provides a promising approach for improving both action planning and spatial grounding performance, validated on our benchmark as well as through real-world robot manipulation experiments, advancing progress toward spatially actionable planning.
comment: 9 pages, 7 figures
Better Safe Than Sorry: Enhancing Arbitration Graphs for Safe and Robust Autonomous Decision-Making
This paper introduces an extension to the arbitration graph framework designed to enhance the safety and robustness of autonomous systems in complex, dynamic environments. Building on the flexibility and scalability of arbitration graphs, the proposed method incorporates a verification step and structured fallback layers in the decision-making process. This ensures that only verified and safe commands are executed while enabling graceful degradation in the presence of unexpected faults or bugs. The approach is demonstrated using a Pac-Man simulation and further validated in the context of autonomous driving, where it shows significant reductions in accident risk and improvements in overall system safety. The bottom-up design of arbitration graphs allows for an incremental integration of new behavior components. The extension presented in this work enables the integration of experimental or immature behavior components while maintaining system safety by clearly and precisely defining the conditions under which behaviors are considered safe. The proposed method is implemented as a ready-to-use header-only C++ library, published under the MIT License. Together with the Pac-Man demo, it is available at github.com/KIT-MRT/arbitration_graphs.
comment: 7 pages, 5 figures, Presented at 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), source code available at github.com/KIT-MRT/arbitration_graphs, v2: Added paragraph discussing the differences between arbitration graphs and behavior trees, v3: Updated version as presented at SMC
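The verification step and fallback layers described in the abstract above can be sketched as a priority arbitrator that only returns commands passing a check. This is an illustrative Python sketch and does not reflect the API of the C++ library; `Behavior`, `arbitrate`, `verify`, and the toy speed commands are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Behavior:
    name: str
    is_applicable: Callable[[dict], bool]  # invocation condition
    get_command: Callable[[dict], float]   # proposed command (here: a target speed)

def arbitrate(behaviors: List[Behavior], state: dict,
              verify: Callable[[dict, float], bool]) -> Optional[Tuple[str, float]]:
    """Return (name, command) of the highest-priority behavior whose command is verified."""
    for behavior in behaviors:  # list order encodes priority
        if not behavior.is_applicable(state):
            continue
        try:
            command = behavior.get_command(state)
        except Exception:
            continue  # a crashing behavior is skipped, like a failed verification
        if verify(state, command):
            return behavior.name, command
    return None  # nothing passed; the caller must handle this case explicitly

# Toy setup: an unsafe experimental planner, a buggy one, and two safe fallbacks.
state = {"speed_limit": 10.0}
behaviors = [
    Behavior("experimental", lambda s: True, lambda s: 25.0),
    Behavior("buggy", lambda s: True, lambda s: 1 / 0),
    Behavior("follow_lane", lambda s: True, lambda s: 8.0),
    Behavior("emergency_stop", lambda s: True, lambda s: 0.0),
]
verify = lambda s, cmd: 0.0 <= cmd <= s["speed_limit"]
selected = arbitrate(behaviors, state, verify)
```

Here the experimental behavior fails verification and the buggy one raises an exception, so the arbitrator degrades to `follow_lane`. Placing an always-applicable, always-verifiable behavior last is one way to realize a final fallback layer.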
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated vision-language-model-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, including textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
comment: Website: https://robotarenainf.github.io
Accelerating Residual Reinforcement Learning with Uncertainty Estimation
Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer. Paper homepage: lakshitadodeja.github.io/uncertainty-aware-residual-rl/
SegDAC: Visual Generalization in Reinforcement Learning via Dynamic Object Tokens
Visual reinforcement learning policies trained on pixel observations often struggle to generalize when visual conditions change at test time. Object-centric representations are a promising alternative, but most approaches use fixed-size slot representations, require image reconstruction, or need auxiliary losses to learn object decompositions. As a result, it remains unclear how to learn RL policies directly from object-level inputs without these constraints. We propose SegDAC, a Segmentation-Driven Actor-Critic that operates on a variable-length set of object token embeddings. At each timestep, text-grounded segmentation produces object masks from which spatially aware token embeddings are extracted. A transformer-based actor-critic processes these dynamic tokens, using segment positional encoding to preserve spatial information across objects. We ablate these design choices and show that both segment positional encoding and variable-length processing are individually necessary for strong performance. We evaluate SegDAC on 8 ManiSkill3 manipulation tasks under 12 visual perturbation types across 3 difficulty levels. SegDAC improves over prior visual generalization methods by 15% on easy, 66% on medium, and 88% on the hardest settings. SegDAC matches the sample efficiency of the state-of-the-art visual RL methods while achieving improved generalization under visual changes. Project Page: https://segdac.github.io/
comment: 12 pages
Dynamic Aware: Adaptive Multi-Mode Out-of-Distribution Detection for Trajectory Prediction in Autonomous Vehicles
Trajectory prediction is central to the safe and seamless operation of autonomous vehicles (AVs). In deployment, however, prediction models inevitably face distribution shifts between training data and real-world conditions, where rare or underrepresented traffic scenarios induce out-of-distribution (OOD) cases. While most prior OOD detection research in AVs has concentrated on computer vision tasks such as object detection and segmentation, trajectory-level OOD detection remains largely underexplored. A recent study formulated this problem as a quickest change detection (QCD) task, providing formal guarantees on the trade-off between detection delay and false alarms [1]. Building on this foundation, we propose a new framework that introduces adaptive mechanisms to achieve robust detection in complex driving environments. Empirical analysis across multiple real-world datasets reveals that prediction errors -- even on in-distribution samples -- exhibit mode-dependent distributions that evolve over time with dataset-specific dynamics. By explicitly modeling these error modes, our method achieves substantial improvements in both detection delay and false alarm rates. Comprehensive experiments on established trajectory prediction benchmarks show that our framework significantly outperforms prior UQ- and vision-based OOD approaches in both accuracy and computational efficiency, offering a practical path toward reliable, driving-aware autonomy.
comment: 8 pages, 7 figures
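The quickest change detection (QCD) formulation mentioned in the abstract above can be illustrated with the classical CUSUM procedure. The sketch below is generic and is not the adaptive, mode-aware method the paper proposes: it assumes a single Gaussian error model before the shift (`mu0`) and after it (`mu1`) with a shared standard deviation, and all function names and parameters are hypothetical.

```python
def gaussian_llr(x, mu0, mu1, sigma):
    """Log-likelihood ratio log p1(x) / p0(x) for two Gaussians sharing sigma."""
    return ((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2)


def cusum_alarm_time(errors, mu0, mu1, sigma, threshold):
    """Return the first index where the CUSUM statistic exceeds threshold.

    The statistic accumulates evidence for the post-change model and is
    clipped at zero, so in-distribution samples keep it near zero.
    Returns None if no alarm is raised.
    """
    statistic = 0.0
    for t, x in enumerate(errors):
        statistic = max(0.0, statistic + gaussian_llr(x, mu0, mu1, sigma))
        if statistic > threshold:
            return t
    return None


# Toy prediction-error stream: the shift happens at index 10.
errors = [0.5] * 10 + [2.0] * 10
print(cusum_alarm_time(errors, mu0=0.5, mu1=2.0, sigma=0.5, threshold=5.0))
```

Raising `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the QCD formulation makes explicit.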
DriveMind: A Dual Visual Language Model-based Reinforcement Learning Framework for Autonomous Driving
End-to-end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision-language-guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, a unified semantic reward framework that integrates: (i) a contrastive Vision-Language Model (VLM) encoder for stepwise semantic anchoring; (ii) a novelty-triggered VLM encoder-decoder, fine-tuned via chain-of-thought (CoT) distillation, for dynamic prompt generation upon semantic drift; (iii) a hierarchical safety module enforcing kinematic constraints (e.g., speed, lane centering, stability); and (iv) a compact predictive world model to reward alignment with anticipated ideal states. DriveMind achieves 19.4 +/- 2.3 km/h average speed, 0.98 +/- 0.03 route completion, and near-zero collisions in CARLA Town 2, outperforming baselines by over 4% in success rate. Its semantic reward generalizes zero-shot to real dash-cam data with minimal distributional shift, demonstrating robust cross-domain alignment and potential for real-world deployment.
comment: Submitted to IEEE Transactions on Intelligent Vehicles (T-IV)
Safe Interaction via Monte Carlo Linear-Quadratic Games
Safety is critical during human-robot interaction. But -- because people are inherently unpredictable -- it is often difficult for robots to plan safe behaviors. Instead of relying on our ability to anticipate humans, here we identify robot policies that are robust to unexpected human decisions. We achieve this by formulating human-robot interaction as a zero-sum game, where (in the worst case) the human's actions directly conflict with the robot's objective. Solving for the Nash Equilibrium of this game provides robot policies that maximize safety and performance across a wide range of human actions. Existing approaches attempt to find these optimal policies by leveraging Hamilton-Jacobi analysis (which is intractable) or linear-quadratic approximations (which are inexact). By contrast, in this work we propose a computationally efficient and theoretically justified method that converges towards the Nash Equilibrium policy. Our approach (which we call MCLQ) leverages linear-quadratic games to obtain an initial guess at safe robot behavior, and then iteratively refines that guess with a Monte Carlo search. Not only does MCLQ provide real-time safety adjustments, but it also enables the designer to tune how conservative the robot is -- preventing the system from focusing on unrealistic human behaviors. Our simulations and user study suggest that this approach advances safety in terms of both computation time and expected performance. See videos of our experiments here: https://youtu.be/KJuHeiWVuWY.
Continuous Design and Reprogramming of Totimorphic Structures for Space Applications
Recently, a class of mechanical lattices with reconfigurable, zero-stiffness structures has been proposed, called Totimorphic lattices. In this work, we introduce a computational framework that enables continuous reprogramming of a Totimorphic lattice's effective properties, such as mechanical and optical behaviour, through geometric changes alone, demonstrated using computer simulations. Our approach is differentiable and guarantees valid Totimorphic configurations throughout the optimisation process, providing not only target states with desired properties but also continuous trajectories in configuration space that connect them. This enables reprogrammable structures in which actuators are controlled via automatic differentiation on an objective-dependent cost function, continuously adapting the lattice to achieve a given goal. We focus on deep space applications, where harsh and resource-constrained environments demand solutions that combine flexibility, efficiency, and autonomy. As proof of concept, we present two scenarios: a reprogrammable disordered lattice material and a space telescope mirror with adjustable focal length. The introduced framework is adaptable to a wide range of Totimorphic designs and objectives, providing a lightweight model for endowing physical systems with autonomous self-configuration and self-repair capabilities.
comment: Code: https://github.com/esa/LattyMorph/tree/main
Beyond Static Instruction: A Multi-agent AI Framework for Adaptive Augmented Reality Robot Training
Augmented Reality (AR) offers powerful visualization capabilities for industrial robot training, yet current interfaces remain predominantly static, failing to account for learners' diverse cognitive profiles. In this paper, we present an AR application for robot training and propose a multi-agent AI framework for future integration that bridges the gap between static visualization and pedagogical intelligence. We report on the evaluation of the baseline AR interface with 36 participants performing a robotic pick-and-place task. While overall usability was high, notable disparities in task duration and learner characteristics highlighted the necessity for dynamic adaptation. To address this, we propose a multi-agent framework that orchestrates multiple components to perform complex preprocessing of multimodal inputs (e.g., voice, physiology, robot data) and adapt the AR application to the learner's needs. By utilizing autonomous Large Language Model (LLM) agents, the proposed system would dynamically adapt the learning environment based on advanced LLM reasoning in real-time.
Reference-Free Sampling-Based Model Predictive Control
We present a sampling-based model predictive control (MPC) framework that enables emergent locomotion without relying on handcrafted gait patterns or predefined contact sequences. Our method discovers diverse motion patterns, ranging from trotting to galloping, robust standing policies, jumping, and handstand balancing, purely through the optimization of high-level objectives. Building on model predictive path integral (MPPI), we propose a cubic Hermite spline parameterization that operates on position and velocity control points. Our approach enables contact-making and contact-breaking strategies that adapt automatically to task requirements, requiring only a limited number of sampled trajectories. This sample efficiency enables real-time control on standard CPU hardware, eliminating the GPU acceleration typically required by other state-of-the-art MPPI methods. We validate our approach on the Go2 quadrupedal robot, demonstrating a range of emergent gaits and basic jumping capabilities. In simulation, we further showcase more complex behaviors, such as backflips, dynamic handstand balancing and locomotion on a Humanoid, all without requiring reference tracking or offline pre-training.
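The cubic Hermite spline parameterization mentioned above interpolates between control points that each carry a position and a velocity. The following is a minimal scalar sketch of that interpolation using the standard Hermite basis functions; the function names, the knot layout, and the clamping behavior are assumptions for illustration and do not reproduce the paper's MPPI implementation.

```python
import bisect


def hermite_segment(p0, v0, p1, v1, duration, s):
    """Position on one cubic Hermite segment at normalized time s in [0, 1].

    p0, p1 are end positions and v0, v1 end velocities. Velocities are
    scaled by the segment duration because s is normalized.
    """
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * duration * v0 + h01 * p1 + h11 * duration * v1


def eval_spline(knot_times, positions, velocities, t):
    """Evaluate a piecewise cubic Hermite spline at time t (clamped to range)."""
    t = min(max(t, knot_times[0]), knot_times[-1])
    # Index of the segment containing t; the last knot maps to the last segment.
    i = min(bisect.bisect_right(knot_times, t) - 1, len(knot_times) - 2)
    duration = knot_times[i + 1] - knot_times[i]
    s = (t - knot_times[i]) / duration
    return hermite_segment(positions[i], velocities[i],
                           positions[i + 1], velocities[i + 1], duration, s)


# Three control points: rise from 0 to 1 and return, with zero end velocities.
knots = [0.0, 1.0, 2.0]
pos = [0.0, 1.0, 0.0]
vel = [0.0, 0.0, 0.0]
print([eval_spline(knots, pos, vel, t) for t in (0.0, 0.5, 1.0, 1.5, 2.0)])
```

In a sampling-based controller, perturbing the position and velocity control points (rather than every timestep of the command) yields smooth candidate trajectories from a small number of sampled parameters.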
IROSA: Interactive Robot Skill Adaptation using Natural Language
Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.
comment: Accepted to the IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing
Guided Policy Optimization under Partial Observability
Reinforcement Learning (RL) in partially observable environments poses significant challenges due to the complexity of learning under uncertainty. While additional information, such as that available in simulations, can enhance training, effectively leveraging it remains an open problem. To address this, we introduce Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner. The guider takes advantage of privileged information while ensuring alignment with the learner's policy that is primarily trained via imitation learning. We theoretically demonstrate that this learning scheme achieves optimality comparable to direct RL, thereby overcoming key limitations inherent in existing approaches. Empirical evaluations show strong performance of GPO across various tasks, including continuous control with partial observability and noise, and memory-based challenges, significantly outperforming existing methods.
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmarks that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions ICRA 2026
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: To appear at ICRA 2026
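To make the idea of a closed-form safety filter concrete, the sketch below shows the textbook minimum-norm solution of a CBF quadratic program with a single affine constraint. It is an illustration under stated assumptions, not the expression derived in the paper: the system is a single integrator (state derivative equals the action), the barrier is h(x) = ||x - c||^2 - r^2 for a circular obstacle, and the names `cbf_filter`, `center`, `radius`, and `alpha` are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cbf_filter(u_nom, x, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h(x) . u >= -alpha * h(x).

    Assumes single-integrator dynamics and x != center. For one affine
    constraint a . u >= b, the minimum-norm correction has the closed form
    u = u_nom + max(0, b - a . u_nom) / (a . a) * a.
    """
    d = [xi - ci for xi, ci in zip(x, center)]
    h = dot(d, d) - radius ** 2       # barrier value, positive when safe
    a = [2.0 * di for di in d]        # gradient of h, equal to L_g h here
    b = -alpha * h
    violation = b - dot(a, u_nom)
    if violation <= 0.0:
        return list(u_nom)            # nominal action already satisfies the CBF
    scale = violation / dot(a, a)
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


# The nominal action drives straight at the obstacle and is slowed down.
print(cbf_filter([-5.0, 0.0], x=[2.0, 0.0], center=[0.0, 0.0], radius=1.0))
```

Applied to every action in a training rollout, a filter of this kind lets the policy see only the corrected, constraint-satisfying transitions.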
How Safe Will I Be Given What I Saw? Calibrated Prediction of Safety Chances for Image-Controlled Autonomy
Autonomous robots that rely on deep neural network controllers pose critical challenges for safety prediction, especially under partial observability and distribution shift. Traditional model-based verification techniques are limited in scalability and require access to low-dimensional state models, while model-free methods often lack reliability guarantees. This paper addresses these limitations by introducing a framework for calibrated safety prediction in end-to-end vision-controlled systems, where neither the state-transition model nor the observation model is accessible. Building on the foundation of world models, we leverage variational autoencoders and recurrent predictors to forecast future latent trajectories from raw image sequences and estimate the probability of satisfying safety properties. We distinguish between monolithic and composite prediction pipelines and introduce a calibration mechanism to quantify prediction confidence. In long-horizon predictions from high-dimensional observations, the forecasted inputs to the safety evaluator can deviate significantly from the training distribution due to compounding prediction errors and changing environmental conditions, leading to miscalibrated risk estimates. To address this, we incorporate unsupervised domain adaptation to ensure robustness of safety evaluation under distribution shift in predictions without requiring manual labels. Our formulation provides theoretical calibration guarantees and supports practical evaluation across long prediction horizons. Experimental results on three benchmarks show that our UDA-equipped evaluators maintain high accuracy and substantially lower false positive rates under distribution shift. Similarly, world model-based composite predictors outperform their monolithic counterparts on long-horizon tasks, and our conformal calibration provides reliable statistical bounds.
comment: arXiv admin note: text overlap with arXiv:2308.12252
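The conformal calibration mentioned in the abstract above can be illustrated with generic split conformal prediction. This sketch is not the paper's calibration mechanism: it assumes a held-out calibration set of nonconformity scores (for example, absolute differences between a predicted safety probability and the observed outcome), and the function names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score.

    Under exchangeability, a new score is at most this value with
    probability at least 1 - alpha. Returns infinity when the calibration
    set is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]


def safety_interval(predicted_prob, q):
    """Interval around a predicted safety probability, clipped to [0, 1]."""
    return max(0.0, predicted_prob - q), min(1.0, predicted_prob + q)


# Seven calibration scores and a 75% coverage target.
calibration_scores = [0.30, 0.05, 0.20, 0.10, 0.40, 0.15, 0.25]
q = conformal_quantile(calibration_scores, alpha=0.25)
print(q, safety_interval(0.9, q))
```

The guarantee relies on calibration and test data being exchangeable, which is one reason distribution shift in long-horizon predictions has to be handled separately.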
HumDex: Humanoid Dexterous Manipulation Made Easy
This paper investigates humanoid whole-body dexterous manipulation, where the efficient collection of high-quality demonstration data remains a central bottleneck. Existing teleoperation systems often suffer from limited portability, occlusion, or insufficient precision, which hinders their applicability to complex whole-body tasks. To address these challenges, we introduce HumDex, a portable teleoperation system designed for humanoid whole-body dexterous manipulation. Our system leverages IMU-based motion tracking to address the portability-precision trade-off, enabling accurate full-body tracking while remaining easy to deploy. For dexterous hand control, we further introduce a learning-based retargeting method that generates smooth and natural hand motions without manual parameter tuning. Beyond teleoperation, HumDex enables efficient collection of human motion data. Building on this capability, we propose a two-stage imitation learning framework that first pre-trains on diverse human motion data to learn generalizable priors, and then fine-tunes on robot data to bridge the embodiment gap for precise execution. We demonstrate that this approach significantly improves generalization to new configurations, objects, and backgrounds with minimal data acquisition costs. The entire system is fully reproducible and open-sourced at https://github.com/physical-superintelligence-lab/humdex.
MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment
Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a test-time optimization strategy to enhance long-horizon robustness. To enforce adherence to physical laws, we introduce a GRPO reinforcement learning post-training phase guided by a novel Physical Foresight Coherence (PFC) reward. PFC leverages the V-JEPA2 world model as a physics referee to penalize implausible dynamics in the latent feature space. Experiments confirm MIND-V's SOTA performance in long-horizon simulation and its significant value for policy learning, introducing a scalable and fully autonomous framework for embodied data synthesis.
AOMGen: Photoreal, Physics-Consistent Demonstration Generation for Articulated Object Manipulation CVPR
Recent advances in Vision-Language-Action (VLA) and world-model methods have improved generalization in tasks such as robotic manipulation and object interaction. However, successful execution of such tasks depends on large, costly collections of real demonstrations, especially for fine-grained manipulation of articulated objects. To address this, we present AOMGen, a scalable data generation framework for articulated manipulation which is instantiated from a single real scan, demonstration and a library of readily available digital assets, yielding photoreal training data with verified physical states. The framework synthesizes synchronized multi-view RGB temporally aligned with action commands and state annotations for joints and contacts, and systematically varies camera viewpoints, object styles, and object poses to expand a single execution into a diverse corpus. Experimental results demonstrate that fine-tuning VLA policies on AOMGen data increases the success rate from 0% to 88.7%, and the policies are tested on unseen objects and layouts.
comment: Accepted by CVPR Findings 2026
DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT. Project Page: https://yaoyao-jpg.github.io/dynvla.
comment: 18 pages, 10 figures. Project Page: https://yaoyao-jpg.github.io/dynvla
A Photorealistic Dataset and Vision-Based Algorithm for Anomaly Detection During Proximity Operations in Lunar Orbit ICRA'26
NASA's forthcoming Lunar Gateway space station, which will be uncrewed most of the time, will need to operate with an unprecedented level of autonomy. One key challenge is enabling the Canadarm3, the Gateway's external robotic system, to detect hazards in its environment using its onboard inspection cameras. This task is complicated by the extreme and variable lighting conditions in space. In this paper, we introduce the visual anomaly detection and localization task for the space domain and establish a benchmark based on a synthetic dataset called ALLO (Anomaly Localization in Lunar Orbit). We show that state-of-the-art visual anomaly detection methods often fail in the space domain, motivating the need for new approaches. To address this, we propose MRAD (Model Reference Anomaly Detection), a statistical algorithm that leverages the known pose of the Canadarm3 and a CAD model of the Gateway to generate reference images of the expected scene appearance. Anomalies are then identified as deviations from this model-generated reference. On the ALLO dataset, MRAD surpasses state-of-the-art anomaly detection algorithms, achieving an AP score of 62.9% at the pixel level and an AUROC score of 75.0% at the image level. Given the low tolerance for risk in space operations and the lack of domain-specific data, we emphasize the need for novel, robust, and accurate anomaly detection methods to handle the challenging visual conditions found in lunar orbit and beyond.
comment: In IEEE Robotics and Automation Letters (RA-L) and presented at the IEEE International Conference on Robotics and Automation (ICRA'26), 1-5 Jun. 2026, Vienna, Austria
Real-time Rendering-based Surgical Instrument Tracking via Evolutionary Optimization
Accurate and efficient tracking of surgical instruments is fundamental for Robot-Assisted Minimally Invasive Surgery. Although vision-based robot pose estimation has enabled markerless calibration without tedious physical setups, reliable tool tracking for surgical robots still remains challenging due to partial visibility and specialized articulation design of surgical instruments. Previous works in the field are usually prone to unreliable feature detections under degraded visual quality and data scarcity, whereas rendering-based methods often struggle with computational costs and suboptimal convergence. In this work, we incorporate CMA-ES, an evolutionary optimization strategy, into a versatile tracking pipeline that jointly estimates surgical instrument pose and joint configurations. Using batch rendering to efficiently evaluate multiple pose candidates in parallel, the method significantly reduces inference time and improves convergence robustness. The proposed framework further generalizes to joint angle-free and bi-manual tracking settings, making it suitable for both vision feedback control and online surgery video calibration. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method significantly outperforms prior approaches in both accuracy and runtime.
VIGS-SLAM: Visual Inertial Gaussian Splatting SLAM
We present VIGS-SLAM, a visual-inertial 3D Gaussian Splatting SLAM system that achieves robust real-time tracking and high-fidelity reconstruction. Although recent 3DGS-based SLAM methods achieve dense and photorealistic mapping, their purely visual design degrades under challenging conditions such as motion blur, low texture, and exposure variations. Our method tightly couples visual and inertial cues within a unified optimization framework, jointly optimizing camera poses, depths, and IMU states. It features robust IMU initialization, time-varying bias modeling, and loop closure with consistent Gaussian updates. Experiments on five challenging datasets demonstrate our superiority over state-of-the-art methods. Project page: https://vigs-slam.github.io
comment: Project page: https://vigs-slam.github.io
From Ellipsoids to Midair Control of Dynamic Hitches
The ability to manipulate and interlace cables using aerial vehicles can greatly improve aerial transportation tasks. Such interlacing cables create hitches by winding two or more cables around each other, which can enclose payloads or can further develop into knots. Dynamic modeling and control of such hitches are key to mastering inter-cable interactions in the context of cable-suspended aerial manipulation. This paper introduces an ellipsoid-based kinematic model to connect the geometric nature of a hitch created by two cables and the dynamics of the hitch driven by four aerial vehicles, which reveals the control-affine form of the system. As the constraint for maintaining tension of a cable is also control-affine, we design a quadratic programming-based controller that combines Control Lyapunov and High-Order Control Barrier Functions (CLF-HOCBF-QP) to precisely track a desired hitch position and system shape while enforcing safety constraints like cable tautness. We convert desired geometric reference configurations into target robot positions and introduce a composite error into the Lyapunov function to ensure a relative degree of one to the input. Numerical simulations validate our approach, demonstrating stable, high-speed tracking of dynamic references.
UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies
We introduce UMI-on-Air, a framework for embodiment-aware deployment of embodiment-agnostic manipulation policies. Our approach leverages diverse, unconstrained human demonstrations collected with a handheld gripper (UMI) to train generalizable visuomotor policies. A central challenge in transferring these policies to constrained robotic embodiments, such as aerial manipulators, is the mismatch in control and robot dynamics, which often leads to out-of-distribution behaviors and poor execution. To address this, we propose Embodiment-Aware Diffusion Policy (EADP), which couples a high-level UMI policy with a low-level embodiment-specific controller at inference time. By integrating gradient feedback from the controller's tracking cost into the diffusion sampling process, our method steers trajectory generation towards dynamically feasible modes tailored to the deployment embodiment. This enables plug-and-play, embodiment-aware trajectory adaptation at test time. We validate our approach on multiple long-horizon and high-precision aerial manipulation tasks, showing improved success rates, efficiency, and robustness under disturbances compared to unguided diffusion baselines. Finally, we demonstrate deployment in previously unseen environments, using UMI demonstrations collected in the wild, highlighting a practical pathway for scaling generalizable manipulation skills across diverse, and even highly constrained, embodiments. All code, data, checkpoints, and result videos can be found at umi-on-air.github.io.
comment: Result videos can be found at umi-on-air.github.io
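The guidance idea described in the UMI-on-Air abstract (feeding the gradient of a tracking cost into diffusion sampling) can be illustrated with a toy sketch. This is not the authors' implementation: the denoiser is a placeholder that only shrinks the sample, the "tracking cost" is a hypothetical quadratic penalty on jumps between consecutive waypoints, and every name (`toy_denoise`, `tracking_cost`, `guided_step`, `guidance_scale`) is an assumption made for illustration.

```python
import numpy as np

def tracking_cost(traj):
    """Hypothetical stand-in for a controller's tracking cost:
    penalize large jumps between consecutive waypoints."""
    diffs = np.diff(traj, axis=0)
    return 0.5 * float(np.sum(diffs ** 2))

def tracking_cost_grad(traj):
    """Analytic gradient of tracking_cost with respect to every waypoint."""
    diffs = np.diff(traj, axis=0)
    grad = np.zeros_like(traj)
    grad[:-1] -= diffs   # contribution of the jump leaving each waypoint
    grad[1:] += diffs    # contribution of the jump arriving at each waypoint
    return grad

def toy_denoise(traj, shrink=0.9):
    """Placeholder for one reverse-diffusion update.
    A real policy would call its trained network here."""
    return shrink * traj

def guided_step(traj, guidance_scale=0.1):
    """One denoising update followed by a gradient step that lowers the cost."""
    denoised = toy_denoise(traj)
    return denoised - guidance_scale * tracking_cost_grad(denoised)

rng = np.random.default_rng(0)
sample = rng.normal(size=(16, 3))   # 16 noisy waypoints in 3-D
unguided = toy_denoise(sample)
guided = guided_step(sample)
print(tracking_cost(unguided), tracking_cost(guided))
```

In this sketch the guided sample always has a lower toy cost than the unguided one, because the step size is small relative to the curvature of the quadratic penalty. How the real method schedules or scales the guidance is not stated in the abstract.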
Concurrent Prehensile and Nonprehensile Manipulation: A Practical Approach to Multi-Stage Dexterous Tasks
Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize across varied object geometries and placements. We present DexMulti, a sample-efficient approach for real-world dexterous multi-task manipulation that decomposes demonstrations into object-centric skills with well-defined temporal boundaries. Rather than learning monolithic policies, our method retrieves demonstrated skills based on current object geometry, aligns them to the observed object state using an uncertainty-aware estimator that tracks centroid and yaw, and executes them via a retrieve-align-execute paradigm. We evaluate on three multi-stage tasks requiring concurrent manipulation (Grasp + Pull, Grasp + Open, and Grasp + Grasp) across two dexterous hands (Allegro and LEAP) in over 1,000 real-world trials. Our approach achieves an average success rate of 66% on training objects with only 3-4 demonstrations per object, outperforming diffusion policy baselines by 2-3x while requiring far fewer demonstrations. Results demonstrate robust generalization to held-out objects and spatial variations up to +/-25 cm.
comment: 12 pages, 6 figures
A Human-in-the-Loop Confidence-Aware Failure Recovery Framework for Modular Robot Policies
Robots operating in unstructured human environments inevitably encounter failures, especially in robot caregiving scenarios. While humans can often help robots recover, excessive or poorly targeted queries impose unnecessary cognitive and physical workload on the human partner. We present a human-in-the-loop failure-recovery framework for modular robotic policies, where a policy is composed of distinct modules such as perception, planning, and control, any of which may fail and often require different forms of human feedback. Our framework integrates calibrated estimates of module-level uncertainty with models of human intervention cost to decide which module to query and when to query the human. It separates these two decisions: a module selector identifies the module most likely responsible for failure, and a querying algorithm determines whether to solicit human input or act autonomously. We evaluate several module-selection strategies and querying algorithms in controlled synthetic experiments, revealing trade-offs between recovery efficiency, robustness to system and user variables, and user workload. Finally, we deploy the framework on a robot-assisted bite acquisition system and demonstrate, in studies involving individuals with both emulated and real mobility limitations, that it improves recovery success while reducing the workload imposed on users. Our results highlight how explicitly reasoning about both robot uncertainty and human effort can enable more efficient and user-centered failure recovery in collaborative robots. Supplementary materials and videos can be found at: http://emprise.cs.cornell.edu/modularhil
comment: The second and third authors contributed equally. The last two authors advised equally
Multiagent Systems
Conflict Mitigation in Shared Environments using Flow-Aware Multi-Agent Path Finding ICRA 2026
Deploying multi-robot systems in environments shared with dynamic and uncontrollable agents presents significant challenges, especially for large robot fleets. In such environments, individual robot operations can be delayed due to unforeseen conflicts with uncontrollable agents. While existing research primarily focuses on preserving the completeness of Multi-Agent Path Finding (MAPF) solutions considering delays, there is limited emphasis on utilizing additional environmental information to enhance solution quality in the presence of other dynamic agents. To this end, we propose Flow-Aware Multi-Agent Path Finding (FA-MAPF), a novel framework that integrates learned motion patterns of uncontrollable agents into centralized MAPF algorithms. Our evaluation, conducted on a diverse set of benchmark maps with simulated uncontrollable agents and on a real-world map with recorded human trajectories, demonstrates the effectiveness of FA-MAPF compared to state-of-the-art baselines. The experimental results show that FA-MAPF can consistently reduce conflicts with uncontrollable agents by up to 55% without compromising task efficiency.
comment: To be presented at ICRA 2026
Collaborative Multi-Agent Optimization for Personalized Memory System
Memory systems are crucial to personalized LLMs by mitigating the context window limitation in capturing long-term user-LLM conversations. Typically, such systems leverage multiple agents to handle multi-granular memory construction and personalized memory retrieval tasks. To optimize the system, existing methods focus on specializing agents on their local tasks independently via prompt engineering or fine-tuning. However, they overlook cross-agent collaboration, where independent optimization on local agents hardly guarantees the global system performance. To address this issue, we propose a Collaborative Reinforcement Learning Framework for Multi-Agent Memory Systems (CoMAM), jointly optimizing local agents to facilitate collaboration. Specifically, we regularize agents' execution as a sequential Markov decision process (MDP) to embed inter-agent dependencies into the state transition, yielding both local task rewards (e.g., information coverage for memory construction) and global rewards (i.e., query-answer accuracy). Then, we quantify each agent's contribution via group-level ranking consistency between local and global rewards, treating them as adaptive weights to assign global credit and integrate local-global rewards. Each agent is optimized by these integrated rewards, aligning local improvements with the global performance. Experiments show CoMAM outperforms leading memory systems, validating the efficacy of our proposed collaborative reinforcement learning for joint optimization.
Feynman: Knowledge-Infused Diagramming Agent for Scalable Visual Designs ICLR 2025
Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components ("ideas") and performs code planning based on the ideas. Given the plan, Feynman translates ideas into simple declarative programs and iterates to receive feedback and visually refine diagrams. Finally, the declarative programs are rendered by the Penrose diagramming system. The optimization-based rendering of Penrose preserves the visual semantics while injecting fresh randomness into the layout, thereby producing diagrams with visual consistency and diversity. As a result, Feynman can author diagrams along with grounded captions with very little cost and time. Using Feynman, we synthesized a dataset with more than 100k well-aligned diagram-caption pairs. We also curate a visual-language benchmark, Diagramma, from freshly generated data. Diagramma can be used for evaluating the visual reasoning capabilities of vision-language models. We plan to release the dataset, benchmark, and the full agent pipeline as an open-source project.
comment: A previous version was submitted to ICLR 2025
A Generative Model of Conspicuous Consumption and Status Signaling
Status signaling drives human behavior and the allocation of scarce resources such as mating opportunities, yet the generative mechanisms governing how specific goods, signals, or behaviors acquire prestige remain a puzzle. Classical frameworks, such as Costly Signaling Theory, treat preferences as fixed and struggle to explain how semiotic meaning changes based on context or drifts dynamically over time, occasionally reaching tipping points. In this work, we propose a computational theory of status grounded in the theory of appropriateness, positing that status symbols emerge endogenously through a feedback loop of social observation and predictive pattern completion. We validate this theory using simulations of groups of Large Language Model (LLM)-based agents in the Concordia framework. By experimentally manipulating social visibility within naturalistic agent daily routines, we demonstrate that social interactions transform functional demand into status-seeking behavior. We observe the emergence of price run-ups and positive price elasticity (Veblen effects) for both real-world luxury items and procedurally generated synthetic goods, ruling out pretraining bias as the sole driver. Furthermore, we demonstrate that "influencer" agents can drive the endogenous formation of distinct subcultures through targeted sanctioning, and find that similar social influence effects generalize to non-monetary signaling behaviors. This work provides a generative bridge between micro-level cognition and macro-level economic and sociological phenomena, offering a new methodology for forecasting how cultural conventions emerge from interaction.
comment: 29 pages, 13 figures
LLM Constitutional Multi-Agent Governance
Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi-Agent Governance (CMAG), a two-stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized-utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale-free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation-autonomy trade-off space, and governance reduces hub-periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM-mediated influence produces ethically stable outcomes rather than manipulative equilibria.
comment: Accepted for publication in 20th International Conference on Agents and Multi-Agent Systems: Technologies and Applications (AMSTA 2026), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer
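A minimal sketch of the scoring idea in the CMAG abstract follows. The abstract says only that ECS is a "multiplicative composite" of cooperation, autonomy, integrity, and fairness; the plain product used here is an assumption, and the paper may weight or transform the components. The function names and the example component values in the last two calls are hypothetical. The relative-improvement check uses the two ECS values reported in the abstract.

```python
def ethical_cooperation_score(cooperation, autonomy, integrity, fairness):
    """Plain product of four components in [0, 1].
    Assumed form: the abstract only says 'multiplicative composite'."""
    for value in (cooperation, autonomy, integrity, fairness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each component must lie in [0, 1]")
    return cooperation * autonomy * integrity * fairness

def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline

# ECS values reported in the abstract: CMAG 0.741 vs. unconstrained 0.645.
gain = relative_improvement(0.741, 0.645)
print(f"{gain:.1%}")  # 14.9%

# Hypothetical inputs: a product penalizes cooperation bought with low autonomy.
manipulative = ethical_cooperation_score(0.9, 0.5, 1.0, 1.0)
governed = ethical_cooperation_score(0.7, 0.9, 1.0, 1.0)
print(manipulative < governed)  # True
```

A product, unlike a sum, cannot be rescued by one strong component when another is weak, which matches the abstract's description of penalizing cooperation achieved through manipulative means.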
Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets
Twitter (now X) has become an important source of social media data for situational awareness during crises. Crisis informatics research has widely used tweets from Twitter to develop and evaluate artificial intelligence (AI) systems for various crisis-relevant tasks, such as extracting locations and estimating damage levels from tweets to support damage assessment. However, recent changes in Twitter's data access policies have made it increasingly difficult to curate real-world tweet datasets related to crises. Moreover, existing curated tweet datasets are limited to past crisis events in specific contexts and are costly to annotate at scale. These limitations constrain the development and evaluation of AI systems used in crisis informatics. To address these limitations, we introduce an agentic workflow for generating crisis-related synthetic tweet datasets. The workflow iteratively generates synthetic tweets conditioned on prespecified target characteristics, evaluates them using predefined compliance checks, and incorporates structured feedback to refine them in subsequent iterations. As a case study, we apply the workflow to generate synthetic tweet datasets relevant to post-earthquake damage assessment. We show that the workflow can generate synthetic tweets that capture their target labels for location and damage level. We further demonstrate that the resulting synthetic tweet datasets can be used to evaluate AI systems on damage assessment tasks like geolocalization and damage level prediction. Our results indicate that the workflow offers a flexible and scalable alternative to real-world tweet data curation, enabling the systematic generation of synthetic social media data across diverse crisis events, societal contexts, and crisis informatics applications.
Hybrid topology control: a dynamic leader-based distributed edge-addition and deletion mechanism
Coordinated operations of multi-robot systems (MRS) require agents to maintain communication connections to accomplish team objectives. However, maintaining the connections imposes costs in terms of restricted robot mobility, resulting in suboptimal team performance. In this work, we consider a realistic MRS framework in which agents are subject to unknown dynamical disturbances and experience communication delays. Most existing works on connectivity maintenance use consensus-based frameworks for graph reconfiguration, where decision-making time scales with the number of nodes and requires multiple rounds of communication, making them ineffective under communication delays. To address this, we propose a novel leader-based decision-making algorithm that uses a central node for efficient real-time reconfiguration, reducing decision-making time to depend on the graph diameter rather than the number of nodes and requiring only one round of information transfer through the network. We propose a novel method for estimating robot locations within the MRS that actively accounts for unknown disturbances and the communication delays. Using these position estimates, the central node selects a set of edges to delete while allowing the formation of new edges, aiming to keep the diameter of the new graph within a threshold. We provide numerous simulation results to showcase the efficacy of the proposed method.
comment: Under review
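The diameter constraint mentioned in the hybrid topology control abstract can be sketched as follows. This is an illustrative assumption, not the paper's algorithm: the greedy deletion order, the function names (`diameter`, `prune_edges`), and the example graph are all hypothetical, and the sketch ignores position estimation, disturbances, and delays.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest BFS hop count from source; None if some node is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    if len(dist) < len(adj):
        return None
    return max(dist.values())

def diameter(adj):
    """Graph diameter in hops, or None when the graph is disconnected."""
    worst = 0
    for node in adj:
        ecc = eccentricity(adj, node)
        if ecc is None:
            return None
        worst = max(worst, ecc)
    return worst

def prune_edges(adj, candidates, max_diameter):
    """Greedily delete candidate edges while the diameter stays within bound."""
    adj = {n: set(nbrs) for n, nbrs in adj.items()}  # work on a copy
    removed = []
    for u, v in candidates:
        if v not in adj[u]:
            continue
        adj[u].discard(v)
        adj[v].discard(u)
        d = diameter(adj)
        if d is None or d > max_diameter:
            adj[u].add(v)   # undo: deletion would disconnect or stretch the graph
            adj[v].add(u)
        else:
            removed.append((u, v))
    return adj, removed

# Example: a 4-cycle 0-1-2-3-0 with one chord 0-2.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
pruned, removed = prune_edges(graph, [(0, 2), (0, 1), (2, 3)], max_diameter=2)
print(removed)  # [(0, 2)]
```

Only the chord can be dropped here: removing either cycle edge afterwards would turn the graph into a path of diameter 3, which exceeds the threshold.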
GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space
In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at https://github.com/KingScar/GT-Space.
JCAS-MARL: Joint Communication and Sensing UAV Networks via Resource-Constrained Multi-Agent Reinforcement Learning
Multi-UAV networks are increasingly deployed for large-scale inspection and monitoring missions, where operational performance depends on the coordination of sensing reliability, communication quality, and energy constraints. In particular, the rapid increase in overflowing waste bins and illegal dumping sites has created a need for efficient detection of waste hotspots. In this work, we introduce JCAS-MARL, a resource-aware multi-agent reinforcement learning (MARL) framework for joint communication and sensing (JCAS)-enabled UAV networks. Within this framework, multiple UAVs operate in a shared environment where each agent jointly controls its trajectory and the resource allocation of an OFDM waveform used simultaneously for sensing and communication. Battery consumption, charging behavior, and associated CO$_2$ emissions are incorporated into the system state to model realistic operational constraints. Information sharing occurs over a dynamic communication graph determined by UAV positions and wireless channel conditions. Waste hotspot detection requires consensus among multiple UAVs to improve reliability. Using this environment, we investigate how MARL policies exploit the sensing-communication-energy trade-off in JCAS-enabled UAV networks. Simulation results demonstrate that adaptive pilot-density control learned by the agents can outperform static configurations, particularly in scenarios where sensing accuracy and communication connectivity vary across the environment.
comment: 6 pages, 8 figures, submitted to the conference
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
comment: First two authors contributed equally
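One way to read the comparison described in the steganography abstract is as a difference of utilities. The notation below is an assumption introduced for illustration; the paper defines the gap through generalised $\mathcal{V}$-information and may normalise or condition it differently.

```latex
% Assumed notation (not taken from the paper):
%   Z                : the possibly steganographic signal
%   U_{\mathrm{dec}} : expected downstream utility of an agent that can decode Z
%   U_{\mathrm{non}} : expected downstream utility of an agent that cannot
\[
  \Delta_{\mathrm{steg}}(Z) \;=\; U_{\mathrm{dec}}(Z) \;-\; U_{\mathrm{non}}(Z)
\]
```

Under this reading, a gap near zero suggests the signal carries no extra usable information for the decoding agent, while a large positive gap suggests hidden content that only that agent can exploit.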
Multi-Agent Guided Policy Optimization
Due to practical constraints such as partial observability and limited communication, Centralized Training with Decentralized Execution (CTDE) has become the dominant paradigm in cooperative Multi-Agent Reinforcement Learning (MARL). However, existing CTDE methods often underutilize centralized training or lack theoretical guarantees. We propose Multi-Agent Guided Policy Optimization (MAGPO), a novel framework that better leverages centralized training by integrating centralized guidance with decentralized execution. MAGPO uses an autoregressive joint policy for scalable, coordinated exploration and explicitly aligns it with decentralized policies to ensure deployability under partial observability. We provide theoretical guarantees of monotonic policy improvement and empirically evaluate MAGPO on 43 tasks across 6 diverse environments. Results show that MAGPO consistently outperforms strong CTDE baselines and matches or surpasses fully centralized approaches, offering a principled and practical solution for decentralized multi-agent learning. Our code and experimental data can be found at https://github.com/liyheng/MAGPO.
Integration of TinyML and LargeML: A Survey of 6G and Beyond
The evolution from fifth-generation (5G) to sixth-generation (6G) networks is driving an unprecedented demand for advanced machine learning (ML) solutions. Deep learning has already demonstrated significant impact across mobile networking and communication systems, enabling intelligent services such as smart healthcare, smart grids, autonomous vehicles, aerial platforms, digital twins, and the metaverse. At the same time, the rapid proliferation of resource-constrained Internet-of-Things (IoT) devices has accelerated the adoption of tiny machine learning (TinyML) for efficient on-device intelligence, while large machine learning (LargeML) models continue to require substantial computational resources to support large-scale IoT services and ML-generated content. These trends highlight the need for a unified framework that integrates TinyML and LargeML to achieve seamless connectivity, scalable intelligence, and efficient resource management in future 6G systems. This survey provides a comprehensive review of recent advances enabling the integration of TinyML and LargeML in next-generation wireless networks. In particular, we (i) provide an overview of TinyML and LargeML, (ii) analyze the motivations and requirements for unifying these paradigms within the 6G context, (iii) examine efficient bidirectional integration approaches, (iv) review state-of-the-art solutions and their applicability to emerging 6G services, and (v) identify key challenges related to performance optimization, deployment feasibility, resource orchestration, and security. Finally, we outline promising research directions to guide the holistic integration of TinyML and LargeML for intelligent, scalable, and energy-efficient 6G networks and beyond.
comment: This work has been accepted for publication in IEEE Internet of Things Journal under ID: IoT-56661-2025
AutoClimDS: Climate Data Science Agentic AI -- A Knowledge Graph is All You Need
Climate data science remains constrained by fragmented data sources, heterogeneous formats, and steep technical expertise requirements. These barriers slow discovery, limit participation, and undermine reproducibility. We present AutoClimDS, a Minimum Viable Product (MVP) Agentic AI system that addresses these challenges by integrating a curated climate knowledge graph (KG) with a set of Agentic AI workflows designed for cloud-native scientific analysis. The KG unifies datasets, metadata, tools, and workflows into a machine-interpretable structure, while AI agents, powered by generative models, enable natural-language query interpretation, automated data discovery, programmatic data acquisition, and end-to-end climate analysis. A key result is that AutoClimDS can reproduce published scientific figures and analyses from natural-language instructions alone, completing the entire workflow from dataset selection to preprocessing to modeling. When given the same tasks, state-of-the-art general-purpose LLMs (e.g., ChatGPT GPT-5.1) cannot independently identify authoritative datasets or construct valid retrieval workflows using standard web access. This highlights the necessity of structured scientific memory for agentic scientific reasoning. By encoding procedural workflow knowledge into a KG and integrating it with existing technologies (cloud APIs, LLMs, sandboxed execution), AutoClimDS demonstrates that the KG serves as the essential enabling component, the irreplaceable structural foundation, for autonomous climate data science. This approach provides a pathway toward democratizing climate research through human-AI collaboration.
comment: Accepted to IEEE CAI 2026
Context Engineering: From Prompts to Corporate Multi-Agent Architecture
As artificial intelligence (AI) systems evolve from stateless chatbots to autonomous multi-step agents, prompt engineering (PE), the discipline of crafting individual queries, proves necessary but insufficient. This paper introduces context engineering (CE) as a standalone discipline concerned with designing, structuring, and managing the entire informational environment in which an AI agent makes decisions. Drawing on vendor architectures (Google ADK, Anthropic, LangChain), current academic work (ACE framework, Google DeepMind's intelligent delegation), enterprise research (Deloitte, 2026; KPMG, 2026), and the author's experience building a multi-agent system, the paper proposes five context quality criteria: relevance, sufficiency, isolation, economy, and provenance, and frames context as the agent's operating system. Two higher-order disciplines follow. Intent engineering (IE) encodes organizational goals, values, and trade-off hierarchies into agent infrastructure. Specification engineering (SE) creates a machine-readable corpus of corporate policies and standards enabling autonomous operation of multi-agent systems at scale. Together these four disciplines form a cumulative pyramid maturity model of agent engineering, in which each level subsumes the previous one as a necessary foundation. Enterprise data reveals a gap: while 75% of enterprises plan agentic AI deployment within two years (Deloitte, 2026), deployment has surged and retreated as organizations confront scaling complexity (KPMG, 2026). The Klarna case illustrates a dual deficit, contextual and intentional. Whoever controls the agent's context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.
comment: 25 pages, 1 figure
Systems and Control (EESS)
Unifying Decision Making and Trajectory Planning in Automated Driving through Time-Varying Potential Fields
This paper proposes a unified decision-making and local trajectory planning framework based on Time-Varying Artificial Potential Fields (TVAPFs). The TVAPF explicitly models the predicted motion of dynamic obstacles over the planning horizon via bounded uncertainty, using information from perception and V2X sources when available. TVAPFs are embedded into a finite-horizon optimal control problem that jointly selects the driving maneuver and computes a feasible, collision-free trajectory. The effectiveness and real-time suitability of the approach are demonstrated through a simulation test in a multi-actor scenario with real road topology, highlighting the advantages of the unified TVAPF-based formulation.
EMT and RMS Modeling of Thyristor Rectifiers for Stability Analysis of Converter-Based Systems
Thyristor rectifiers are a well-established and cost-effective solution for controlled high-power rectification, commonly used for hydrogen electrolysis and HVDC transmission. However, small-signal modeling and analysis of thyristor rectifiers remain challenging due to their line-commutated operation and nonlinear switching dynamics. This paper first revisits conventional RMS-based modeling of thyristor rectifiers and subsequently proposes a novel nonlinear state-space EMT model in the dq domain that can be linearized for small-signal analysis. The proposed model accurately captures all the relevant dynamic phenomena, including PLL dynamics, the commutation process, and switching delays. It is derived in polar coordinates, offering novel insights into the impact of the PLL and commutation angle on the thyristor rectifier dynamics. We verify the RMS and EMT models against a detailed switching model and demonstrate their applicability through small-signal stability analysis of a modified IEEE 39-bus test system that incorporates thyristor rectifier-interfaced hydrogen electrolyzers, synchronous generators, and grid-forming converters.
From Passive Monitoring to Active Defence: Resilient Control of Manipulators Under Cyberattacks
Cyber-physical robotic systems are vulnerable to false data injection attacks (FDIAs), in which an adversary corrupts sensor signals while evading residual-based passive anomaly detectors such as the chi-squared test. Such stealthy attacks can induce substantial end-effector deviations without triggering alarms. This paper studies the resilience of redundant manipulators to stealthy FDIAs and advances the architecture from passive monitoring to active defence. We formulate a closed-loop model comprising a feedback-linearized manipulator, a steady-state Kalman filter, and a chi-squared-based anomaly detector. Building on this passive monitoring layer, we propose an active control-level defence that attenuates the control input through a monotone function of an anomaly score generated by a novel actuation-projected, measurement-free state predictor. The proposed design provides probabilistic guarantees on nominal actuation loss and preserves closed-loop stability. From the attacker perspective, we derive a convex QCQP for computing one-step optimal stealthy attacks. Simulations on a 6-DOF planar manipulator show that the proposed defence significantly reduces attack-induced end-effector deviation while preserving nominal task performance in the absence of attacks.
A Physics-Based Digital Human Twin for Galvanic-Coupling Wearable Communication Links
This paper presents a systematic characterization of wearable galvanic coupling (GC) channels under narrowband and wideband operation. A physics-consistent digital human twin maps anatomical properties, propagation geometry, and electrode-skin interfaces into complex transfer functions directly usable for communication analysis. Attenuation, phase delay, and group delay are evaluated for longitudinal and radial configurations, and dispersion-induced variability is quantified through attenuation ripple and delay standard deviation metrics versus bandwidth. Results confirm electro-quasistatic, weakly dispersive behavior over 10 kHz-1 MHz. Attenuation is primarily geometry-driven, whereas amplitude ripple and delay variability increase with bandwidth, tightening equalization and synchronization constraints. Interface conditioning (gel and foam) significantly improves amplitude and phase stability, while propagation geometry governs link budget and baseline delay. Overall, the framework quantitatively links tissue electromagnetics to waveform distortion, enabling informed trade-offs among bandwidth, interface design, and transceiver complexity in wearable GC systems.
From AI Weather Prediction to Infrastructure Resilience: A Correction-Downscaling Framework for Tropical Cyclone Impacts
This paper addresses a missing capability in infrastructure resilience: turning fast, global AI weather forecasts into asset-scale, actionable risk. We introduce the AI-based Correction-Downscaling Framework (ACDF), which transforms coarse AI weather prediction (AIWP) into 500-m, unbiased wind fields and transmission tower/line failure probabilities for tropical cyclones. ACDF separates storm-scale bias correction from terrain-aware downscaling, preventing error propagation while restoring sub-kilometer variability that governs structural loading. Tested on 11 typhoons affecting Zhejiang, China under leave-one-storm-out evaluation, ACDF reduces station-scale wind-speed MAE by 38.8% versus Pangu-Weather, matches observation-assimilated mesoscale analyses, yet runs in 25 s per 12-h cycle on a single GPU. In the Typhoon Hagupit case, ACDF reproduced observed high-wind tails, isolated a coastal high-risk corridor, and flagged the line that failed, demonstrating actionable guidance at tower and line scales. ACDF provides an end-to-end pathway from AI global forecasts to operational, impact-based early warning for critical infrastructure.
Reinforcement Learning for Elliptical Cylinder Motion Control Tasks
Controlling devices with limited input has long attracted research attention because the problem is difficult and its solutions are non-trivial. For instance, the inverted pendulum is a benchmark problem in control theory and machine learning. In this work, we focus on the elliptical cylinder and its motion under limited torque. The problem is inspired by untethered magnetic devices, which, owing to distance, must operate with limited input torque. The main goal is to define the control problem of an elliptical cylinder with limited input torque and solve it by Reinforcement Learning. As a classical baseline, we evaluate a two-stage controller composed of an energy-shaping swing-up law and a local Linear Quadratic Regulator (LQR) stabilizer around the target equilibrium. The swing-up controller increases the system's mechanical energy to drive the state toward a neighborhood of the desired equilibrium; a linearization of the nonlinear model then yields an LQR that regulates the angle and angular-rate states to the target orientation with bounded input. This swing-up + LQR policy is a strong, interpretable reference for underactuated systems and serves as a point of comparison to the learned policy under identical limits and parameters. The results show that learning is possible; however, cases such as stabilization in the upward position or rotation by half a turn become very difficult as the mass increases or for ellipses with a strongly unequal perimeter ratio.
On the strict-feedback form of hyperbolic distributed-parameter systems
The paper is concerned with the strict-feedback form of hyperbolic distributed-parameter systems. Such a system structure is well known to be the basis for the recursive backstepping control design for nonlinear ODEs and is also reflected in the Volterra integral transformation used in the backstepping-based stabilization of parabolic PDEs. Although such integral transformations also proved very helpful in deriving state feedback controllers for hyperbolic PDEs, they are not necessarily related to a strict-feedback form. Therefore, the paper looks at structural properties of hyperbolic systems in the context of controllability. By combining and extending existing backstepping results, exactly controllable heterodirectional hyperbolic PDEs as well as PDE-ODE systems are mapped into strict-feedback form. While stabilization is not the objective in this paper, the obtained system structure is the basis for a recursive backstepping design and provides new insights into coupling structures of distributed-parameter systems that allow for a simple control design. In that sense, the paper aims to take backstepping for PDEs back to its ODE origin.
comment: Accepted at European Control Conference (ECC 2026)
Dual-Laws Model for a theory of artificial consciousness
Objectively verifying the generative mechanism of consciousness is extremely difficult because of its subjective nature. As long as theories of consciousness focus solely on its generative mechanism, developing a theory remains challenging. We believe that broadening the theoretical scope and enhancing theoretical unification are necessary to establish a theory of consciousness. This study proposes seven questions that theories of consciousness should address: phenomena, self, causation, state, function, contents, and universality. The questions were designed to examine the functional aspects of consciousness and its applicability to system design. Next, we will examine how our proposed Dual-Laws Model (DLM) can address these questions. Based on our theory, we anticipate two unique features of a conscious system: autonomy in constructing its own goals and cognitive decoupling from external stimuli. We contend that systems with these capabilities differ fundamentally from machines that merely follow human instructions. This makes a design theory that enables high moral behavior indispensable.
Skill-informed Data-driven Haptic Nudges for High-dimensional Human Motor Learning
In this work, we propose a data-driven skill-informed framework to design optimal haptic nudge feedback for high-dimensional novel motor learning tasks. We first model the stochastic dynamics of human motor learning using an Input-Output Hidden Markov Model (IOHMM), which explicitly decouples latent skill evolution from observable kinematic emissions. Leveraging this predictive model, we formulate the haptic nudge feedback design problem as a Partially Observable Markov Decision Process (POMDP). This allows us to derive an optimal nudging policy that minimizes long-term performance cost, implicitly guiding the learner toward robust regions of the skill space. We validated our approach through a human-subject study ($N=30$) using a high-dimensional hand-exoskeleton task. Results demonstrate that participants trained with the POMDP-derived policy exhibited significantly accelerated task performance compared to groups receiving heuristic-based feedback or no feedback. Furthermore, synergy analysis revealed that the POMDP group discovered efficient low-dimensional motor representations more rapidly.
As Language Models Scale, Low-order Linear Depth Dynamics Emerge
Large language models are often viewed as high-dimensional nonlinear systems and treated as black boxes. Here, we show that transformer depth dynamics admit accurate low-order linear surrogates within context. Across tasks including toxicity, irony, hate speech and sentiment, a 32-dimensional linear surrogate reproduces the layerwise sensitivity profile of GPT-2-large with near-perfect agreement, capturing how the final output shifts under additive injections at each layer. We then uncover a surprising scaling principle: for a fixed-order linear surrogate, agreement with the full model improves monotonically with model size across the GPT-2 family. This linear surrogate also enables principled multi-layer interventions that require less energy than standard heuristic schedules when applied to the full model. Together, our results reveal that as language models scale, low-order linear depth dynamics emerge within contexts, offering a systems-theoretic foundation for analyzing and controlling them.
A Lyapunov Characterization of Robust D-Stability with Application to Decentralized Integral Control of LTI Systems
The concept of matrix D-stability plays an important role in applications, ranging from economic and biological system models to decentralized control. Here we provide necessary and sufficient Lyapunov-type conditions for the robust (block) D-stability property. We leverage this characterization as part of a novel Lyapunov analysis of decentralized integral control for MIMO LTI systems, providing sufficient conditions guaranteeing stability under low-gain and under arbitrary connection and disconnection of individual control loops.
Robust Automatic Differentiation of Square-Root Kalman Filters via Gramian Differentials
Square-root Kalman filters propagate state covariances in Cholesky-factor form for numerical stability, and are a natural target for gradient-based parameter learning in state-space models. Their core operation, triangularization of a matrix $M \in \mathbb{R}^{n \times m}$, is computed via a QR decomposition in practice, but naively differentiating through it causes two problems: the semi-orthogonal factor is non-unique when $m > n$, yielding undefined gradients; and the standard Jacobian formula involves inverses, which diverges when $M$ is rank-deficient. Both are resolved by the observation that all filter outputs relevant to learning depend on the input matrix only through the Gramian $MM^\top$, so the composite loss is smooth in $M$ even where the triangularization is not. We derive a closed-form chain-rule directly from the differential of this Gramian identity, prove it exact for the Kalman log-marginal likelihood and filtered moments, and extend it to rank-deficient inputs via a two-component decomposition: a column-space term based on the Moore--Penrose pseudoinverse, and a null-space correction for perturbations outside the column space of $M$.
comment: 4 pages, documents the mathematics of a bug fix at https://github.com/state-space-models/cuthbert
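A minimal worked sketch of the Gramian idea in the full-rank case, using the standard Cholesky differential. The symbols $G$, $L$, and $\Phi$ are introduced here for illustration and are not taken from the paper; the rank-deficient extension mentioned in the abstract is not reproduced.

```latex
% Assumption: M has full row rank, so G is positive definite and L is unique.
G = M M^\top = L L^\top, \qquad L \text{ lower triangular with positive diagonal},
\\[4pt]
\mathrm{d}G = \mathrm{d}M \, M^\top + M \, \mathrm{d}M^\top,
\\[4pt]
\mathrm{d}L = L \, \Phi\!\left( L^{-1} \, \mathrm{d}G \, L^{-\top} \right),
\qquad
\Phi(X)_{ij} =
\begin{cases}
X_{ij}, & i > j, \\
\tfrac{1}{2} X_{ii}, & i = j, \\
0, & i < j.
\end{cases}
```

Because $L$ depends on $M$ only through $G$, replacing $M$ by $MQ$ for any orthogonal $Q$ leaves $L$ and $\mathrm{d}L$ unchanged, so the non-uniqueness of the semi-orthogonal QR factor does not enter this formula.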
Hybrid topology control: a dynamic leader-based distributed edge-addition and deletion mechanism
Coordinated operations of multi-robot systems (MRS) require agents to maintain communication connections to accomplish team objectives. However, maintaining the connections imposes costs in terms of restricted robot mobility, resulting in suboptimal team performance. In this work, we consider a realistic MRS framework in which agents are subject to unknown dynamical disturbances and experience communication delays. Most existing works on connectivity maintenance use consensus-based frameworks for graph reconfiguration, where decision-making time scales with the number of nodes and requires multiple rounds of communication, making them ineffective under communication delays. To address this, we propose a novel leader-based decision-making algorithm that uses a central node for efficient real-time reconfiguration, reducing decision-making time to depend on the graph diameter rather than the number of nodes and requiring only one round of information transfer through the network. We propose a novel method for estimating robot locations within the MRS that actively accounts for unknown disturbances and the communication delays. Using these position estimates, the central node selects a set of edges to delete while allowing the formation of new edges, aiming to keep the diameter of the new graph within a threshold. We provide numerous simulation results to showcase the efficacy of the proposed method.
comment: Under review
Verification and Forward Invariance of Control Barrier Functions for Differential-Algebraic Systems
Differential-algebraic equations (DAEs) arise in power networks, chemical processes, and multibody systems, where algebraic constraints encode physical conservation laws. The safety of such systems is critical, yet safe control is challenging because algebraic constraints restrict allowable state trajectories. Control barrier functions (CBFs) provide computationally efficient safety filters for ordinary differential equation (ODE) systems. However, existing CBF methods are not directly applicable to DAEs due to potential conflicts between the CBF condition and the constraint manifold. This paper introduces DAE-aware CBFs that incorporate the differential-algebraic structure through projected vector fields. We derive conditions that ensure forward invariance of safe sets while preserving algebraic constraints and extend the framework to higher-index DAEs. A systematic verification framework is developed, establishing necessary and sufficient conditions for geometric correctness and feasibility of DAE-aware CBFs. For polynomial systems, sum-of-squares certificates are provided, while for nonpolynomial and neural network candidates, satisfiability modulo theories are used for falsification. The approach is validated on wind turbine and flexible-link manipulator systems.
Safety-guaranteed and Goal-oriented Semantic Sensing, Communication, and Control for Robotics
Wirelessly connected robotic systems empower robots with real-time intelligence by leveraging remote computing resources for decision-making. However, the data exchange between robots and base stations often overwhelms communication links, introducing latency that undermines real-time response. To tackle this, goal-oriented semantic communication (GSC) has been introduced into wirelessly connected robotic systems to extract and transmit only goal-relevant semantic representations, enhancing communication efficiency and task effectiveness. However, existing GSC approaches focus primarily on optimizing effectiveness metrics while overlooking safety requirements, which should be treated as the top priority in real-world robotic systems. To bridge this gap, we propose safety-guaranteed and goal-oriented semantic communication for wirelessly connected robotic systems, aiming to maximize robotic task effectiveness subject to practical operational safety requirements. We first summarize the general safety requirements and effectiveness metrics across typical robotic tasks, including robot arm grasping, unmanned aerial vehicle (UAV)-assisted tasks, and multi-robot exploration. We then systematically analyze the unique safety and effectiveness challenges faced by wirelessly connected robotic systems in sensing, communication, and control. Based on these, we further present potential safety-guaranteed and goal-oriented sensing, communication, and control solutions. Finally, a UAV target tracking case study validates that our proposed GSC solutions can significantly improve the safety rate and the tracking success rate by more than 2 times and 4.5 times, respectively.
comment: 7 pages. This paper has been submitted to the IEEE Communications Magazine
Upper bound of transient growth in accelerating and decelerating wall-driven flows using the Lyapunov method
This work analyzes accelerating and decelerating wall-driven flows by quantifying the upper bound of transient energy growth using a Lyapunov-type approach. By formulating the linearized Navier-Stokes equations as a linear time-varying system and constructing a time-dependent Lyapunov function, we obtain an upper bound on transient energy growth by solving linear matrix inequalities. This upper bound closely matches the transient growth computed via the singular value decomposition of the state-transition matrix of the linear time-varying system. Our analysis shows that decelerating base flows exhibit significantly larger transient growth than accelerating flows. The Lyapunov method also offers the advantages of providing a certificate of uniform stability and an invariant set that bounds the solution trajectory.
comment: 6 pages, 8 figures
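One common way to state such a Lyapunov bound is sketched below. The symbols $A(t)$, $P(t)$, and $\gamma$ and the unweighted Euclidean energy norm are assumptions made here for illustration; the paper's exact formulation may differ.

```latex
% Linear time-varying system and time-dependent quadratic Lyapunov function
\dot{x}(t) = A(t)\,x(t), \qquad V(t, x) = x^\top P(t)\, x, \qquad P(t) = P(t)^\top,
\\[4pt]
% Linear matrix inequalities imposed for all t in [0, T]
I \preceq P(t) \preceq \gamma I, \qquad
\dot{P}(t) + A(t)^\top P(t) + P(t)\,A(t) \preceq 0,
\\[4pt]
% Resulting bound on the transient energy growth
\|x(t)\|^2 \le x(t)^\top P(t)\, x(t) \le x(0)^\top P(0)\, x(0) \le \gamma\, \|x(0)\|^2 .
```

Minimizing $\gamma$ subject to these inequalities gives the tightest bound of this form, and the sublevel sets of $V$ serve as the invariant set that contains the trajectory.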
Stability Analysis of Thermohaline Convection With a Time-Varying Shear Flow Using the Lyapunov Method
This work demonstrates that the Lyapunov method can effectively identify the growth rate of a linear time-periodic system describing cold fresh water on top of hot salty water with a periodically time-varying background shear flow. We employ a time-dependent weighting matrix to construct a Lyapunov function candidate, and the resulting linear matrix inequalities are discretized in time using the forward Euler method. As the number of temporal discretization points increases, the growth rate predicted by the Lyapunov method or by Floquet theory converges to the value obtained from numerical simulations. Additionally, the Lyapunov method is used to analyze the most dangerous disturbance, and we compare the computational resource usage of the Lyapunov method, numerical simulations, and Floquet theory.
comment: 6 pages, 5 figures
Generalized Group Selection Strategies for Self-sustainable RIS-aided Communication
Reconfigurable intelligent surface (RIS) is a cutting-edge communication technology that has been proposed as a viable option for beyond fifth-generation wireless communication networks. This paper investigates various group selection strategies in the context of grouping-based self-sustainable RIS-aided device-to-device (D2D) communication with spatially correlated wireless channels. Specifically, we consider both power splitting (PS) and time switching (TS) configurations of the self-sustainable RIS to analyze the system performance and propose appropriate bounds on the choice of system parameters. The analysis takes into account a simplified linear energy harvesting (EH) model as well as a practical non-linear EH model. Based on the application requirements, we propose various group selection strategies at the RIS. Notably, each strategy schedules the k-th best available group at the RIS based on the end-to-end signal-to-noise ratio (SNR) and also the energy harvested at a particular group of the RIS. Accordingly, by using tools from high order statistics, we derive analytical expressions for the outage probability of each selection strategy. Moreover, by applying tools from extreme value theory, we also investigate an asymptotic scenario in which the number of groups available for selection at an RIS approaches infinity. The nontrivial insights obtained from this approach are especially beneficial in applications such as large intelligent surface-aided wireless communication. Finally, the numerical results demonstrate the importance and benefits of the proposed approaches in terms of metrics such as the data throughput and the outage (both data and energy) performance.
comment: To appear in IEEE Transactions on Communications
Near-Optimal Low-Complexity MIMO Detection via Structured Reduced-Search Enumeration
Maximum-likelihood (ML) detection in high-order MIMO systems is computationally prohibitive due to exponential complexity in the number of transmit layers and constellation size. In this white paper, we demonstrate that for practical MIMO dimensions (up to 8x8) and modulation orders, near-ML hard-decision performance can be achieved using a structured reduced-search strategy with complexity linear in constellation size. Extensive simulations over i.i.d. Rayleigh fading channels show that list sizes of 3|X| for 3x3, 4|X| for 4x4, and 8|X| for 8x8 systems closely match full ML performance, even under high channel condition numbers, where |X| is the constellation size. In addition, we provide a trellis-based interpretation of the method. We further discuss implications for soft LLR generation and FEC interaction.
comment: 6 pages, 10 figures
Next-Generation Grid Codes: Towards a New Paradigm for Dynamic Ancillary Services
This paper introduces a conceptual foundation for Next Generation Grid Codes (NGGCs) based on stability and performance certificates, enabling the provision of dynamic ancillary services such as fast frequency and voltage regulation through decentralized frequency-domain criteria. The NGGC framework offers two key benefits: (i) rigorous closed-loop stability guarantees, and (ii) explicit performance guarantees for frequency and voltage dynamics in power systems. Regarding (i) stability, we employ loop-shifting and passivity-based techniques to derive local frequency-domain stability certificates for individual device dynamics. These certificates ensure the closed-loop stability of the entire interconnected power system through fully decentralized verification. Concerning (ii) performance, we establish quantitative bounds on critical time-domain indicators of system dynamics, including the average-mode frequency and voltage nadirs, the rate-of-change-of-frequency (RoCoF), steady-state deviations, and oscillation damping capabilities. The bounds are obtained by expressing the performance metrics as frequency-domain conditions on local device behavior. The NGGC framework is non-parametric, model-agnostic, and accommodates arbitrary device dynamics under mild assumptions. It thus provides a unified, decentralized approach to certifying both stability and performance without requiring explicit device-model parameterizations. Moreover, the NGGC framework can be directly used as a set of specifications for control design, offering a principled foundation for future stability- and performance-oriented grid codes in power systems.
comment: 13 pages, 15 figures
Cyqlone: A Parallel, High-Performance Linear Solver for Optimal Control
We present Cyqlone, a solver for linear systems with a stage-wise optimal control structure that fully exploits the various levels of parallelism available in modern hardware. Cyqlone unifies algorithms based on the sequential Riccati recursion, parallel Schur complement methods, and cyclic reduction methods, thereby minimizing the required number of floating-point operations, while allowing parallelization across a configurable number of processors. Given sufficient parallelism, the solver run time scales with the logarithm of the horizon length (in contrast to the linear scaling of sequential Riccati-based methods), enabling real-time solution of long-horizon problems. Beyond multithreading on multi-core processors, implementations of Cyqlone can also leverage vectorization using batched linear algebra routines. Such batched routines exploit data parallelism using single instruction, multiple data (SIMD) operations, and expose a higher degree of instruction-level parallelism than their non-batched counterparts. This enables them to significantly outperform BLAS and BLASFEO for the small matrices that arise in optimal control. Building on this high-performance linear solver, we develop CyQPALM, a parallel and optimal-control-specific variant of the QPALM quadratic programming solver. It combines the parallel and vectorized linear algebra operations from Cyqlone with a parallel line search and parallel factorization updates, resulting in order-of-magnitude speedups over the state-of-the-art HPIPM solver. Open-source C++ implementations of Cyqlone and CyQPALM are available at https://github.com/kul-optec/cyqlone
Reference-Free Sampling-Based Model Predictive Control
We present a sampling-based model predictive control (MPC) framework that enables emergent locomotion without relying on handcrafted gait patterns or predefined contact sequences. Our method discovers diverse motion patterns, ranging from trotting to galloping, robust standing policies, jumping, and handstand balancing, purely through the optimization of high-level objectives. Building on model predictive path integral (MPPI), we propose a cubic Hermite spline parameterization that operates on position and velocity control points. Our approach enables contact-making and contact-breaking strategies that adapt automatically to task requirements, requiring only a limited number of sampled trajectories. This sample efficiency enables real-time control on standard CPU hardware, eliminating the GPU acceleration typically required by other state-of-the-art MPPI methods. We validate our approach on the Go2 quadrupedal robot, demonstrating a range of emergent gaits and basic jumping capabilities. In simulation, we further showcase more complex behaviors, such as backflips, dynamic handstand balancing and locomotion on a Humanoid, all without requiring reference tracking or offline pre-training.
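A cubic Hermite spline over position and velocity control points can be sketched as follows. This is a minimal one-dimensional illustration, assuming uniform treatment of each segment; the function names and the knot layout are hypothetical and are not taken from the paper's implementation.

```python
import bisect


def hermite_segment(p0, v0, p1, v1, duration, t):
    """Evaluate one cubic Hermite segment at local time t in [0, duration]."""
    s = t / duration  # normalized time in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1   # weight of the start position
    h10 = s**3 - 2 * s**2 + s       # weight of the start velocity
    h01 = -2 * s**3 + 3 * s**2      # weight of the end position
    h11 = s**3 - s**2               # weight of the end velocity
    # Velocities are scaled by the segment duration because s is normalized.
    return h00 * p0 + h10 * duration * v0 + h01 * p1 + h11 * duration * v1


def hermite_spline(times, positions, velocities, t):
    """Evaluate a piecewise cubic Hermite spline defined by control points.

    times must be strictly increasing; t is clamped to [times[0], times[-1]].
    """
    t = min(max(t, times[0]), times[-1])
    # Index of the segment containing t (the last knot maps to the last segment).
    i = min(bisect.bisect_right(times, t) - 1, len(times) - 2)
    return hermite_segment(
        positions[i], velocities[i],
        positions[i + 1], velocities[i + 1],
        times[i + 1] - times[i], t - times[i],
    )


if __name__ == "__main__":
    knots = [0.0, 0.5, 1.0]
    pos = [0.0, 1.0, 0.0]
    vel = [0.0, 0.0, 0.0]
    for k in range(5):
        tau = 0.25 * k
        print(tau, hermite_spline(knots, pos, vel, tau))
```

In a sampling-based planner, each sampled trajectory would then be described by a small set of such control points rather than by one value per time step, which keeps the sampled space low-dimensional.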
Distributed State Estimation for Discrete-Time Linear Systems over Directed Graphs: A Measurement Perspective
This paper proposes a novel consensus-based distributed filter over directed graphs under the collective observability condition. The distributed filter is designed using an augmented leader-following information fusion strategy, and the gain parameter is determined exclusively using local information. Additionally, the lower bound of the fusion step number is derived to ensure that the estimation error covariance remains uniformly upper-bounded. Furthermore, the lower bounds for the convergence rates of the steady-state performance gap between the proposed filter and the centralized filter are provided as the fusion step number approaches infinity. The analysis demonstrates that the convergence rate is at least as fast as exponential convergence, provided the communication topology satisfies the spectral norm condition. Finally, the theoretical results are validated through two simulation examples.
Learnable Koopman-Enhanced Transformer-Based Time Series Forecasting with Spectral Control
This paper proposes a unified family of learnable Koopman operator parameterizations that integrate linear dynamical systems theory with modern deep learning forecasting architectures. We introduce four learnable Koopman variants (scalar-gated, per-mode gated, MLP-shaped spectral mapping, and low-rank Koopman operators) that generalize and interpolate between strictly stable Koopman operators and unconstrained linear latent dynamics. Our formulation enables explicit control over the spectrum, stability, and rank of the linear transition operator while retaining compatibility with expressive nonlinear backbones such as PatchTST, Autoformer, and Informer. We evaluate the proposed operators in a large-scale benchmark that also includes LSTM, DLinear, and simple diagonal State-Space Models (SSMs), as well as lightweight transformer variants. Experiments across multiple horizons and patch lengths show that learnable Koopman models provide a favorable bias-variance trade-off, improved conditioning, and more interpretable latent dynamics. We provide a full spectral analysis, including eigenvalue trajectories, stability envelopes, and learned spectral distributions. Our results demonstrate that learnable Koopman operators are effective, stable, and theoretically principled components for deep forecasting.
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions ICRA 2026
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: To appear at ICRA 2026
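To illustrate what a closed-form safety filter looks like, the sketch below projects a nominal action onto the half-space defined by a single CBF condition. It assumes single-integrator dynamics and a circular obstacle, both chosen here for illustration; the function name, the barrier $h(x) = \|x - c\|^2 - r^2$, and the gain `alpha` are hypothetical and do not reproduce the paper's training-time formulation.

```python
def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h . u + alpha * h >= 0.

    Assumes single-integrator dynamics x_dot = u and the barrier
    h(x) = ||x - center||^2 - radius^2 (h >= 0 means outside the obstacle).
    The quadratic program with one affine constraint has the closed-form
    solution u = u_nom + max(0, -residual) / ||a||^2 * a, with a = grad_h.
    """
    diff = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in diff) - radius**2
    a = [2.0 * d for d in diff]  # gradient of h with respect to x
    a_norm_sq = sum(ai * ai for ai in a)
    residual = sum(ai * ui for ai, ui in zip(a, u_nom)) + alpha * h
    if residual >= 0.0 or a_norm_sq == 0.0:
        # Nominal action already satisfies the condition (or gradient vanishes).
        return list(u_nom)
    scale = -residual / a_norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


if __name__ == "__main__":
    # Nominal action drives straight at the obstacle; the filter slows it down.
    print(cbf_filter([2.0, 0.0], [-10.0, 0.0], [0.0, 0.0], 1.0))
    # Nominal action moves away from the obstacle; it is left unchanged.
    print(cbf_filter([2.0, 0.0], [1.0, 0.0], [0.0, 0.0], 1.0))
```

A filter of this kind is cheap enough to apply to every action during rollouts, which is what makes filtering in training practical.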
Dual Filter: A Transformer-like Inference Architecture for Hidden Markov Models
This paper presents a mathematical framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM). Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture, in which a finite sequence of observations (tokens) is mapped to the conditional probability of the next token. Our objective is not to construct a mathematical model of a transformer. Rather, our interest lies in deriving, from first principles, transformer-like architectures that solve the prediction problem for which the transformer is designed. The proposed framework is based on an original optimal control approach, where the prediction objective (MMSE) is reformulated as an optimal control problem. An analysis of the optimal control problem is presented leading to a fixed-point equation on the space of probability measures. To solve the fixed-point equation, we introduce the dual filter, an iterative algorithm that closely parallels the architecture of decoder-only transformers. These parallels are discussed in detail along with the relationship to prior work on mathematical modeling of transformers as transport on the space of probability measures. Numerical experiments are provided to illustrate the performance of the algorithm using parameter values typical of research-scale transformer models.
comment: 50 pages, 9 figures
Optimal Control of an Epidemic with Intervention Design
This paper investigates the optimal control of an epidemic governed by a SEIR model with an operational delay in vaccination. We address the mathematical challenge of imposing hard healthcare capacity constraints (e.g., ICU limits) over an infinite time horizon. To rigorously bridge the gap between theoretical constraints and numerical tractability, we employ a variational framework based on Moreau--Yosida regularization and establish the connection between finite- and infinite-horizon solutions via $\Gamma$-convergence. The necessary conditions for optimality are derived using the Pontryagin Maximum Principle, allowing for the characterization of boundary-maintenance arcs where the optimal strategy maintains the infection level precisely at the capacity boundary. Numerical simulations illustrate these theoretical findings, quantifying the shadow prices of infection and costs associated with intervention delays.
comment: For code and computational details in Python, please refer to https://github.com/BehroozMoosavi/Codes/blob/main/Epidemic%20With%20Intervention/Epidemic.ipynb
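For readers unfamiliar with the ingredients, the sketch below shows an uncontrolled SEIR model stepped with forward Euler, together with a quadratic penalty of the Moreau-Yosida type for exceeding a capacity limit. It is an illustration under stated assumptions: the population is normalized to 1, vaccination and its delay are omitted, and all names and parameter values are hypothetical rather than taken from the paper or its notebook.

```python
def seir_step(state, beta, sigma, gamma, dt):
    """Advance the normalized SEIR state (S, E, I, R) by one forward Euler step."""
    s, e, i, r = state
    new_exposed = beta * s * i          # S -> E flow
    ds = -new_exposed
    de = new_exposed - sigma * e        # E -> I at rate sigma
    di = sigma * e - gamma * i          # I -> R at rate gamma
    dr = gamma * i
    return (s + dt * ds, e + dt * de, i + dt * di, r + dt * dr)


def simulate(state, beta, sigma, gamma, dt, steps):
    """Return the list of states visited over the given number of steps."""
    trajectory = [state]
    for _ in range(steps):
        state = seir_step(state, beta, sigma, gamma, dt)
        trajectory.append(state)
    return trajectory


def capacity_penalty(infected, i_max, eps):
    """Quadratic penalty max(0, I - i_max)^2 / (2 * eps) for the constraint I <= i_max."""
    excess = max(0.0, infected - i_max)
    return excess * excess / (2.0 * eps)


if __name__ == "__main__":
    # Illustrative parameters only.
    traj = simulate((0.99, 0.01, 0.0, 0.0), beta=0.5, sigma=0.2, gamma=0.1,
                    dt=0.1, steps=1000)
    peak = max(state[2] for state in traj)
    print("peak infected fraction:", peak)
    print("penalty at peak:", capacity_penalty(peak, i_max=0.1, eps=0.01))
```

As `eps` shrinks, the penalty punishes any violation more severely, which is how a regularized problem of this type approaches the hard-constrained one.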
DiffOPF: Diffusion Solver for Optimal Power Flow
The optimal power flow (OPF) is a multi-valued, non-convex mapping from loads to dispatch setpoints. The variability of system parameters (e.g., admittances, topology) further contributes to the multiplicity of dispatch setpoints for a given load. Existing deep learning OPF solvers are single-valued and thus fail to capture the variability of system parameters unless fully represented in the feature space, which is prohibitive. To solve this problem, we introduce a diffusion-based OPF solver, termed \textit{DiffOPF}, that treats OPF as a conditional sampling problem. The solver learns the joint distribution of loads and dispatch setpoints from operational history, and returns the marginal dispatch distributions conditioned on loads. Unlike single-valued solvers, DiffOPF enables sampling statistically credible warm starts with favorable cost and constraint satisfaction trade-offs. We explore the sample complexity of DiffOPF to ensure the OPF solution within a prescribed distance from the optimization-based solution, and verify this experimentally on power system benchmarks.
comment: 8 pages, 4 figures, 2 tables
Artificial Transmission Line Synthesis Tailored for Traveling-Wave Parametric Processes
Artificial transmission lines (ATLs) built with lumped-element inductors and capacitors form the backbone of broadband, nearly quantum-limited traveling-wave parametric amplifiers (TWPAs). When tailoring these transmission lines for parametric processes, nonlinear elements are added, typically nonlinear inductances in superconducting circuits, and energy and momentum conservation between interacting tones must be enforced through careful design of the ATL dispersion relation. However, a unified theoretical framework describing achievable dispersion relations is lacking. Here, I develop such a framework, borrowing from periodic structure theory and passive network synthesis. These complementary approaches divide the design space: periodic loading synthesis employs spatial modulation of frequency-independent components, while filter synthesis employs frequency-dependent responses in spatially-uniform components. The framework reveals fundamental constraints and enables the discovery of novel TWPA architectures. In particular, I design a kinetic inductance TWPA with a novel phase-matching architecture, and a backward-pumped Josephson TWPA exploiting an ambidextrous (i.e., right-left-handed) transmission line.
comment: 25 pages, 11 figures
Conformalized Data-Driven Reachability Analysis with PAC Guarantees
Data-driven reachability analysis computes over-approximations of reachable sets directly from noisy data. Existing deterministic methods require either known noise bounds or system-specific structural parameters such as Lipschitz constants. We propose Conformalized Data-Driven Reachability (CDDR), a framework that provides Probably Approximately Correct (PAC) coverage guarantees through the Learn Then Test (LTT) calibration procedure, requiring only that calibration and test trajectories be independently and identically distributed. CDDR is developed for three settings: linear time-invariant (LTI) systems with unknown process noise distributions, LTI systems with bounded measurement noise, and general nonlinear systems including non-Lipschitz dynamics. Experiments on a 5-dimensional LTI system under Gaussian and heavy-tailed Student-t noise and on a 2-dimensional non-Lipschitz system with fractional damping demonstrate that CDDR achieves valid coverage where deterministic methods do not provide formal guarantees. Under anisotropic noise, a normalized score function reduces the reachable set volume while preserving the PAC guarantee.
comment: Submitted to IEEE Control Systems Letters (L-CSS) with IEEE Conference on Decision and Control (CDC), 6 pages, 3 figures, 3 tables
Robotics
CRAFT: A Tendon-Driven Hand with Hybrid Hard-Soft Compliance
We introduce CRAFT hand, a tendon-driven anthropomorphic hand with hybrid hard-soft compliance for contact-rich manipulation. The design is based on a simple idea: contact is not uniform across the hand. Impacts concentrate at joints, while links carry most of the load. CRAFT places soft material at joints and keeps links rigid, and uses rolling-contact joint surfaces to keep flexion on repeatable motion paths. Fifteen motors mounted on the fingers drive the hand through tendons, keeping the form factor compact and the fingers light. In structural tests, CRAFT improves strength and endurance while maintaining comparable repeatability. In teleoperation, CRAFT improves handling of fragile and low-friction items, and the hand covers 33/33 grasps in the Feix taxonomy. The full design costs under $600 and will be released open-source with vision-based teleoperation and simulation integration. Project page: http://craft-hand.github.io/
Towards Dynamic Model Identification and Gravity Compensation for the dVRK-Si Patient Side Manipulator
The da Vinci Research Kit (dVRK) is widely used for research in robot-assisted surgery, but most modeling and control methods target the first-generation dVRK Classic. The recently introduced dVRK-Si, built from da Vinci Si hardware, features a redesigned Patient Side Manipulator (PSM) with substantially larger gravity loading, which can degrade control if unmodeled. This paper presents the first complete kinematic and dynamic modeling framework for the dVRK-Si PSM. We derive a modified DH kinematic model that captures the closed-chain parallelogram mechanism, formulate dynamics via the Euler-Lagrange method, and express inverse dynamics in a linear-in-parameters regressor form. Dynamic parameters are identified from data collected on a periodic excitation trajectory optimized for numerical conditioning and estimated by convex optimization with physical feasibility constraints. Using the identified model, we implement real-time gravity compensation and computed-torque feedforward in the dVRK control stack. Experiments on a physical dVRK-Si show that the gravity compensation reduces steady-state joint errors by 68-84% and decreases end-effector tip drift during static holds from 4.2 mm to 0.7 mm. Computed-torque feedforward further improves transient and position tracking accuracy. For sinusoidal trajectory tracking, computed-torque feedforward reduces position errors by 35% versus gravity-only feedforward and by 40% versus PID-only. The proposed pipeline supports reliable control, high-fidelity simulation, and learning-based automation on the dVRK-Si.
comment: Submitted to IEEE Transactions on Medical Robotics and Bionics (T-MRB), under review. Open-source GitHub Repo: https://github.com/jhu-dvrk/dvrk_psm_dynamics_identification
Towards Universal Computational Aberration Correction in Photographic Cameras: A Comprehensive Benchmark Analysis CVPR 2026
Prevalent Computational Aberration Correction (CAC) methods are typically tailored to specific optical systems, leading to poor generalization and labor-intensive re-training for new lenses. Developing CAC paradigms capable of generalizing across diverse photographic lenses offers a promising solution to these challenges. However, efforts to achieve such cross-lens universality within consumer photography are still in their early stages due to the lack of a comprehensive benchmark that encompasses a sufficiently wide range of optical aberrations. Furthermore, it remains unclear which specific factors influence existing CAC methods and how these factors affect their performance. In this paper, we present comprehensive experiments and evaluations involving 24 image restoration and CAC algorithms, utilizing our newly proposed UniCAC, a large-scale benchmark for photographic cameras constructed via automatic optical design. The Optical Degradation Evaluator (ODE) is introduced as a novel framework to objectively assess the difficulty of CAC tasks, offering credible quantification of optical aberrations and enabling reliable evaluation. Drawing on our comparative analysis, we identify three key factors -- prior utilization, network architecture, and training strategy -- that most significantly influence CAC performance, and further investigate their respective effects. We believe that our benchmark, dataset, and observations contribute foundational insights to related areas and lay the groundwork for future investigations. Benchmarks, codes, and Zemax files will be available at https://github.com/XiaolongQian/UniCAC.
comment: Accepted to CVPR 2026. Benchmarks, codes, and Zemax files will be available at https://github.com/XiaolongQian/UniCAC
Decentralized Cooperative Localization for Multi-Robot Systems with Asynchronous Sensor Fusion
Decentralized cooperative localization (DCL) is a promising approach for nonholonomic mobile robots operating in GPS-denied environments with limited communication infrastructure. This paper presents a DCL framework in which each robot performs localization locally using an Extended Kalman Filter, while sharing measurement information during update stages only when communication links are available and companion robots are successfully detected by LiDAR. The framework preserves cross-correlation consistency among robot state estimates while handling asynchronous sensor data with heterogeneous sampling rates and accommodating accelerations during dynamic maneuvers. Unlike methods that require pre-aligned coordinate systems, the proposed approach allows robots to initialize with arbitrary reference-frame orientations and achieves automatic alignment through transformation matrices in both the prediction and update stages. To improve robustness in feature-sparse environments, we introduce a dual-landmark evaluation framework that exploits both static environmental features and mobile robots as dynamic landmarks. The proposed framework enables reliable detection and feature extraction during sharp turns, while prediction accuracy is improved through information sharing from mutual observations. Experimental results in both Gazebo simulation and real-world basement environments show that DCL outperforms centralized cooperative localization (CCL), achieving a 34% reduction in RMSE, while the dual-landmark variant yields an improvement of 56%. These results demonstrate the applicability of DCL to challenging domains such as enclosed spaces, underwater environments, and feature-sparse terrains where conventional localization methods are ineffective.
comment: Presented at the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)
Flight through Narrow Gaps with Morphing-Wing Drones
The size of a narrow gap traversable by a fixed-wing drone is limited by its wingspan. Inspired by birds, here, we enable the traversal of a gap of sub-wingspan width and height using a morphing-wing drone capable of temporarily sweeping in its wings mid-flight. This maneuver poses control challenges due to sudden lift loss during gap-passage at low flight speeds and the need for precisely timed wing-sweep actuation ahead of the gap. To address these challenges, we first develop an aerodynamic model for general wing-sweep morphing drone flight including low flight speeds and post-stall angles of attack. We integrate longitudinal drone dynamics into an optimal reference trajectory generation and Nonlinear Model Predictive Control framework with runtime adaptive costs and constraints. Validated on a 130 g wing-sweep-morphing drone, our method achieves an average altitude error of 5 cm during narrow-gap passage at forward speeds between 5 and 7 m/s, whilst enforcing fully swept wings near the gap across variable threshold distances. Trajectory analysis shows that the drone can compensate for lift loss during gap-passage by accelerating and pitching upwards ahead of the gap to an extent that differs between reference trajectory optimization objectives. We show that our strategy also allows for accurate gap passage on hardware whilst maintaining a constant forward flight speed reference and near-constant altitude.
Sim-to-reality adaptation for Deep Reinforcement Learning applied to an underwater docking application IROS 2026
Deep Reinforcement Learning (DRL) offers a robust alternative to traditional control methods for autonomous underwater docking, particularly in adapting to unpredictable environmental conditions. However, bridging the "sim-to-real" gap and managing high training latencies remain significant bottlenecks for practical deployment. This paper presents a systematic approach for autonomous docking using the Girona Autonomous Underwater Vehicle (AUV) by leveraging a high-fidelity digital twin environment. We adapted the Stonefish simulator into a multiprocessing RL framework to significantly accelerate the learning process while incorporating realistic AUV dynamics, collision models, and sensor noise. Using the Proximal Policy Optimization (PPO) algorithm, we developed a 6-DoF control policy trained in a headless environment with randomized starting positions to ensure generalized performance. Our reward structure accounts for distance, orientation, action smoothness, and adaptive collision penalties to facilitate soft docking. Experimental results demonstrate that the agent achieved a success rate of over 90% in simulation. Furthermore, successful validation in a physical test tank confirmed the efficacy of the sim-to-reality adaptation, with the DRL controller exhibiting emergent behaviors such as pitch-based braking and yaw oscillations to assist in mechanical alignment.
comment: Currently under review by IROS 2026
Learning Visuomotor Policy for Multi-Robot Laser Tag Game
In this paper, we study multi-robot laser tag, a simplified yet practical shooting-game-style task. Classic modular approaches to this task face challenges such as limited observability and reliance on depth mapping and inter-robot communication. To overcome these issues, we present an end-to-end visuomotor policy that maps images directly to robot actions. We train a high-performing teacher policy with multi-agent reinforcement learning and distill its knowledge into a vision-based student policy. Technical designs, including a permutation-invariant feature extractor and depth heatmap input, improve performance over standard architectures. Our policy outperforms classic methods by 16.7% in hitting accuracy and 6% in collision avoidance, and is successfully deployed on real robots. Code will be released publicly.
Energy Prediction on Sloping Ground for Quadruped Robots
Energy management is a fundamental challenge for legged robots in outdoor environments. Endurance directly constrains mission success, while efficient resource use reduces ecological impact. This paper investigates how terrain slope and heading orientation influence the energetic cost of quadruped locomotion. We introduce a simple energy model that relies solely on standard onboard sensors, avoids specialized instrumentation, and remains applicable in previously unexplored environments. The model is identified from field runs on a commercial quadruped and expressed as a compact function of slope angle and heading. Field validation on natural terrain shows near-linear trends of force-equivalent cost with slope angle, consistently higher lateral costs, and additive behavior across trajectory segments, supporting path-level energy prediction for planning-oriented evaluation.
comment: Presented at 3D-Advice (Advanced 3D Vision for Complex Environments) Workshop, ECMR 2025
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset IROS
The acquisition of large-scale physical interaction data, a critical prerequisite for modern robot learning, is severely bottlenecked by the prohibitive cost and scalability limits of human-in-the-loop collection paradigms. To break this barrier, we introduce Robust Autonomous Data Acquisition for Robotics (RADAR), a fully autonomous, closed-loop data generation engine that completely removes human intervention from the collection cycle. RADAR elegantly divides the cognitive load into a four-module pipeline. Anchored by 2-5 3D human demonstrations as geometric priors, a Vision-Language Model first orchestrates scene-relevant task generation via precise semantic object grounding and skill retrieval. Next, a Graph Neural Network policy translates these subtasks into physical actions via in-context imitation learning. Following execution, the VLM performs automated success evaluation using a structured Visual Question Answering pipeline. Finally, to shatter the bottleneck of manual resets, a Finite State Machine orchestrates an autonomous environment reset and asymmetric data routing mechanism. Driven by simultaneous forward-reverse planning with a strict Last-In, First-Out causal sequence, the system seamlessly restores unstructured workspaces and robustly recovers from execution failures. This continuous brain-cerebellum synergy transforms data collection into a self-sustaining process. Extensive evaluations highlight RADAR's exceptional versatility. In simulation, our framework achieves up to 90% success rates on complex, long-horizon tasks, effortlessly solving challenges where traditional baselines plummet to near-zero performance. In real-world deployments, the system reliably executes diverse, contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation without domain-specific fine-tuning, providing a highly scalable paradigm for robotic data acquisition.
comment: 8 pages, 4 figures. Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
HiSync: Spatio-Temporally Aligning Hand Motion from Wearable IMU and On-Robot Camera for Command Source Identification in Long-Range HRI
Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI) - determining who issued a command - is especially challenging due to multi-user and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as a binding cue by aligning robot-mounted camera optical flow with hand-worn IMU signals. We first elicit a user-defined (N=12) gesture set and collect a multimodal command gesture dataset (N=38) in long-range multi-user HRI scenarios. Next, HiSync extracts frequency-domain hand motion features from both camera and IMU data, and a learned CSINet denoises IMU readings, temporally aligns modalities, and performs distance-aware multi-window fusion to compute cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes up to 34 m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated on real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI.
Adapting Dijkstra for Buffers and Unlimited Transfers
In recent years, RAPTOR-based algorithms have been considered the state of the art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on the London and Switzerland networks show that TAD achieves a speed-up of more than 2x over MR while producing optimal results on both networks, with and without buffer times.
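The buffer-time issue described above can be illustrated with a minimal sketch. It is not the authors' TAD implementation: the function name `earliest_arrival`, the timetable layout, the absence of walking transfers, and the convention that no buffer is charged at the origin are all assumptions made for illustration. The idea shown is only that a passenger who boards a trip is relaxed along the whole remaining trip, so the buffer applies when changing vehicles but not while staying seated.

```python
import heapq

def earliest_arrival(trips, buffer, source, t0):
    """Trip-scanning Dijkstra sketch: earliest arrival at every reachable stop.

    trips:  list of trips, each a list of (stop, arrival, departure) in order
    buffer: dict stop -> minimum change time required before boarding there
    """
    arr = {source: t0}
    queue = [(t0, source)]
    while queue:
        t, s = heapq.heappop(queue)
        if t > arr.get(s, float("inf")):
            continue  # stale queue entry
        # Assumed convention: no buffer is charged at the journey's origin.
        ready = t if s == source else t + buffer.get(s, 0)
        for trip in trips:  # brute-force scan; a real implementation would index trips by stop
            for i, (stop, _, dep) in enumerate(trip):
                if stop != s or dep < ready:
                    continue
                # Boarded here: stay seated and relax every later stop of the trip.
                for nxt, nxt_arr, _ in trip[i + 1:]:
                    if nxt_arr < arr.get(nxt, float("inf")):
                        arr[nxt] = nxt_arr
                        heapq.heappush(queue, (nxt_arr, nxt))
    return arr

# Hypothetical timetable: trip A runs X -> Y -> Z, trip B runs Y -> Z.
trips = [
    [("X", 0, 0), ("Y", 10, 12), ("Z", 20, 20)],  # trip A
    [("Y", 13, 13), ("Z", 15, 15)],               # trip B
]

with_buffer = earliest_arrival(trips, {"Y": 5}, "X", 0)
without_buffer = earliest_arrival(trips, {}, "X", 0)
```

In this toy timetable, trip A's Y-to-Z leg (departs 12, arrives 20) is dominated by trip B (departs 13, arrives 15) and would be filtered out edge-wise. With a 5-minute buffer at Y, a passenger arriving on trip A at time 10 cannot board trip B, so staying seated on trip A is the only way to reach Z, which the trip scan preserves.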
Coupling Tensor Trains with Graph of Convex Sets: Effective Compression, Exploration, and Planning in the C-Space ICRA2026
We present TANGO (Tensor ANd Graph Optimization), a novel motion planning framework that integrates tensor-based compression with structured graph optimization to enable efficient and scalable trajectory generation. While optimization-based planners such as the Graph of Convex Sets (GCS) offer powerful tools for generating smooth, optimal trajectories, they typically rely on a predefined convex characterization of the high-dimensional configuration space, a requirement that is often intractable for general robotic tasks. TANGO builds further by using Tensor Train decomposition to approximate the feasible configuration space in a compressed form, enabling rapid discovery and estimation of task-relevant regions. These regions are then embedded into a GCS-like structure, allowing for geometry-aware motion planning that respects both system constraints and environmental complexity. By coupling tensor-based compression with structured graph reasoning, TANGO enables efficient, geometry-aware motion planning and lays the groundwork for more expressive and scalable representations of configuration space in future robotic systems. Rigorous simulation studies on planar and real robots reinforce our claims of effective compression and higher quality trajectories.
comment: 8 pages, 10 figures, accepted paper for ICRA2026
Concurrent Prehensile and Nonprehensile Manipulation: A Practical Approach to Multi-Stage Dexterous Tasks
Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize across varied object geometries and placements. We present DexMulti, a sample-efficient approach for real-world dexterous multi-task manipulation that decomposes demonstrations into object-centric skills with well-defined temporal boundaries. Rather than learning monolithic policies, our method retrieves demonstrated skills based on current object geometry, aligns them to the observed object state using an uncertainty-aware estimator that tracks centroid and yaw, and executes them via a retrieve-align-execute paradigm. We evaluate on three multi-stage tasks requiring concurrent manipulation (Grasp + Pull, Grasp + Open, and Grasp + Grasp) across two dexterous hands (Allegro and LEAP) in over 1,000 real-world trials. Our approach achieves an average success rate of 66% on training objects with only 3-4 demonstrations per object, outperforming diffusion policy baselines by 2-3x while requiring far fewer demonstrations. Results demonstrate robust generalization to held-out objects and spatial variations up to +/-25 cm.
Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning
Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in open-ended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across three models and five challenging lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.
A Hybrid Neural-Assisted Unscented Kalman Filter for Unmanned Ground Vehicle Navigation
Modern autonomous navigation for unmanned ground vehicles relies on different estimators to fuse inertial sensors and GNSS measurements. However, constant noise covariance matrices often struggle to account for dynamic real-world conditions. In this work, we propose a hybrid estimation framework that bridges classical state estimation foundations with modern deep learning approaches. Instead of altering the fundamental unscented Kalman filter equations, a dedicated deep neural network is developed to predict the process and measurement noise uncertainty directly from raw inertial and GNSS measurements. We present a sim2real approach, with training performed only on simulated data. This provides perfect ground-truth data and relieves the burden of extensive data recordings. To evaluate our proposed approach and examine its generalization capabilities, we employed a 160-minute test set from three datasets, each with different types of vehicles (off-road vehicle, passenger car, and mobile robot), inertial sensors, road surfaces, and environmental conditions. We demonstrate across the three datasets a position improvement of 12.7% compared to the adaptive model-based approach. The approach thus offers a scalable and more robust solution for unmanned ground vehicle navigation tasks.
Chunk-Boundary Artifact in Action-Chunked Generative Policies: A Noise-Sensitive Failure Mechanism
Action chunking has become a central design choice for generative visuomotor policies, yet the execution discontinuities that arise at chunk boundaries remain poorly understood. In a frozen pretrained action-chunked policy, we identify chunk-boundary artifact as a noise-sensitive failure mechanism. First, artifact is strongly associated with task failure (p < 1e-4, permutation test) and emerges during the rollout rather than only as a post-hoc symptom. Second, under a fixed observation context, changing only latent noise systematically modulates artifact magnitude. Third, by identifying artifact-related directions in noise space and applying trajectory-level steering, we reliably alter artifact magnitude across all evaluated tasks. In hard-task settings with remaining outcome headroom, the success/failure distribution shifts accordingly; on near-ceiling tasks, positive gains are compressed by policy saturation, while the negative causal effect remains visible. Overall, we recast boundary discontinuity from an unavoidable execution nuisance into an analyzable, noise-dominated, and intervenable failure mechanism.
comment: 13 pages, 5 figures
Learn Structure, Adapt on the Fly: Multi-Scale Residual Learning and Online Adaptation for Aerial Manipulators
Autonomous Aerial Manipulators (AAMs) are inherently coupled, nonlinear systems that exhibit nonstationary and multiscale residual dynamics, particularly during manipulator reconfiguration and abrupt payload variations. Conventional analytical dynamic models rely on fixed parametric structures, while static data-driven models assume stationary dynamics and degrade under configuration changes and payload variations. Moreover, existing learning architectures do not explicitly factorize cross-variable coupling and multi-scale temporal effects, conflating instantaneous inertial dynamics with long-horizon regime evolution. We propose a predictive-adaptive framework for real-time residual modeling and compensation in AAMs. The core of this framework is the Factorized Dynamics Transformer (FDT), which treats physical variables as independent tokens. This design enables explicit cross-variable attention while structurally separating short-horizon inertial dependencies from long-horizon aerodynamic effects. To address deployment-time distribution shifts, a Latent Residual Adapter (LRA) performs rapid linear adaptation in the latent space via Recursive Least Squares, preserving the offline nonlinear representation without prohibitive computational overhead. The adapted residual forecast is directly integrated into a residual-compensated adaptive controller. Real-world experiments on an aerial manipulator subjected to unseen payloads demonstrate higher prediction fidelity, accelerated disturbance attenuation, and superior closed-loop tracking precision compared to state-of-the-art learning baselines, all while maintaining strict real-time feasibility.
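The Recursive Least Squares step mentioned above can be sketched as follows. This is the textbook RLS recursion, not the paper's Latent Residual Adapter: the two-dimensional feature vectors stand in for hypothetical latent features, and the names `rls_update`, `lam`, and the initial covariance scale are assumptions for illustration.

```python
def rls_update(w, P, x, y, lam=1.0):
    """One Recursive Least Squares step for a linear map y ~ w . x.

    w: current weights, P: current (symmetric) inverse-covariance estimate,
    x: feature vector, y: observed target, lam: forgetting factor in (0, 1].
    """
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    k = [Px[i] / denom for i in range(n)]           # gain vector
    err = y - sum(w[i] * x[i] for i in range(n))    # a-priori prediction error
    w_new = [w[i] + k[i] * err for i in range(n)]
    # P <- (P - k (P x)^T) / lam, using symmetry of P so that x^T P = (P x)^T
    P_new = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(n)] for i in range(n)]
    return w_new, P_new

# Toy stream: targets generated by hypothetical true weights [2, -1] without noise.
true_w = [2.0, -1.0]
w = [0.0, 0.0]
P = [[1000.0, 0.0], [0.0, 1000.0]]  # large initial P means a weak prior on w
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]):
    y = sum(tw * xi for tw, xi in zip(true_w, x))
    w, P = rls_update(w, P, x, y)
```

Each update costs a few vector operations on the size of the adapted layer, which is consistent with the abstract's point that linear adaptation in a fixed latent space is cheap enough for online use.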
Diversity You Can Actually Measure: A Fast, Model-Free Diversity Metric for Robotics Datasets
Robotics datasets for imitation learning typically consist of long-horizon trajectories of different lengths over states, actions, and high-dimensional observations (e.g., RGB video), making it non-trivial to quantify diversity in a way that respects the underlying trajectory structure and geometry. We extend Shannon and von Neumann entropy to this setting by defining signature transform-based entropy on the Gram matrix of a signature kernel over demonstrations, yielding entropy and diversity metrics that operate directly on the demonstration dataset. Building on these metrics, we study how dataset diversity affects generalization performance in robot imitation learning and propose a simple, model-free way to curate diverse demonstrations. We introduce FAKTUAL (FAst trajectory Kernel enTropy cUration for imitation Learning), a data curation algorithm that selects a subset of demonstrations maximizing entropy given a subset-size budget. FAKTUAL is fully model-free, requires no access to the imitation policy or rollouts, and adds negligible overhead relative to policy training. We evaluate our approach on image and state-based RoboMimic and MetaWorld benchmarks, as well as four real-world manipulation tasks. Across tasks and architectures, diversity-aware curation with FAKTUAL consistently improves downstream success rates over random selection, while being substantially more computationally efficient compared to recent robot data curation methods. Our results suggest that the entropy of demonstration datasets is a practical tool for understanding and improving dataset diversity in robot imitation learning.
From Pets to Robots: MojiKit as a Data-Informed Toolkit for Affective HRI Design
Designing affective behaviors for animal-inspired social robots often relies on intuition and personal experience, leading to fragmented outcomes. To provide more systematic guidance, we first coded and analyzed human-pet interaction videos, validated insights through literature and interviews, and created structured reference cards that map the design space of pet-inspired affective interactions. Building on this, we developed MojiKit, a toolkit combining reference cards, a zoomorphic robot prototype (MomoBot), and a behavior control studio. We evaluated MojiKit in co-creation workshops with 18 participants, finding that MojiKit helped them design 35 affective interaction patterns beyond their own pet experiences, while the code-free studio lowered the technical barrier and enhanced creative agency. Our contributions include the data-informed structured resource for pet-inspired affective HRI design, an integrated toolkit that bridges reference materials with hands-on prototyping, and empirical evidence showing how MojiKit empowers users to systematically create richer, more diverse affective robot behaviors.
comment: 25 pages, 11 figures, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Unsupervised LiDAR-Based Multi-UAV Detection and Tracking Under Extreme Sparsity ICMR
Non-repetitive solid-state LiDAR scanning leads to an extremely sparse measurement regime for detecting airborne UAVs: a small quadrotor at 10-25 m typically produces only 1-2 returns per scan, which is far below the point densities assumed by most existing detection approaches and inadequate for robust multi-target data association. We introduce an unsupervised, LiDAR-only pipeline that addresses both detection and tracking without the need for labeled training data. The detector integrates range-adaptive DBSCAN clustering with a three-stage temporal consistency check and is benchmarked on real-world air-to-air flight data under eight different parameter configurations. The best setup attains 0.891 precision, 0.804 recall, and 0.63 m RMSE, and a systematic minPts sweep verifies that most scans contain at most 1-2 target points, directly quantifying the sparsity regime. For multi-target tracking, we compare deterministic Hungarian assignment with joint probabilistic data association (JPDA), each coupled with Interacting Multiple Model filtering, in four simulated scenarios with increasing levels of ambiguity. JPDA cuts identity switches by 64% with negligible impact on MOTA, demonstrating that probabilistic association is advantageous when UAV trajectories approach one another closely. A two-environment evaluation strategy, combining real-world detection with RTK-GPS ground truth and simulation-based tracking with identity-annotated ground truth, overcomes the limitations of GNSS-only evaluation at inter-UAV distances below 2 m.
comment: Presented at the International Conference on Mechatronics and Robotics Engineering (ICMRE2026). To appear in IEEE conference proceedings
SVLL: Staged Vision-Language Learning for Physically Grounded Embodied Task Planning
Embodied task planning requires vision-language models to generate action sequences that are both visually grounded and causally coherent over time. However, existing training paradigms face a critical trade-off: joint end-to-end training often leads to premature temporal binding, while standard reinforcement learning methods suffer from optimization instability. To bridge this gap, we present Staged Vision-Language Learning (SVLL), a unified three-stage framework for robust, physically-grounded embodied planning. In the first two stages, SVLL decouples spatial grounding from temporal reasoning, establishing robust visual dependency before introducing sequential action history. In the final stage, we identify a key limitation of standard Direct Preference Optimization (DPO): its purely relative nature, which optimizes only the preference gap between winning and losing trajectories while neglecting absolute likelihood constraints on the optimal path, often yields unsafe or hallucinated behaviors. To address this, we further introduce Bias-DPO, a novel alignment objective that injects an inductive bias toward expert trajectories by explicitly maximizing likelihood on ground-truth actions while penalizing overconfident hallucinations. By anchoring the policy to the expert manifold and mitigating causal misalignment, SVLL, powered by Bias-DPO, ensures strict adherence to environmental affordances and effectively suppresses physically impossible shortcuts. Finally, extensive experiments on the interactive AI2-THOR benchmark and real-world robotic deployments demonstrate that SVLL outperforms both state-of-the-art open-source (e.g., Qwen2.5-VL-7B) and closed-source models (e.g., GPT-4o, Gemini-2.0-flash) in task success rate, while significantly reducing physical constraint violations.
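The "purely relative" property of standard DPO noted above can be checked numerically with a small sketch. It implements only the standard per-pair DPO loss, not Bias-DPO, whose exact form is not given in the abstract; the function name `dpo_loss`, the log-probability values, and `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = softplus(-margin), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy log-probs for the winning and losing trajectory (hypothetical values).
base = dpo_loss(-2.0, -5.0, ref_logp_w=-3.0, ref_logp_l=-4.0)

# Make BOTH trajectories far less likely under the policy by the same amount:
# the preference gap is unchanged, so the loss cannot tell the difference.
shifted = dpo_loss(-2.0 - 10.0, -5.0 - 10.0, ref_logp_w=-3.0, ref_logp_l=-4.0)

same_loss = abs(base - shifted) < 1e-12
```

Because the loss depends only on the difference between the two log-ratio terms, a policy can lower the absolute likelihood of the preferred trajectory without penalty, which is the gap an absolute likelihood term on ground-truth actions is meant to close.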
RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
Vision-Language-Action (VLA) systems have shown strong potential for language-driven robotic manipulation. However, scaling them to long-horizon tasks remains challenging. Existing pipelines typically separate data collection, policy learning, and deployment, resulting in heavy reliance on manual environment resets and brittle multi-policy execution. We present RoboClaw, an agentic robotics framework that unifies data collection, policy learning, and task execution under a single VLM-driven controller. At the policy level, RoboClaw introduces Entangled Action Pairs (EAP), which couple forward manipulation behaviors with inverse recovery actions to form self-resetting loops for autonomous data collection. This mechanism enables continuous on-policy data acquisition and iterative policy refinement with minimal human intervention. During deployment, the same agent performs high-level reasoning and dynamically orchestrates learned policy primitives to accomplish long-horizon tasks. By maintaining consistent contextual semantics across collection and execution, RoboClaw reduces mismatch between the two phases and improves multi-policy robustness. Experiments in real-world manipulation tasks demonstrate improved stability and scalability compared to conventional open-loop pipelines, while significantly reducing human effort throughout the robot lifecycle, achieving a 25% improvement in success rate over baseline methods on long-horizon tasks and reducing human time investment by 53.7%.
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.
MiNI-Q: A Miniature, Wire-Free Quadruped with Unbounded, Independently Actuated Leg Joints
Physical joint limits are common in legged robots and can restrict workspace, constrain gait design, and increase the risk of hardware damage. This paper introduces MiNI-Q^2, a miniature, wire-free quadruped robot with independently actuated, mechanically unbounded 2-DOF leg joints. We present the mechanical design, kinematic analysis, and experimental validation of the proposed robot. The leg mechanism enables both oscillatory gaits and rotary locomotion while allowing the robot to fold to a minimum height of 2.5 cm. Experimentally, MiNI-Q achieves speeds up to 0.46 m/s and demonstrates low-clearance crawling, stair climbing, inverted locomotion, jumping, and backflipping. The wire-free architecture extends our previous Q8bot design, improving assembly reliability at miniature scale. All mechanical and electrical design files are released open source to support reproducibility and further research.
comment: 7 pages, 11 figures. Submitted to the IEEE RAS Conference on Ubiquitous Robots (UR 2026)
SPARK: Skeleton-Parameter Aligned Retargeting on Humanoid Robots with Kinodynamic Trajectory Optimization
Human motion provides rich priors for training general-purpose humanoid control policies, but raw demonstrations are often incompatible with a robot's kinematics and dynamics, limiting their direct use. We present a two-stage pipeline for generating natural and dynamically feasible motion references from task-space human data. First, we convert human motion into a unified robot description format (URDF)-based skeleton representation and calibrate it to the target humanoid's dimensions. By aligning the underlying skeleton structure rather than heuristically modifying task-space targets, this step significantly reduces inverse kinematics error and tuning effort. Second, we refine the retargeted trajectories through progressive kinodynamic trajectory optimization (TO), solved in three stages: kinematic TO, inverse dynamics, and full kinodynamic TO, each warm-started from the previous solution. The final result yields dynamically consistent state trajectories and joint torque profiles, providing high-quality references for learning-based controllers. Together, skeleton calibration and kinodynamic TO enable the generation of natural, physically consistent motion references across diverse humanoid platforms.
NFPO: Stabilized Policy Optimization of Normalizing Flow for Robotic Policy Learning
Deep Reinforcement Learning (DRL) has experienced significant advancements in recent years and has been widely used in many fields. In DRL-based robotic policy learning, however, the current de facto policy parameterization is still a multivariate Gaussian (with a diagonal covariance matrix), which cannot model multi-modal distributions. In this work, we explore the adoption of a modern network architecture, i.e., Normalizing Flow (NF), as the policy parameterization for its multi-modal modeling ability, closed-form log probability, and low computation and memory overhead. However, naively training NF in online Reinforcement Learning (RL) usually leads to training instability. We provide a detailed analysis of this phenomenon and successfully address it via a simple but effective technique. With extensive experiments in multiple simulation environments, we show that our method, NFPO, obtains robust and strong performance in widely used robotic learning tasks and successfully transfers to real-world robots.
CoViLLM: An Adaptive Human-Robot Collaborative Assembly Framework Using Large Language Models for Manufacturing
With increasing demand for mass customization, traditional manufacturing robots that rely on rule-based operations lack the flexibility to accommodate customized or new product variants. Human-Robot Collaboration (HRC) has demonstrated potential to improve system adaptability by leveraging human versatility and decision-making capabilities. However, existing HRC frameworks typically depend on predefined perception-manipulation pipelines, limiting their ability to autonomously generate task plans for new product assembly. In this work, we propose CoViLLM, an adaptive human-robot collaborative assembly framework that supports the assembly of customized and previously unseen products. CoViLLM combines depth-camera-based localization for object position estimation, human operator classification for identifying new components, and a Large Language Model (LLM) for assembly task planning based on natural language instructions. The framework is validated on the NIST Assembly Task Board for known, customized, and new product cases. Experimental results show that the proposed framework enables flexible collaborative assembly by extending HRC beyond predefined product and task settings.
comment: 7 pages, 7 figures. Accepted to ASME MSEC 2026
Enhancing Lightweight Vision Language Models through Group Competitive Learning for Socially Compliant Navigation
Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up Vision Language Models (VLMs) generally improves reasoning and decision-making capabilities for socially compliant navigation. However, increased model size incurs substantial computational overhead, limiting suitability for real-time robotic deployment. Conversely, lightweight VLMs enable efficient inference but often exhibit weaker reasoning and decision-making performance in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To bridge this gap, we propose Group Competitive Learning (GCL), a strategy designed to amplify the capabilities of lightweight VLMs. Our strategy introduces the Group Competitive Objective (GCO) to harmonize global semantics with distributional regularization, alongside Asymmetric Group Optimization (AGO) to explore the upper limits of model performance. Empirical evaluations on social navigation benchmarks demonstrate that GCL significantly elevates VLM performance. Specifically, GCL enables the Qwen2.5-VL-3B learner model and the Qwen3-VL-4B guide model to achieve F1 scores of 0.968 and 0.914, respectively, representing 40% and 12% improvements over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT, the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755). However, with GCL, the 3B model outperforms the 8B baseline model by 28%. These results suggest that GCL provides an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.
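The reported percentages appear to be relative improvements in F1, which can be checked from the numbers quoted in the abstract. The short sketch below does that arithmetic; it assumes that "the 3B model" under vanilla SFT (F1 0.692) is the same Qwen2.5-VL-3B learner that reaches 0.968 with GCL. The helper name `relative_improvement` is illustrative only.

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative gain of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


# F1 scores quoted in the abstract.
f1_3b_gcl = 0.968  # Qwen2.5-VL-3B learner trained with GCL
f1_3b_sft = 0.692  # 3B model under vanilla SFT (assumed to be the same learner)
f1_8b_sft = 0.755  # 8B model under vanilla SFT

gain_over_own_sft = relative_improvement(f1_3b_gcl, f1_3b_sft)
gain_over_8b_sft = relative_improvement(f1_3b_gcl, f1_8b_sft)

print(f"3B GCL vs 3B SFT: {gain_over_own_sft:.1f}%")
print(f"3B GCL vs 8B SFT: {gain_over_8b_sft:.1f}%")
```

Under that assumption, the two values round to the 40% and 28% figures stated in the abstract.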
A Generalized Theory of Load Distribution in Redundantly-actuated Robotic Systems
This paper presents a generalized theory which describes how applied loads are distributed within rigid bodies handled by redundantly-actuated robotic systems composed of multiple independent closed-loop kinematic chains. The theory fully characterizes the feasible set of manipulating wrench distributions for a given resultant wrench applied to the rigid body and has important implications for the force-control of multifingered grippers, legged robots, cooperating robots, and other overconstrained mechanisms. We also derive explicit solutions to the wrench synthesis and wrench analysis problems. These solutions are computationally efficient and scale linearly with the number of applied wrenches, requiring neither numerical methods nor the inversion of large matrices. Finally, we identify significant shortcomings in current state-of-the-art approaches and propose corrections. These are supported by illustrative examples that demonstrate the advantages of the improved methods.
comment: 20 pages, 11 figures. Submitted to The International Journal of Robotics Research
Grounding Robot Generalization in Training Data via Retrieval-Augmented VLMs
Recent work on robot manipulation has advanced policy generalization to novel scenarios. However, it is often difficult to characterize how different evaluation settings actually represent generalization from the training distribution of a given policy. To work towards more precise evaluation of generalization in robotics, we propose RADAR, a scalable framework for directly comparing test-time evaluation tasks to policy training data, to determine what form of policy generalization is required. RADAR consists of a two-stage pipeline: first, retrieval using generalist policy embeddings identifies which training examples are relevant for a given evaluation task. Next, vision-language models (VLMs) analyze the evaluation task against the retrieved data, outputting interpretable analysis on how they compare along a variety of axes, and an overall classification of what type of policy generalization is required. Through controlled experiments, we demonstrate that VLMs are effective at analyzing data for generalization, and that our retrieval step effectively identifies examples needed to make accurate classifications with respect to the training data. Furthermore, we scale RADAR to large-scale datasets, where we observe agreement with human-defined benchmark conditions from prior work. We provide demonstrations at radar-analysis.github.io.
comment: 12 pages
Real-time Rendering-based Surgical Instrument Tracking via Evolutionary Optimization
Accurate and efficient tracking of surgical instruments is fundamental for Robot-Assisted Minimally Invasive Surgery. Although vision-based robot pose estimation has enabled markerless calibration without tedious physical setups, reliable tool tracking for surgical robots still remains challenging due to partial visibility and specialized articulation design of surgical instruments. Previous works in the field are usually prone to unreliable feature detections under degraded visual quality and data scarcity, whereas rendering-based methods often struggle with computational costs and suboptimal convergence. In this work, we incorporate CMA-ES, an evolutionary optimization strategy, into a versatile tracking pipeline that jointly estimates surgical instrument pose and joint configurations. Using batch rendering to efficiently evaluate multiple pose candidates in parallel, the method significantly reduces inference time and improves convergence robustness. The proposed framework further generalizes to joint angle-free and bi-manual tracking settings, making it suitable for both vision feedback control and online surgery video calibration. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method significantly outperforms prior approaches in both accuracy and runtime.
Deployment-Time Reliability of Learned Robot Policies
Recent advances in learning-based robot manipulation have produced policies with remarkable capabilities. Yet, reliability at deployment remains a fundamental barrier to real-world use, where distribution shift, compounding errors, and complex task dependencies collectively undermine system performance. This dissertation investigates how the reliability of learned robot policies can be improved at deployment time through mechanisms that operate around them. We develop three complementary classes of deployment-time mechanisms. First, we introduce runtime monitoring methods that detect impending failures by identifying inconsistencies in closed-loop policy behavior and deviations in task progress, without requiring failure data or task-specific supervision. Second, we propose a data-centric framework for policy interpretability that traces deployment-time successes and failures to influential training demonstrations using influence functions, enabling principled diagnosis and dataset curation. Third, we address reliable long-horizon task execution by formulating policy coordination as the problem of estimating and maximizing the success probability of behavior sequences, and we extend this formulation to open-ended, language-specified tasks through feasibility-aware task planning. By centering on core challenges of deployment, these contributions advance practical foundations for the reliable, real-world use of learned robot policies. Continued progress on these foundations will be essential for enabling trustworthy and scalable robot autonomy in the future.
comment: Stanford University PhD dissertation, 2026. 182 pages, 37 figures. Available from Stanford Digital Repository
$Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
We introduce $Ψ_0$ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to the fundamental kinematic and motion disparities between humans and humanoid robots. Therefore, data efficiency and model performance remain unsatisfactory despite the considerable data volume. To address this challenge, $Ψ_0$ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives: First, we autoregressively pre-train a VLM backbone on large-scale egocentric human videos to acquire generalizable visual-action representations. Then, we post-train a flow-based action expert on high-quality humanoid robot data to learn precise robot joint control. Our research further identifies a critical yet often overlooked data recipe: in contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments demonstrate that $Ψ_0$ achieves the best performance using only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10$\times$ as much data by over 40% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including a data processing and training pipeline, a humanoid foundation model, and a real-time action inference engine.
HumDex: Humanoid Dexterous Manipulation Made Easy
This paper investigates humanoid whole-body dexterous manipulation, where the efficient collection of high-quality demonstration data remains a central bottleneck. Existing teleoperation systems often suffer from limited portability, occlusion, or insufficient precision, which hinders their applicability to complex whole-body tasks. To address these challenges, we introduce HumDex, a portable teleoperation system designed for humanoid whole-body dexterous manipulation. Our system leverages IMU-based motion tracking to address the portability-precision trade-off, enabling accurate full-body tracking while remaining easy to deploy. For dexterous hand control, we further introduce a learning-based retargeting method that generates smooth and natural hand motions without manual parameter tuning. Beyond teleoperation, HumDex enables efficient collection of human motion data. Building on this capability, we propose a two-stage imitation learning framework that first pre-trains on diverse human motion data to learn generalizable priors, and then fine-tunes on robot data to bridge the embodiment gap for precise execution. We demonstrate that this approach significantly improves generalization to new configurations, objects, and backgrounds with minimal data acquisition costs. The entire system is fully reproducible and open-sourced at https://github.com/physical-superintelligence-lab/HumDex.
HandelBot: Real-World Piano Playing via Fast Adaptation of Dexterous Robot Policies
Mastering dexterous manipulation with multi-fingered hands has been a grand challenge in robotics for decades. Despite its potential, the difficulty of collecting high-quality data remains a primary bottleneck for high-precision tasks. While reinforcement learning and simulation-to-real-world transfer offer a promising alternative, the transferred policies often fail for tasks demanding millimeter-scale precision, such as bimanual piano playing. In this work, we introduce HandelBot, a framework that combines a simulation policy and rapid adaptation through a two-stage pipeline. Starting from a simulation-trained policy, we first apply a structured refinement stage to correct spatial alignments by adjusting lateral finger joints based on physical rollouts. Next, we use residual reinforcement learning to autonomously learn fine-grained corrective actions. Through extensive hardware experiments across five recognized songs, we demonstrate that HandelBot can successfully perform precise bimanual piano playing. Our system outperforms direct simulation deployment by a factor of 1.8x and requires only 30 minutes of physical interaction data.
comment: Website: https://amberxie88.github.io/handelbot
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics CVPR 2026
Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and $π_0$, achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
comment: Accepted to CVPR 2026. See project page at https://lmzpai.github.io/SaPaVe
ComFree-Sim: A GPU-Parallelized Analytical Contact Physics Engine for Scalable Contact-Rich Robotics Simulation and Control
Physics simulation for contact-rich robotics is often bottlenecked by contact resolution: mainstream engines enforce non-penetration and Coulomb friction via complementarity constraints or constrained optimization, requiring per-step iterative solves whose cost grows superlinearly with contact density. We present ComFree-Sim, a GPU-parallelized analytical contact physics engine built on complementarity-free contact modeling. ComFree-Sim computes contact impulses in closed form via an impedance-style prediction--correction update in the dual cone of Coulomb friction. Contact computation decouples across contact pairs and becomes separable across cone facets, mapping naturally to GPU kernels and yielding near-linear runtime scaling with the number of contacts. We further extend the formulation to a unified 6D contact model capturing tangential, torsional, and rolling friction, and introduce a practical dual-cone impedance heuristic. ComFree-Sim is implemented in Warp and exposed through a MuJoCo-compatible interface as a drop-in backend alternative to MuJoCo Warp (MJWarp). Experiments benchmark penetration, friction behaviors, stability, and simulation runtime scaling against MJWarp, demonstrating near-linear scaling and 2--3 times higher throughput in dense contact scenes with comparable physical fidelity. We deploy ComFree-Sim in real-time MPC for in-hand dexterous manipulation on a real-world multi-fingered LEAP hand and in dynamics-aware motion retargeting, demonstrating that low-latency simulation yields higher closed-loop success rates and enables practical high-frequency control in contact-rich tasks.
comment: 9 pages
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.
comment: The source code will be made publicly available at https://github.com/MengfeiD/O3N
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at qdigvla.github.io.
Robots that redesign themselves through kinematic self-destruction
Every robot built to date was predesigned by an external process, prior to deployment. Here we show a robot that actively participates in its own design during its lifetime. Starting from a randomly assembled body, and using only proprioceptive feedback, the robot dynamically "sculpts" itself into a new design through kinematic self-destruction: identifying redundant links within its body that inhibit its locomotion, and then thrashing those links against the surface until they break at the joint and fall off the body. It does so using a single autoregressive sequence model, a universal controller that learns in simulation when and how to simplify a robot's body through self-destruction and then adaptively controls the reduced morphology. The optimized policy successfully transfers to reality and generalizes to previously unseen kinematic trees, generating forward locomotion that is more effective than otherwise equivalent policies that randomly remove links or cannot remove any. This suggests that self-designing robots may be more successful than predesigned robots in some cases, and that kinematic self-destruction, though reductive and irreversible, could provide a general adaptive strategy for a wide range of robots.
COAD: Constant-Time Planning for Continuous Goal Manipulation with Compressed Library and Online Adaptation
In many robotic manipulation tasks, the robot repeatedly solves motion-planning problems that differ mainly in the location of the goal object and its associated obstacle, while the surrounding workspace remains fixed. Prior works have shown that leveraging experience and offline computation can accelerate repeated planning queries, but they lack guarantees of covering the continuous task space and require storing large libraries of solutions. In this work, we present COAD, a framework that provides constant-time planning over a continuous goal-parameterized task space. COAD discretizes the continuous task space into finitely many Task Coverage Regions. Instead of planning and storing solutions for every region offline, it constructs a compressed library by only solving representative root problems. Other problems are handled through fast adaptation from these root solutions. At query time, the system retrieves a root motion in constant time and adapts it to the desired goal using lightweight adaptation modules such as linear interpolation, Dynamic Movement Primitives, or simple trajectory optimization. We evaluate the framework on various manipulators and environments in simulation and the real world, showing that COAD achieves substantial compression of the motion library while maintaining high success rates and sub-millisecond-level queries, outperforming baseline methods in both efficiency and path quality. The source code is available at https://github.com/elpis-lab/CoAd.
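The query-time flow described above (map the goal to a region, retrieve a root motion in constant time, then adapt it) can be sketched as follows. Everything concrete here is an assumption made for illustration: the uniform 2D grid standing in for Task Coverage Regions, the `CELL_SIZE` value, the dictionary layout, and the toy path are hypothetical and are not taken from the COAD code base. Only the linear-interpolation adaptation is named in the abstract, and this is one simple way to realize it.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]

CELL_SIZE = 0.1  # hypothetical region width in metres


def region_of(goal: Point) -> Cell:
    """Map a continuous goal to its grid cell (stand-in for a coverage region)."""
    return (math.floor(goal[0] / CELL_SIZE), math.floor(goal[1] / CELL_SIZE))


def adapt_linear(root_path: List[Point], goal: Point) -> List[Point]:
    """Blend the goal offset in linearly along the path.

    The start waypoint is unchanged and the last waypoint lands on `goal`.
    Requires at least two waypoints.
    """
    end_x, end_y = root_path[-1]
    dx, dy = goal[0] - end_x, goal[1] - end_y
    n = len(root_path) - 1
    return [(x + dx * i / n, y + dy * i / n) for i, (x, y) in enumerate(root_path)]


# Compressed library: several regions share one stored root solution.
root_paths: Dict[str, List[Point]] = {
    "root_a": [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2)],
}
region_to_root: Dict[Cell, str] = {(2, 2): "root_a", (2, 3): "root_a"}


def query(goal: Point) -> List[Point]:
    """Dictionary lookups are O(1) on average, independent of library size."""
    root_id = region_to_root[region_of(goal)]
    return adapt_linear(root_paths[root_id], goal)


path = query((0.25, 0.34))
print(path)
```

In this toy setup the adaptation ignores obstacles entirely; the abstract's other adaptation modules (Dynamic Movement Primitives, simple trajectory optimization) would replace `adapt_linear` where a straight blend is not sufficient.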
comment: Adil Shiyas and Zhuoyun Zhong contributed equally to this work
One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies
Generative flow and diffusion models provide the continuous, multimodal action distributions needed for high-precision robotic policies. However, their reliance on iterative sampling introduces severe inference latency, degrading control frequency and harming performance in time-sensitive manipulation. To address this problem, we propose the One-Step Flow Policy (OFP), a from-scratch self-distillation framework for high-fidelity, single-step action generation without a pre-trained teacher. OFP unifies a self-consistency loss to enforce coherent transport across time intervals, and a self-guided regularization to sharpen predictions toward high-density expert modes. In addition, a warm-start mechanism leverages temporal action correlations to minimize the generative transport distance. Evaluations across 56 diverse simulated manipulation tasks demonstrate that a one-step OFP achieves state-of-the-art results, outperforming 100-step diffusion and flow policies while accelerating action generation by over $100\times$. We further integrate OFP into the $π_{0.5}$ model on RoboTwin 2.0, where one-step OFP surpasses the original 10-step policy. These results establish OFP as a practical, scalable solution for highly accurate and low-latency robot control.
Predictive and adaptive maps for long-term visual navigation in changing environments
In this paper, we compare different map management techniques for long-term visual navigation in changing environments. In this scenario, the navigation system needs to continuously update and refine its feature map in order to adapt to changes in the environment's appearance. To achieve reliable long-term navigation, the map management techniques have to (i) select features useful for the current navigation task, (ii) remove features that are obsolete, and (iii) add new features from the current camera view to the map. We propose several map management strategies and evaluate their performance with regard to the robot localisation accuracy in long-term teach-and-repeat navigation. Our experiments, performed over three months, indicate that strategies which model cyclic changes of the environment appearance and predict which features are going to be visible at a particular time and location outperform strategies which do not explicitly model the temporal evolution of the changes.
Beyond Motion Imitation: Is Human Motion Data Alone Sufficient to Explain Gait Control and Biomechanics?
With the growing interest in motion imitation learning (IL) for human biomechanics and wearable robotics, this study investigates how additional foot-ground interaction measures, used as reward terms, affect human gait kinematics and kinetics estimation within a reinforcement learning-based IL framework. Results indicate that accurate reproduction of forward kinematics alone does not ensure biomechanically plausible joint kinetics. Adding foot-ground contacts and contact forces to the IL reward terms enables the prediction of joint moments in forward walking simulation, which are significantly closer to those computed by inverse dynamics. This finding highlights a fundamental limitation of motion-only IL approaches, which may prioritize kinematics matching over physical consistency. Incorporating kinetic constraints, particularly ground reaction force and center of pressure information, significantly enhances the realism of internal and external kinetics. These findings suggest that, when imitation learning is applied to human-related research domains such as biomechanics and wearable robot co-design, kinetics-based reward shaping is necessary to achieve physically consistent gait representations.
comment: 8 pages, 7 figures
Push, Press, Slide: Mode-Aware Planar Contact Manipulation via Reduced-Order Models IROS 2026
Non-prehensile planar manipulation, including pushing and press-and-slide, is critical for diverse robotic tasks, but notoriously challenging due to hybrid contact mechanics, under-actuation, and asymmetric friction limits that traditionally necessitate computationally expensive iterative control. In this paper, we propose a mode-aware framework for planar manipulation with one or two robotic arms based on contact topology selection and reduced-order kinematic modeling. Our core insight is that complex wrench-twist limit surface mechanics can be abstracted into a discrete library of physically intuitive models. We systematically map various single-arm and bimanual contact topologies to simple non-holonomic formulations, e.g. unicycle for simplified press-and-slide motion. By anchoring trajectory generation to these reduced-order models, our framework computes the required object wrench and distributes feasible, friction-bounded contact forces via a direct algebraic allocator. We incorporate manipulator kinematics to ensure long-horizon feasibility and demonstrate our fast, optimization-free approach in simulation across diverse single-arm and bimanual manipulation tasks.
comment: 8 pages, 13 figures. Submitted to IEEE IROS 2026
GNN-DIP: Neural Corridor Selection for Decomposition-Based Motion Planning
Motion planning through narrow passages remains a core challenge: sampling-based planners rarely place samples inside these narrow but critical regions, and even when samples land inside a passage, the straight-line connections between them run close to obstacle boundaries and are frequently rejected by collision checking. Decomposition-based planners resolve both issues by partitioning free space into convex cells -- every passage is captured exactly as a cell boundary, and any path within a cell is collision-free by construction. However, the number of candidate corridors through the cell graph grows combinatorially with environment complexity, creating a bottleneck in corridor selection. We present GNN-DIP, a framework that addresses this by integrating a Graph Neural Network (GNN) with a two-phase Decomposition-Informed Planner (DIP). The GNN predicts portal scores on the cell adjacency graph to bias corridor search toward near-optimal regions while preserving completeness. In 2D, Constrained Delaunay Triangulation (CDT) with the Funnel algorithm yields exact shortest paths within corridors; in 3D, Slab convex decomposition with portal-face sampling provides near-optimal path evaluation. Benchmarks on 2D narrow-passage scenarios, 3D bottleneck environments with up to 246 obstacles, and dynamic 2D settings show that GNN-DIP achieves 99--100% success rates with 2--280 times speedup over sampling-based baselines.
A Learning-Based Approach for Contact Detection, Localization, and Force Estimation of Continuum Manipulators With Integrated OFDR Optical Fiber
Continuum manipulators (CMs) are widely used in minimally invasive procedures due to their compliant structure and ability to navigate deep and confined anatomical environments. However, their distributed deformation makes force sensing, contact detection, localization, and force estimation challenging, particularly when interactions occur at unknown arc-length locations along the robot. To address this problem, we propose a cascade learning-based framework (CLF) for CMs instrumented with a single distributed Optical Frequency Domain Reflectometry (OFDR) fiber embedded along one side of the robot. The OFDR sensor provides dense strain measurements along the manipulator backbone, capturing strain perturbations caused by external interactions. The proposed CLF first detects contact using a Gradient Boosting classifier and then estimates contact location and interaction force magnitude using a CNN--FiLM model that predicts a spatial force distribution along the manipulator. Experimental validation on a sensorized tendon-driven CM in an obstructed environment demonstrates that a single distributed OFDR fiber provides sufficient information to jointly infer contact occurrence, location, and force in continuum manipulators.
comment: 8 pages, 6 figures
Whleaper: A 10-DOF Flexible Bipedal Wheeled Robot
Wheel-legged robots combine the advantages of both wheeled robots and legged robots, offering versatile locomotion capabilities with excellent stability on challenging terrains and high efficiency on flat surfaces. However, existing wheel-legged robots typically have limited hip joint mobility compared to humans, even though the hip joint plays a crucial role in locomotion. In this paper, we introduce Whleaper, a novel 10-degree-of-freedom (DOF) bipedal wheeled robot, with 3 DOFs at the hip of each leg. Its humanoid joint design enables adaptable motion in complex scenarios, ensuring stability and flexibility. This paper introduces the details of Whleaper, with a focus on innovative mechanical design, control algorithms and system implementation. Firstly, stability stems from the increased DOFs at the hip, which expand the range of possible postures and improve the robot's foot-ground contact. Secondly, the extra DOFs also augment its mobility. During walking or sliding, more complex movements can be adopted to execute obstacle avoidance tasks. Thirdly, we utilize two control algorithms to implement multimodal motion for walking and sliding. By controlling specific DOFs of the robot, we conducted a series of simulations and practical experiments, demonstrating that a high-DOF hip joint design can effectively enhance the stability and flexibility of wheel-legged robots. Whleaper shows its capability to perform actions such as squatting, obstacle avoidance sliding, and rapid turning in real-world scenarios.
Robust Cooperative Localization in Featureless Environments: A Comparative Study of DCL, StCL, CCL, CI, and Standard-CL
Cooperative localization (CL) enables accurate position estimation in multi-robot systems operating in GPS-denied environments. This paper presents a comparative study of five CL approaches: Centralized Cooperative Localization (CCL), Decentralized Cooperative Localization (DCL), Sequential Cooperative Localization (StCL), Covariance Intersection (CI), and Standard Cooperative Localization (Standard-CL). All methods are implemented in ROS and evaluated through Monte Carlo simulations under two conditions: weak data association and robust detection. Our analysis reveals fundamental trade-offs among the methods. StCL and Standard-CL achieve the lowest position errors but exhibit severe filter inconsistency, making them unsuitable for safety-critical applications. DCL demonstrates remarkable stability under challenging conditions due to its measurement stride mechanism, which provides implicit regularization against outliers. CI emerges as the most balanced approach, achieving near-optimal consistency while maintaining competitive accuracy. CCL provides theoretically optimal estimation but shows sensitivity to measurement outliers. These findings offer practical guidance for selecting CL algorithms based on application requirements.
comment: Accepted and presented at the 2026 12th International Conference on Automation, Robotics and Applications (ICARA); to appear in IEEE conference proceedings
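Of the five methods compared above, Covariance Intersection has a compact closed form: the fused information matrix is a convex combination of the two input information matrices. A minimal sketch for two 2D estimates follows. It is not the paper's ROS implementation; the helper names and the grid search over the weight `w` (minimizing the trace of the fused covariance) are assumptions made for illustration.

```python
def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def covariance_intersection(xa, Pa, xb, Pb, w):
    """Fuse estimates (xa, Pa) and (xb, Pb) with weight w in [0, 1].

    P^-1 = w * Pa^-1 + (1 - w) * Pb^-1
    x    = P * (w * Pa^-1 * xa + (1 - w) * Pb^-1 * xb)
    """
    Ia, Ib = inv2(Pa), inv2(Pb)
    info = [[w * Ia[i][j] + (1 - w) * Ib[i][j] for j in range(2)]
            for i in range(2)]
    P = inv2(info)
    v = [sum(w * Ia[i][j] * xa[j] + (1 - w) * Ib[i][j] * xb[j]
             for j in range(2)) for i in range(2)]
    x = [sum(P[i][j] * v[j] for j in range(2)) for i in range(2)]
    return x, P


def fuse_min_trace(xa, Pa, xb, Pb, steps=100):
    """Pick w on a uniform grid so that the fused covariance has minimal trace."""
    best = None
    for k in range(steps + 1):
        w = k / steps
        x, P = covariance_intersection(xa, Pa, xb, Pb, w)
        trace = P[0][0] + P[1][1]
        if best is None or trace < best[0]:
            best = (trace, x, P, w)
    return best[1], best[2], best[3]
```

Because the rule never assumes the two estimates are independent, the fused covariance stays conservative when the cross-correlation between robots is unknown, which is consistent with the consistency result reported for CI above.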
Online Slip Detection and Friction Coefficient Estimation for Autonomous Racing
Accurate knowledge of the tire-road friction coefficient (TRFC) is essential for vehicle safety, stability, and performance, especially in autonomous racing, where vehicles often operate at the friction limit. However, TRFC cannot be directly measured with standard sensors, and existing estimation methods either depend on vehicle or tire models with uncertain parameters or require large training datasets. In this paper, we present a lightweight approach for online slip detection and TRFC estimation. Our approach relies solely on IMU and LiDAR measurements and the control actions, without special dynamical or tire models, parameter identification, or training data. Slip events are detected in real time by comparing commanded and measured motions, and the TRFC is then estimated directly from observed accelerations under no-slip conditions. Experiments with a 1:10-scale autonomous racing car across different friction levels demonstrate that the proposed approach achieves accurate and consistent slip detections and friction coefficients, with results closely matching ground-truth measurements. These findings highlight the potential of our simple, deployable, and computationally efficient approach for real-time slip monitoring and friction coefficient estimation in autonomous driving.
comment: Equal contribution by the first three authors
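The two-stage idea in this abstract (flag slip by comparing commanded and measured motion, then read friction from accelerations observed without slip) can be sketched as follows. This is an illustrative simplification, not the authors' estimator: it assumes flat ground, uses yaw rate as the compared motion signal, and takes the largest horizontal acceleration seen without slip, divided by g, as a lower bound on the friction coefficient. The function names and the `tolerance` threshold are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2


def detect_slip(commanded_yaw_rate, measured_yaw_rate, tolerance=0.3):
    """Flag slip when the measured yaw rate deviates from the commanded one.

    tolerance is a hypothetical threshold in rad/s.
    """
    return abs(commanded_yaw_rate - measured_yaw_rate) > tolerance


def estimate_friction(samples, tolerance=0.3):
    """Estimate a lower bound on the friction coefficient.

    samples: iterable of (commanded_yaw_rate, measured_yaw_rate, ax, ay),
    with ax, ay the horizontal accelerations in m/s^2.
    Returns max(|a| / g) over samples without slip, or None if every
    sample was flagged as slip.
    """
    best = None
    for cmd, meas, ax, ay in samples:
        if detect_slip(cmd, meas, tolerance):
            continue  # accelerations during slip are not used
        mu = (ax ** 2 + ay ** 2) ** 0.5 / G
        if best is None or mu > best:
            best = mu
    return best
```

On flat ground the tires can transmit at most a horizontal force of mu times the normal force, so any acceleration achieved without slip bounds mu from below; the bound tightens as the vehicle is driven closer to the friction limit.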
Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming
Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic programming algorithms restricts the utilization of massively parallel computing architectures like GPUs. To bridge this gap, we introduce a fully GPU-native trajectory optimization framework that combines sequential convex programming with a consensus-based alternating direction method of multipliers. By applying a temporal splitting strategy, our algorithm decouples the optimization horizon into independent, per-node subproblems that execute massively in parallel. The entire process runs fully on the GPU, eliminating costly memory transfers and large-scale sparse factorizations. This architecture naturally scales to multi-trajectory optimization. We validate the solver on a quadrotor agile flight task and a Mars powered descent problem using an on-board edge computing platform. Benchmarks reveal a sustained 4x throughput speedup and a 51% reduction in energy consumption over a heavily optimized 12-core CPU baseline. Crucially, the framework saturates the hardware, maintaining over 96% active GPU utilization to achieve planning rates exceeding 100 Hz. Furthermore, we demonstrate the solver's extensibility to robust Model Predictive Control by jointly optimizing dynamically coupled scenarios under stochastic disturbances, enabling scalable and safe autonomy.
Safe and Stylized Trajectory Planning for Autonomous Driving via Diffusion Model
Achieving safe and stylized trajectory planning in complex real-world scenarios remains a critical challenge for autonomous driving systems. This paper proposes the SDD Planner, a diffusion-based framework designed to effectively reconcile safety constraints with driving styles in real time. The framework integrates two core modules: a Multi-Source Style-Aware Encoder, which employs distance-sensitive attention to fuse dynamic agent data and environmental contexts for heterogeneous safety-style perception; and a Style-Guided Dynamic Trajectory Generator, which adaptively modulates priority weights within the diffusion denoising process to generate user-preferred yet safe trajectories. Extensive experiments demonstrate that SDD Planner achieves state-of-the-art performance. On the StyleDrive benchmark, it improves the SM-PDMS metric by 3.9% over WoTE, the strongest baseline. Furthermore, on the NuPlan Test14 and Test14-hard benchmarks, SDD Planner ranks first with overall scores of 91.76 and 80.32, respectively, outperforming leading methods such as PLUTO. Real-vehicle closed-loop tests further confirm that SDD Planner maintains high safety standards while aligning with preset driving styles, validating its practical applicability for real-world deployment.
comment: 12 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems
4D Radar-Inertial Odometry based on Gaussian Modeling and Multi-Hypothesis Scan Matching
4D millimeter-wave (mmWave) radars are sensors that provide robustness against adverse weather conditions (rain, snow, fog, etc.), and as such they are increasingly used for odometry and SLAM (Simultaneous Localization and Mapping). However, the noisy and sparse nature of the returned scan data proves to be a challenging obstacle for existing registration algorithms, especially those originally intended for more accurate sensors such as LiDAR. Following the success of 3D Gaussian Splatting for vision, in this paper we propose a summarized representation for radar scenes based on global simultaneous optimization of 3D Gaussians as opposed to voxel-based approaches, and leveraging its inherent Probability Density Function (PDF) for registration. Moreover, we propose optimizing multiple registration hypotheses for better protection against local optima of the PDF. We evaluate our modeling and registration system against state-of-the-art techniques, finding that our system provides richer models and more accurate registration results. Finally, we evaluate the effectiveness of our system in a real Radar-Inertial Odometry task. Experiments using publicly available 4D radar datasets show that our Gaussian approach is comparable to existing registration algorithms, outperforming them in several sequences. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
comment: Our code and results can be publicly accessed at: https://github.com/robotics-upo/gaussian-rio-cpp Accepted for publication in IEEE Robotics and Automation Letters
STONE Dataset: A Scalable Multi-Modal Surround-View 3D Traversability Dataset for Off-Road Robot Navigation ICRA 2026
Reliable off-road navigation requires accurate estimation of traversable regions and robust perception under diverse terrain and sensing conditions. However, existing datasets lack both scalability and multi-modality, which limits progress in 3D traversability prediction. In this work, we introduce STONE, a large-scale multi-modal dataset for off-road navigation. STONE provides (1) trajectory-guided 3D traversability maps generated by a fully automated, annotation-free pipeline, and (2) comprehensive surround-view sensing with synchronized 128-channel LiDAR, six RGB cameras, and three 4D imaging radars. The dataset covers a wide range of environments and conditions, including day and night, grasslands, farmlands, construction sites, and lakes. Our auto-labeling pipeline reconstructs dense terrain surfaces from LiDAR scans, extracts geometric attributes such as slope, elevation, and roughness, and assigns traversability labels beyond the robot's trajectory using a Mahalanobis-distance-based criterion. This design enables scalable, geometry-aware ground-truth construction without manual annotation. Finally, we establish a benchmark for voxel-level 3D traversability prediction and provide strong baselines under both single-modal and multi-modal settings. STONE is available at: https://konyul.github.io/STONE-dataset/
comment: ICRA 2026
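The labeling step described above, extending traversability beyond the driven trajectory with a Mahalanobis-distance criterion, can be sketched as follows. This is a simplified illustration rather than the STONE pipeline: it assumes each terrain cell is a tuple of geometric attributes (for example slope, elevation, roughness), fits a Gaussian with a diagonal covariance to the cells the robot actually drove over, and uses a hypothetical `threshold` on the distance. The real pipeline may use a full covariance and different attributes.

```python
def fit_diagonal_gaussian(features):
    """Fit per-dimension mean and variance to a list of feature tuples."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # A small constant keeps the variance strictly positive.
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n + 1e-9
           for d in range(dim)]
    return mean, var


def mahalanobis(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (an assumption)."""
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)) ** 0.5


def label_traversable(cells, driven_cells, threshold=3.0):
    """Label each cell True if its attributes resemble the driven terrain.

    driven_cells: attributes of cells under the robot's trajectory.
    threshold: hypothetical cutoff on the Mahalanobis distance.
    """
    mean, var = fit_diagonal_gaussian(driven_cells)
    return [mahalanobis(c, mean, var) <= threshold for c in cells]
```

The design choice is that the trajectory supplies positive examples for free: terrain the robot traversed defines the reference distribution, so no manual annotation is needed to label the surrounding cells.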
FSAG: Enhancing Human-to-Dexterous-Hand Finger-Specific Affordance Grounding via Diffusion Models
Dexterous grasp synthesis must jointly satisfy functional intent and physical feasibility, yet existing pipelines often decouple semantic grounding from refinement, yielding unstable or non-functional contacts under object and pose variations. This challenge is exacerbated by the high dimensionality and kinematic diversity of multi-fingered hands, which makes many methods rely on large, hardware-specific grasp datasets collected in simulation or through costly real-world trials. We propose a data-efficient framework that bypasses robot grasp data collection by exploiting object-centric semantic priors in pretrained generative diffusion models. Temporally aligned and fine-grained grasp affordances are extracted from raw human video demonstrations and fused with 3D scene geometry from depth images to infer semantically grounded contact targets. We further incorporate these affordance regions into the grasp refinement objective, explicitly guiding each fingertip toward its predicted region during optimization. The resulting system produces stable, human-intuitive multi-contact grasps across common objects and tools, while exhibiting strong generalization to previously unseen object instances within a category, pose variations, and multiple hand embodiments. This work (i) introduces a semantic affordance extraction pipeline leveraging vision--language generative priors for dexterous grasping, (ii) demonstrates cross-hand generalization without constructing hardware-specific grasp datasets, and (iii) establishes that a single depth modality suffices for high-performance grasp synthesis when coupled with foundation-model semantics. Our results highlight a path toward scalable, hardware-agnostic dexterous manipulation driven by human demonstrations and pretrained generative models.
ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations
Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted -- an unavoidable situation in real-world settings where sensor placement is hard to manage appropriately. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 40.6% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments.
comment: Accepted to RA-L. Project website: https://zheng-joe-lee.github.io/manivid3d/
Enhancing Heterogeneous Multi-Agent Cooperation in Decentralized MARL via GNN-driven Intrinsic Rewards AAMAS 2025
Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for various sequential decision-making and control tasks. Unlike their single-agent counterparts, multi-agent systems necessitate successful cooperation among the agents. The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals. These challenges become more pronounced under partial observability and the lack of prior knowledge about agent heterogeneity. While notable studies use intrinsic motivation (IM) to address reward sparsity or cooperation in decentralized settings, those dealing with heterogeneity typically assume centralized training, parameter sharing, and agent indexing. To overcome these limitations, we propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies in decentralized settings, under the challenges of partial observability and reward sparsity. Evaluation of CoHet in the Multi-agent Particle Environment (MPE) and Vectorized Multi-Agent Simulator (VMAS) benchmarks demonstrates superior performance compared to the state-of-the-art in a range of cooperative multi-agent scenarios. Our research is supplemented by an analysis of the impact of the agent dynamics model on the intrinsic motivation module, insights into the performance of different CoHet variants, and its robustness to an increasing number of heterogeneous agents.
comment: Full paper version for AAMAS 2025 (https://ifaamas.org/Proceedings/aamas2025/pdfs/p2681.pdf), 9 pages, 5 figures
XGrasp: Gripper-Aware Grasp Detection with Multi-Gripper Data Generation
Real-world robotic systems frequently require diverse end-effectors for different tasks; however, most existing grasp detection methods are optimized for a single gripper type, demanding retraining or optimization for each novel gripper configuration. This gripper-specific retraining paradigm is neither scalable nor practical. We propose XGrasp, a real-time gripper-aware grasp detection framework that generalizes to novel gripper configurations without additional training or optimization. To resolve data scarcity, we augment existing single-gripper datasets with multi-gripper annotations by incorporating the physical characteristics and closing trajectories of diverse grippers. Each gripper is represented as a two-channel 2D image encoding its static shape (Gripper Mask) and dynamic closing trajectory (Gripper Path). XGrasp employs a hierarchical two-stage architecture consisting of a Grasp Point Predictor (GPP) and an Angle-Width Predictor (AWP). In the AWP, contrastive learning with a quality-aware anchor builds a gripper-agnostic embedding space, enabling generalization to novel grippers without additional training. Experimental results demonstrate that XGrasp outperforms existing gripper-aware methods in both grasp success rate and inference speed across diverse gripper types. Project page: https://sites.google.com/view/xgrasp
comment: 9 pages, 10 figures
DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds
4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% on the VoD dataset, compared with 45.4% for CenterPoint.
Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 question-answer pairs spanning three evaluation paradigms targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents. Project page: https://cfg-bench.github.io/
RoboRouter: Training-Free Policy Routing for Robotic Manipulation
Research on robotic manipulation has developed a diverse set of policy paradigms, including vision-language-action (VLA) models, vision-action (VA) policies, and code-based compositional approaches. Concrete policies typically attain high success rates on specific task distributions but show limited generalization beyond them. Rather than proposing another monolithic policy, we propose to leverage the complementary strengths of existing approaches through intelligent policy routing. We introduce RoboRouter, a training-free framework that maintains a pool of heterogeneous policies and learns to select the best-performing policy for each task through accumulated execution experience. Given a new task, RoboRouter constructs a semantic task representation, retrieves historical records of similar tasks, predicts the optimal policy choice without requiring trial-and-error, and incorporates structured feedback to refine subsequent routing decisions. Integrating a new policy into the system requires only lightweight evaluation and incurs no training overhead. Across simulation benchmark and real-world evaluations, RoboRouter consistently outperforms individual policies, improving average success rate by more than 3% in simulation and over 13% in real-world settings, while preserving execution efficiency. Our results demonstrate that intelligent routing across heterogeneous, off-the-shelf policies provides a practical and scalable pathway toward building more capable robotic systems.
comment: We need to withdraw the paper as some of the reference papers are incorrect and need to be removed
KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System CVPR 2026
Visual-language reasoning, driving knowledge, and value alignment are essential for advanced autonomous driving systems. However, existing approaches largely rely on data-driven learning, making it difficult to capture the complex logic underlying decision-making through imitation or limited reinforcement rewards. To address this, we propose KnowVal, a new autonomous driving system that enables visual-language reasoning through the synergistic integration of open-world perception and knowledge retrieval. Specifically, we construct a comprehensive driving knowledge graph that encodes traffic laws, defensive driving principles, and ethical norms, complemented by an efficient LLM-based retrieval mechanism tailored for driving scenarios. Furthermore, we develop a human-preference dataset and train a Value Model to guide interpretable, value-aligned trajectory assessment. Experimental results show that our method substantially improves planning performance while remaining compatible with existing architectures. Notably, KnowVal achieves the lowest collision rate on nuScenes and state-of-the-art results on Bench2Drive and NVISIM.
comment: Accepted to CVPR 2026
Hyperbolic Multiview Pretraining for Robotic Manipulation CVPR 2026
3D-aware visual pretraining has proven effective in improving the performance of downstream robotic manipulation tasks. However, existing methods are constrained to Euclidean embedding spaces, whose flat geometry limits their ability to model structural relations among embeddings. As a result, they struggle to learn structured embeddings that are essential for robust spatial perception in robotic applications. To this end, we propose HyperMVP, a self-supervised framework for \underline{Hyper}bolic \underline{M}ulti\underline{V}iew \underline{P}retraining. Hyperbolic space offers geometric properties well suited for capturing structural relations. Methodologically, we extend the masked autoencoder paradigm and design a GeoLink encoder to learn multiview hyperbolic representations. The pretrained encoder is then finetuned with visuomotor policies on manipulation tasks. In addition, we introduce 3D-MOV, a large-scale dataset comprising multiple types of 3D point clouds to support pretraining. We evaluate HyperMVP on COLOSSEUM, RLBench, and real-world scenarios, where it consistently outperforms strong baselines across diverse tasks and perturbation settings. Our results highlight the potential of 3D-aware pretraining in a non-Euclidean space for learning robust and generalizable robotic manipulation policies.
comment: This paper was submitted to CVPR 2026 and was recommended for Findings, but the authors have withdrawn it and are currently adding more content to submit it elsewhere
GUIDES: Guidance Using Instructor-Distilled Embeddings for Pre-trained Robot Policy Enhancement ICRA 2026
Pre-trained robot policies serve as the foundation of many validated robotic systems, which encapsulate extensive embodied knowledge. However, they often lack the semantic awareness characteristic of foundation models, and replacing them entirely is impractical in many situations due to high costs and the loss of accumulated knowledge. To address this gap, we introduce GUIDES, a lightweight framework that augments pre-trained policies with semantic guidance from foundation models without requiring architectural redesign. GUIDES employs a fine-tuned vision-language model (Instructor) to generate contextual instructions, which are encoded by an auxiliary module into guidance embeddings. These embeddings are injected into the policy's latent space, allowing the legacy model to adapt to this new semantic input through brief, targeted fine-tuning. For inference-time robustness, a large language model-based Reflector monitors the Instructor's confidence and, when confidence is low, initiates a reasoning loop that analyzes execution history, retrieves relevant examples, and augments the VLM's context to refine subsequent actions. Extensive validation in the RoboCasa simulation environment across diverse policy architectures shows consistent and substantial improvements in task success rates. Real-world deployment on a UR5 robot further demonstrates that GUIDES enhances motion precision for critical sub-tasks such as grasping. Overall, GUIDES offers a practical and resource-efficient pathway to upgrade, rather than replace, validated robot policies.
comment: IEEE International Conference on Robotics and Automation (ICRA 2026)
Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation ICRA
Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as NeuS and its variants, generally require a large set of multi-view images as input and long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface requiring only a single RGB image, by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface following tasks and show its scalability to a variety of benchmark datasets. Code is publicly available at https://github.com/waynechu1109/FINS.
comment: 9 pages, 6 figures, 2026 IEEE International Conference on Robotics and Automation (ICRA)
RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA Models
Vision Language Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) inference offers an effective fix by easing edge-device computing pressure to meet real-time needs. However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) Mainstream environment-oriented edge-cloud partitioning methods are susceptible to interference from visual noise; (2) Existing edge-cloud partitioning methods overlook the step-wise redundancy unique to embodied tasks, thereby disrupting the physical continuity of motion. To address these issues, we propose a novel ECC inference framework, termed RAPID. Specifically, we developed an implementation tailored to the proposed framework. Experiments demonstrate that this implementation achieves a speedup of up to 1.73x with only 5% to 7% overhead.
When Semantics Connect the Swarm: LLM-Driven Fuzzy Control for Cooperative Multi-Robot Underwater Coverage
Underwater multi-robot cooperative coverage remains challenging due to partial observability, limited communication, environmental uncertainty, and the lack of access to global localization. To address these issues, this paper presents a semantics-guided fuzzy control framework that couples Large Language Models (LLMs) with interpretable control and lightweight coordination. Raw multimodal observations are compressed by the LLM into compact, human-interpretable semantic tokens that summarize obstacles, unexplored regions, and Objects Of Interest (OOIs) under uncertain perception. A fuzzy inference system with pre-defined membership functions then maps these tokens into smooth and stable steering and gait commands, enabling reliable navigation without relying on global positioning. Then, we further coordinate multiple robots by introducing semantic communication that shares intent and local context in linguistic form, enabling agreement on who explores where while avoiding redundant revisits. Extensive simulations in unknown reef-like environments show that, under limited sensing and communication, the proposed framework achieves robust OOI-oriented navigation and cooperative coverage with improved efficiency and adaptability, narrowing the gap between semantic cognition and distributed underwater control in GPS-denied, map-free conditions.
comment: Withdrawal for further improvement. The final version will be released in a few months
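The mapping from semantic tokens to smooth commands described above can be pictured with a small fuzzy inference sketch. Everything here is hypothetical: the scalar `proximity` input (standing in for a semantic token such as "obstacle near"), the three membership functions, and the turn rates are illustrative choices, not values from the paper.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), rising to 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def steering_rate(proximity):
    """Map obstacle proximity in [0, 1] to a turn rate in degrees per second.

    Hypothetical rule base with constant rule outputs:
      far -> 0 deg/s, medium -> 15 deg/s, near -> 40 deg/s.
    The result is the membership-weighted average of the rule outputs.
    """
    rules = [
        (triangular(proximity, -0.5, 0.0, 0.5), 0.0),   # far
        (triangular(proximity, 0.0, 0.5, 1.0), 15.0),   # medium
        (triangular(proximity, 0.5, 1.0, 1.5), 40.0),   # near
    ]
    total_weight = sum(weight for weight, _ in rules)
    if total_weight == 0.0:
        return 0.0  # input outside the covered range: issue no turn command
    return sum(weight * rate for weight, rate in rules) / total_weight
```

Because neighbouring membership functions overlap, the command changes gradually as the input moves between rules, which is what gives fuzzy controllers their smooth output.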
ReViP: Mitigating False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance
Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with vision-language features, resulting in state-dominant bias and false completions despite visible execution failures. We systematically analyze this failure mode, attributing it to modality imbalance, where policies overly rely on internal state progression and underuse visual evidence. To address this, we introduce the first False-Completion Benchmark Suite, featuring eight tasks with three controlled perturbations (Object Drop, Distractor Swap, Relayout) to comprehensively evaluate false completion. Moreover, we propose ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary progress-aware visual cues to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, progress-aware visual cues are extracted by an external Task-Stage Observer, which performs task-relevant reasoning on real-time observations to drive task-stage feature-wise linear modulation, enhancing environmental awareness and mitigating state-driven errors. Extensive experiments show that ReViP effectively mitigates false completion and improves success rates over strong VLA baselines, achieving a 26% gain over the $π_0$ model on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
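Feature-wise linear modulation, mentioned in the abstract above, scales and shifts each feature channel with a conditioning signal. The sketch below shows only that generic operation on plain lists; how ReViP derives the scale and shift from its Task-Stage Observer is not described in the abstract, so `gamma` and `beta` are simply passed in as hypothetical inputs.

```python
def film(features, gamma, beta):
    """Generic feature-wise linear modulation: y_i = gamma_i * x_i + beta_i.

    Assumption: gamma and beta would come from a conditioning network
    (here, task-stage cues); they are supplied directly in this sketch.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have equal length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]
```

With `gamma` set to all ones and `beta` to all zeros the features pass through unchanged, so the conditioning signal can smoothly strengthen or suppress individual channels.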
DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models ICRA 2026
Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios where context is critical for correct judgment and annotated with pairwise human preferences, and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems. The project page for DriveCritic is https://song-jingyu.github.io/DriveCritic
comment: Accepted at ICRA 2026; 8 pages, 3 figures
Decision-Aware Uncertainty Evaluation of Vision-Language Model-Based Early Action Anticipation for Human-Robot Interaction
Robots in shared workspaces must interpret human actions from partial, ambiguous observations, where overconfident early predictions can lead to unsafe or disruptive interaction. This challenge is amplified in egocentric views, where viewpoint changes and occlusions increase perceptual noise and ambiguity. As a result, downstream human-robot interaction modules require not only an action hypothesis but also a trustworthy estimate of confidence under partial observation. Recent vision-language model-based approaches have been proposed for short-term action recognition due to their open-vocabulary and context-aware reasoning, but their uncertainty reliability in the temporal-prefix regime is largely uncharacterized. We present the first systematic evaluation of uncertainty in vision-language model-based short-term action recognition for human-robot interaction. We introduce a temporal-prefix evaluation protocol and metrics for calibration and selective prediction. We also characterize miscalibration patterns and failure modes under partial observations. Our study provides the missing reliability evidence needed to use vision-language model predictions in confidence-gated human-robot interaction modules.
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Scalable Surface-Based Manipulation Through Modularity and Inter-Module Object Transfer
Robotic Manipulation Surfaces (RMS) manipulate objects by deforming the surface on which they rest, offering safe, parallel handling of diverse and fragile items. However, existing designs face a fundamental tradeoff: achieving fine control typically demands dense actuator arrays that limit scalability. Modular architectures can extend the workspace, but transferring objects reliably across module boundaries on soft, continuously deformable surfaces remains an open challenge. We present a multi-modular soft manipulation platform that achieves coordinated inter-module object transfer and precise positioning across interconnected fabric-based modules. A hierarchical control framework, combining conflict-free Manhattan-based path planning with directional object passing and a geometric PID controller, achieves sub-centimeter positioning and consistent transfer of heterogeneous objects including fragile items. The platform employs shared-boundary actuation, where adjacent modules share edge actuators, reducing the required count from $4n^2$ to $(n + 1)^2$ for an $n \times n$ grid; a $2\times 2$ prototype covers $1\times 1$ m with only 9 actuators. This scaling comes at a cost: shared actuators mechanically couple neighbouring modules, creating interference during simultaneous manipulation. We systematically characterise this coupling across spatial configurations and propose compensation strategies that reduce passive-object displacement by 59-78%. Together, these contributions establish a scalable foundation for soft manipulation surfaces in applications such as food processing and logistics.
comment: 8 pages
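The actuator counts quoted in the abstract above can be checked with a few lines of arithmetic. The function names are illustrative; the formulas $4n^2$ and $(n + 1)^2$ are taken directly from the abstract.

```python
def independent_actuators(n):
    """Actuators for an n x n grid when every module has its own 4 actuators."""
    return 4 * n * n


def shared_boundary_actuators(n):
    """Actuators for an n x n grid when adjacent modules share edge actuators."""
    return (n + 1) ** 2


def actuator_table(max_n):
    """Rows of (n, independent count, shared count) for n = 1 .. max_n."""
    return [
        (n, independent_actuators(n), shared_boundary_actuators(n))
        for n in range(1, max_n + 1)
    ]
```

For the $2\times 2$ prototype this gives 16 actuators without sharing and 9 with sharing, matching the figure in the abstract; the ratio approaches 4 to 1 as the grid grows.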
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment CVPR 2026
We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
comment: Accepted to CVPR 2026
Robust Attitude Control of Nonlinear UAV Dynamics with LFT Models and $\mathcal{H}_\infty$ Performance
Attitude stabilization of unmanned aerial vehicles (UAVs) in uncertain environments presents significant challenges due to nonlinear dynamics, parameter variations, and sensor limitations. This paper presents a comparative study of $\mathcal{H}_\infty$ and classical PID controllers for multi-rotor attitude regulation in the presence of wind disturbances and gyroscope noise. The flight dynamics are modeled using a linear parameter-varying (LPV) framework, where nonlinearities and parameter variations are systematically represented as structured uncertainties within a linear fractional transformation formulation. A robust controller based on an $\mathcal{H}_\infty$ formulation is designed using only gyroscope measurements to ensure guaranteed performance bounds. Nonlinear simulation results demonstrate the effectiveness of the robust controllers compared to classical PID control, showing significant improvement in attitude regulation under severe wind disturbances.
comment: 6 pages, 6 figures, 3 tables, submitted to ACC 2026
Warped Hypertime Representations for Long-term Autonomy of Mobile Robots
This paper presents a novel method for introducing time into discrete and continuous spatial representations used in mobile robotics, by modelling long-term, pseudo-periodic variations caused by human activities. Unlike previous approaches, the proposed method does not treat time and space separately, and its continuous nature respects both the temporal and spatial continuity of the modeled phenomena. The method extends the given spatial model with a set of wrapped dimensions that represent the periodicities of observed changes. By performing clustering over this extended representation, we obtain a model that allows us to predict future states of both discrete and continuous spatial representations. We apply the proposed algorithm to several long-term datasets and show that the method enables a robot to predict future states of representations with different dimensions. The experiments further show that the method achieves more accurate predictions than the previous state of the art.
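One way to picture the wrapped dimensions described above is to project a timestamp onto a circle for each modelled period, so that moments one period apart land on the same point. The sketch below assumes a cosine and sine pair per period; the function name `hypertime_extend` and the example period of one day are illustrative, and the clustering step from the abstract is not shown.

```python
import math


def hypertime_extend(point, t, periods):
    """Append one (cos, sin) pair per period to a spatial point.

    Assumption: each periodicity is represented by wrapping time t onto a
    unit circle; t and the periods share the same unit (e.g. seconds).
    """
    extended = list(point)
    for period in periods:
        angle = 2.0 * math.pi * t / period
        extended.extend([math.cos(angle), math.sin(angle)])
    return extended
```

For example, with a daily period of 86400 seconds, observations taken at 01:00 on consecutive days map to (almost exactly) the same extended vector, so a clustering method applied in this space groups them together.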
WHED: A Wearable Hand Exoskeleton for Natural, High-Quality Demonstration Collection
Scalable learning of dexterous manipulation remains bottlenecked by the difficulty of collecting natural, high-fidelity human demonstrations of multi-finger hands due to occlusion, complex hand kinematics, and contact-rich interactions. We present WHED, a wearable hand-exoskeleton system designed for in-the-wild demonstration capture, guided by two principles: wearability-first operation for extended use and a pose-tolerant, free-to-move thumb coupling that preserves natural thumb behaviors while maintaining a consistent mapping to the target robot thumb degrees of freedom. WHED integrates a linkage-driven finger interface with passive fit accommodation, a modified passive hand with robust proprioceptive sensing, and an onboard sensing/power module. We also provide an end-to-end data pipeline that synchronizes joint encoders, AR-based end-effector pose, and wrist-mounted visual observations, and supports post-processing for time alignment and replay. We demonstrate feasibility on representative grasping and manipulation sequences spanning precision pinch and full-hand enclosure grasps, and show qualitative consistency between collected demonstrations and replayed executions.
comment: This manuscript is withdrawn because the work is being substantially revised for submission to a peer-reviewed venue. The current version may be incomplete or misleading
Multiagent Systems
AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling
State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non-scalable for large heterogeneous clusters. Second, the ones that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS). AGMARL-DKS addresses these gaps by introducing three major innovations. First, we construct a scalable solution by treating the scheduling challenge as a cooperative multi-agent problem, where every cluster node operates as an agent, employing centralised training methods before decentralised execution. Second, to be context-aware and yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent. This represents an improvement over methods that rely solely on local observations. Finally, to make trade-offs between these objectives, we use a stress-aware lexicographical ordering policy instead of a simple, static linear weighting of these objectives. The evaluations in Google Kubernetes Engine (GKE) reveal that AGMARL-DKS significantly outperforms the default scheduler in terms of fault tolerance, utilisation, and cost, especially in scheduling batch and mission-critical workloads.
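The difference between a lexicographic ordering and a linear weighting can be sketched with tuple comparison, which Python performs lexicographically. The node names, the integer scores, and the two priority orders below are hypothetical; the abstract does not specify how AGMARL-DKS defines stress or ranks its objectives.

```python
def pick_node(candidates, stressed):
    """Choose a node by lexicographic priority over discretised scores.

    Assumptions (illustrative only): each candidate has integer scores for
    "stability", "utilisation", and "cost" (lower cost is better), and the
    priority order switches when the cluster is under stress.
    """
    def priority(metrics):
        if stressed:
            # Under stress: stability first, then low cost, then utilisation.
            return (metrics["stability"], -metrics["cost"], metrics["utilisation"])
        # Normal operation: utilisation first, then low cost, then stability.
        return (metrics["utilisation"], -metrics["cost"], metrics["stability"])

    return max(candidates, key=lambda name: priority(candidates[name]))


CANDIDATES = {
    "node-a": {"stability": 3, "utilisation": 1, "cost": 5},
    "node-b": {"stability": 2, "utilisation": 3, "cost": 2},
    "node-c": {"stability": 3, "utilisation": 2, "cost": 7},
}
```

Unlike a weighted sum, a lower-priority objective only matters here when all higher-priority objectives tie, so no amount of cost saving can outweigh a stability difference while the cluster is stressed.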
CogSearch: A Cognitive-Aligned Multi-Agent Framework for Proactive Decision Support in E-Commerce Search
Modern e-commerce search engines, largely rooted in passive retrieval-and-ranking models, frequently fail to support complex decision-making, leaving users overwhelmed by cognitive friction. In this paper, we introduce CogSearch, a novel cognitive-oriented multi-agent framework that reimagines e-commerce search as a proactive decision support system. By synergizing four specialized agents, CogSearch mimics human cognitive workflows: it decomposes intricate user intents, fuses heterogeneous knowledge across internal and external sources, and delivers highly actionable insights. Our offline benchmarks validate CogSearch's excellence in consultative and complex search scenarios. Extensive online A/B testing on JD.com demonstrates the system's transformative impact: it reduced decision costs by 5% and achieved a 0.41% increase in overall UCVR, with a remarkable 30% surge in conversion for decision-heavy queries. CogSearch represents a fundamental shift in information retrieval, moving beyond traditional relevance-centric paradigms toward a future of holistic, collaborative decision intelligence.
The price of decentralization in managing engineering systems through multi-agent reinforcement learning
Inspection and maintenance (I&M) planning involves sequential decision making under uncertainties and incomplete information, and can be modeled as a partially observable Markov decision process (POMDP). While single-agent deep reinforcement learning provides approximate solutions to POMDPs, it does not scale well in multi-component systems. Scalability can be achieved through multi-agent deep reinforcement learning (MADRL), which decentralizes decision-making across multiple agents, locally controlling individual components. However, this decentralization can induce cooperation pathologies that degrade the optimality of the learned policies. To examine these effects in I&M planning, we introduce a set of deteriorating systems in which redundancy is varied systematically. These benchmark environments are designed such that computation of centralized (near-)optimal policies remains tractable, enabling direct comparison of solution methods. We implement and benchmark a broad set of MADRL algorithms spanning fully centralized and decentralized training paradigms, from value-factorization to actor-critic methods. Our results show a clear effect of redundancy on coordination: MADRL algorithms achieve near-optimal performance in series-like settings, whereas increasing redundancy amplifies coordination challenges and can lead to optimality losses. Nonetheless, decentralized agents learn structured policies that consistently outperform optimized heuristic baselines, highlighting both the promise and current limitations of decentralized learning for scalable maintenance planning.
Hybrid Human-Agent Social Dilemmas in Energy Markets
In hybrid populations where humans delegate strategic decision-making to autonomous agents, understanding when and how cooperative behaviors can emerge remains a key challenge. We study this problem in the context of energy load management: consumer agents schedule their appliance use under demand-dependent pricing. This structure can create a social dilemma where everybody would benefit from coordination, but in equilibrium agents often choose to incur the congestion costs that cooperative turn-taking would avoid. To address the problem of coordination, we introduce artificial agents that use globally observable signals to increase coordination. Using evolutionary dynamics and reinforcement learning experiments, we show that artificial agents can shift the learning dynamics to favour coordination outcomes. An often neglected problem is partial adoption: what happens when the technology of artificial agents is in the early adoption stages? We analyze mixed populations of adopters and non-adopters, demonstrating that unilateral entry is feasible: adopters are not structurally penalized, and partial adoption can still improve aggregate outcomes. However, in some parameter regimes, non-adopters may benefit disproportionately from the cooperation induced by adopters. This asymmetry, while not precluding beneficial entry, warrants consideration in deployment, and highlights strategic issues around the adoption of AI technology in multiagent settings.
comment: 20 pages, 7 figures. Submitted to Proceedings of the Royal Society A, Special Issue on "The evolution of sociality in hybrid human AI populations"
From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts
Multi-agent LLM systems increasingly tackle complex reasoning, yet their interaction patterns remain limited to voting, unstructured debate, or pipeline orchestration. None model deliberation: a phased process where differentiated participants exchange typed reasoning moves, preserve disagreements, and converge on accountable outcomes. We introduce Deliberative Collective Intelligence (DCI), specifying four reasoning archetypes, 14 typed epistemic acts, a shared workspace, and DCI-CF, a convergent flow algorithm that guarantees termination with a structured decision packet containing the selected option, residual objections, minority report, and reopen conditions. We evaluate on 45 tasks across seven domains using Gemini 2.5 Flash. On non-routine tasks (n=40), DCI significantly improves over unstructured debate (+0.95, 95% CI [+0.41, +1.54]). DCI excels on hidden-profile tasks requiring perspective integration (9.56, highest of any system on any domain) while failing on routine decisions (5.39), confirming task-dependence. DCI produces 100% structured decision packets and 98% minority reports, artifacts absent from all baselines. However, DCI consumes ~62x single-agent tokens, and single-agent generation outperforms DCI on overall quality. DCI's contribution is not that more agents are better, but that consequential decisions benefit from deliberative structure when process accountability justifies the cost.
comment: 26 pages, 6 tables, 2 figures, 2 listings
Multi-Agent Reinforcement Learning for UAV-Based Chemical Plume Source Localization
Undocumented orphaned wells pose significant health and environmental risks to nearby communities by releasing toxic gases and contaminating water sources, with methane emissions being a primary concern. Traditional survey methods such as magnetometry often fail to detect older wells effectively. In contrast, aerial in-situ sensing using unmanned aerial vehicles (UAVs) offers a promising alternative for methane emission detection and source localization. This study presents a robust and efficient framework based on a multi-agent deep reinforcement learning (MARL) algorithm for the chemical plume source localization (CPSL) problem. The proposed approach leverages virtual anchor nodes to coordinate UAV navigation, enabling collaborative sensing of gas concentrations and wind velocities through onboard and shared measurements. Source identification is achieved by analyzing the historical trajectory of anchor node placements within the plume. Comparative evaluations against the fluxotaxis method demonstrate that the MARL framework achieves superior performance in both localization accuracy and operational efficiency.
How Intelligence Emerges: A Minimal Theory of Dynamic Adaptive Coordination
This paper develops a dynamical theory of adaptive coordination in multi-agent systems. Rather than analyzing coordination through equilibrium optimization or agent-centric learning alone, the framework models agents, incentives, and environment as a recursively closed feedback architecture. A persistent environment stores accumulated coordination signals, a distributed incentive field transmits those signals locally, and adaptive agents update in response. Coordination is thus treated as a structural property of coupled dynamics rather than as the solution to a centralized objective. The paper establishes three structural results. First, under dissipativity assumptions, the induced closed-loop system admits a bounded forward-invariant region, ensuring viability without requiring global optimality. Second, when incentive signals depend non-trivially on persistent environmental memory, the resulting dynamics generically cannot be reduced to a static global objective defined solely over the agent state space. Third, persistent environmental state induces history sensitivity unless the system is globally contracting. A minimal linear specification illustrates how coupling, persistence, and dissipation govern local stability and oscillatory regimes through spectral conditions on the Jacobian. The results establish structural conditions under which intelligent coordination dynamics emerge from incentive-mediated adaptive interaction within a persistent environment, without presuming welfare maximization, rational expectations, or centralized design.
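As a rough illustration of how spectral conditions on a Jacobian separate stable from oscillatory regimes, the sketch below classifies a two-variable linear system. The specific system is an assumption invented for this example and is not the paper's specification: an agent state with dissipation `d`, an environmental memory with persistence `p` (decay rate `1 - p`), and an antisymmetric coupling `k`, giving the Jacobian `[[-d, k], [-k, -(1 - p)]]`.

```python
def classify(d, p, k):
    """Return (stable, oscillatory) for the hypothetical 2x2 linear system.

    Jacobian (assumed for illustration): [[-d, k], [-k, -(1 - p)]].
    For a 2x2 matrix the eigenvalues have negative real parts exactly when
    trace < 0 and det > 0, and they are complex exactly when
    trace^2 - 4 * det < 0.
    """
    decay = 1.0 - p
    trace = -d - decay
    det = d * decay + k * k
    discriminant = trace * trace - 4.0 * det
    stable = trace < 0.0 and det > 0.0
    oscillatory = discriminant < 0.0
    return stable, oscillatory
```

In this toy system the discriminant simplifies to `(d - (1 - p))**2 - 4 * k**2`, so oscillation appears once the coupling is strong relative to the mismatch between dissipation and memory decay, while perfect persistence with no dissipation (`p = 1`, `d = 0`) leaves the system only marginally stable.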
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
Time Series Event Detection (TSED) has long been an important task with critical applications across many high-stakes domains. Unlike statistical anomalies, events are defined by semantics with complex internal structures, which are difficult to learn inductively from scarce labeled data in real-world settings. In light of this, we introduce Knowledge-Guided TSED, a new setting where a model is given a natural-language event description and must ground it to intervals in multivariate signals with little or no training data. To tackle this challenge, we introduce Event Logic Tree (ELT), a novel knowledge representation framework to bridge linguistic descriptions and physical time series data via modeling the intrinsic temporal-logic structures of events. Based on ELT, we present a neuro-symbolic VLM agent framework that iteratively instantiates primitives from signal visualizations and composes them under ELT constraints, producing both detected intervals and faithful explanations in the form of instantiated trees. To validate the effectiveness of our approach, we release a benchmark based on real-world time series data with expert knowledge and annotations. Experiments and human evaluation demonstrate the superiority of our method compared to supervised fine-tuning baselines and existing zero-shot time series reasoning frameworks based on LLMs/VLMs. We also show that ELT is critical in mitigating VLMs' inherent hallucination in matching signal morphology with event semantics.
comment: Work in progress
Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution ICLR 2026
We present Verified Multi-Agent Orchestration (VMAO), a framework that coordinates specialized LLM-based agents through a verification-driven iterative loop. Given a complex query, our system decomposes it into a directed acyclic graph (DAG) of sub-questions, executes them through domain-specific agents in parallel, verifies result completeness via LLM-based evaluation, and adaptively replans to address gaps. The key contributions are: (1) dependency-aware parallel execution over a DAG of sub-questions with automatic context propagation, (2) verification-driven adaptive replanning that uses an LLM-based verifier as an orchestration-level coordination signal, and (3) configurable stop conditions that balance answer quality against resource usage. On 25 expert-curated market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1-5 scale) compared to a single-agent baseline, demonstrating that orchestration-level verification is an effective mechanism for multi-agent quality assurance.
comment: ICLR 2026 Workshop on MALGAI
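The plan, execute, verify, and replan loop described above can be sketched with the standard-library `graphlib` module. This is a minimal sketch under stated assumptions, not the VMAO implementation: the `answer` and `verify` callables, the sub-question names, and the `max_rounds` stop condition are hypothetical stand-ins for the LLM agents and verifier, and each batch is executed sequentially where a real system would run it in parallel.

```python
from graphlib import TopologicalSorter


def execution_batches(dependencies):
    """Group sub-questions into batches whose dependencies are already done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def run(dependencies, answer, verify, max_rounds=3):
    """Execute the DAG, ask the verifier for gaps, and replan until done.

    dependencies: dict mapping each sub-question to the set it depends on.
    answer(question, context): hypothetical agent call; context holds the
        results of the question's dependencies (context propagation).
    verify(results): hypothetical verifier; returns new sub-questions with
        their dependencies, or an empty dict when the answer is complete.
    """
    results = {}
    for _ in range(max_rounds):
        for batch in execution_batches(dependencies):
            for question in batch:
                if question not in results:
                    context = {dep: results[dep] for dep in dependencies[question]}
                    results[question] = answer(question, context)
        gaps = verify(results)
        if not gaps:
            break
        dependencies = {**dependencies, **gaps}
    return results
```

Capping the number of rounds is one simple form of the configurable stop condition mentioned in the abstract: it bounds resource usage even if the verifier keeps reporting gaps.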
EducaSim: Interactive Simulacra for CS1 Instructional Practice
Role play is a high-impact mode of training that has demonstrated its effectiveness in improving learning outcomes. However, it is difficult to scale to teacher instruction due to its inherent dependency on providing personnel who are both trained and available to facilitate this learning environment. This poses a challenge, especially to massive online courses which may employ and aid hundreds to thousands of novice teachers. In this work, we present EducaSim: a novel framework that uses generative agents to simulate a small-group section for teachers-in-training to practice instruction. EducaSim works by implementing diverse pedagogical-based personas, actual course material, and agent-based architectures constructed for instructional practice to provide a pedagogically rich environment for teachers-in-training to engage in role play learning -- without the costly overhead that comes with it. We share our experiences with constructing and making the tool available for experimental training and preparation in a six-week CS1 course supporting 20,000 students. We found that teachers who engaged generally saw it as a positive experience. We believe that EducaSim is an important step to providing experiential teaching practice at scale for closely-defined settings and has great potential for future applications.
comment: 7 pages, 3 figures, 2 tables. Presents a multi-agent generative architecture for educational simulations intended for instructor training
Language Model Teams as Distributed Systems
Large language models (LLMs) are growing increasingly capable, prompting recent interest in LLM teams. Yet, despite increased deployment of LLM teams at scale, we lack a principled framework for addressing key questions such as when a team is helpful, how many agents to use, how structure impacts performance -- and whether a team is better than a single agent. Rather than designing and testing these possibilities through trial-and-error, we propose using distributed systems as a principled foundation for creating and evaluating LLM teams. We find that many of the fundamental advantages and challenges studied in distributed computing also arise in LLM teams, highlighting the rich practical insights that can come from the cross-talk of these two fields of study.
VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning
Recent advancements in Large Language Models (LLMs) have largely focused on depth scaling, where a single agent solves long-horizon problems with multi-turn reasoning and tool use. However, as tasks grow broader, the key bottleneck shifts from individual competence to organizational capability. In this work, we explore a complementary dimension of width scaling with multi-agent systems to address broad information seeking. Existing multi-agent systems often rely on hand-crafted workflows and turn-taking interactions that fail to parallelize work effectively. To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution. By utilizing a shared LLM with isolated contexts and specialized tools, WideSeek-R1 jointly optimizes the lead agent and parallel subagents on a curated dataset of 20k broad information-seeking tasks. Extensive experiments show that WideSeek-R1-4B achieves an item F1 score of 40.0% on the WideSearch benchmark, which is comparable to the performance of single-agent DeepSeek-R1-671B. Furthermore, WideSeek-R1-4B exhibits consistent performance gains as the number of parallel subagents increases, highlighting the effectiveness of width scaling.
comment: https://wideseek-r1.github.io/
Resilient Topology-Aware Coordination for Dynamic 3D UAV Networks under Node Failure
Ensuring continuous service coverage under unexpected hardware failures is a fundamental challenge for 3D Aerial-Ground Integrated Networks. Although Multi-Agent Reinforcement Learning facilitates autonomous coordination, traditional architectures often lack resilience to sudden topology deformations. This paper proposes the Topology-Aware Graph MAPPO (TAG-MAPPO) framework to enhance system survivability through autonomous 3D spatial reconfiguration. Our framework integrates graph-based feature aggregation with a residual ego-state fusion mechanism to capture intricate inter-agent dependencies. To achieve structural robustness, we introduce a Random Observation Shuffling mechanism that fosters strong generalization to agent population fluctuations by breaking coordinate-index dependencies. Extensive simulations across heterogeneous environments, including high-speed mobility at 15 meters per second, demonstrate that TAG-MAPPO significantly outperforms Multi-Layer Perceptron baselines. Specifically, the framework reduces redundant handoffs by up to 50 percent while maintaining superior energy efficiency. Most notably, TAG-MAPPO exhibits exceptional self-healing capabilities, restoring over 90 percent of pre-failure coverage within 15 time steps. In dense urban scenarios, the framework achieves a post-failure fairness index surpassing its original four-UAV configuration by autonomously resolving service overlaps and interference. These findings confirm that topology-aware coordination is essential for resilient 6G aerial networks.
comment: 14 pages, 5 figures. Full research paper providing a resilience-aware RL framework for UAV networks under node failure. A preliminary version has been submitted to IEEE Journal for possible publication
Agentic Design Review System
Evaluating graphic designs involves assessing them from multiple facets like alignment, composition, aesthetics and color choices. Evaluating designs in a holistic way involves aggregating feedback from individual expert reviewers. Towards this, we propose an Agentic Design Review System (AgenticDRS), where multiple agents collaboratively analyze a design, orchestrated by a meta-agent. A novel in-context exemplar selection approach based on graph matching and a unique prompt expansion method play a central role in making each agent design-aware. To evaluate this framework, we propose the DRS-BENCH benchmark. Thorough experimental evaluation against state-of-the-art baselines adapted to the problem setup, backed up by critical ablation experiments, brings out the efficacy of Agentic-DRS in evaluating graphic designs and generating actionable feedback. We hope that this work will attract attention to this pragmatic, yet under-explored research direction.
comment: Project Page: https://sayannag.github.io/AgenticDRS
Can AI Agents Agree?
Large language models are increasingly deployed as cooperating agents, yet their behavior in adversarial consensus settings has not been systematically studied. We evaluate LLM-based agents on a Byzantine consensus game over scalar values using a synchronous all-to-all simulation. We test consensus in a no-stake setting where agents have no preferences over the final value, so evaluation focuses on agreement rather than value optimality. Across hundreds of simulations spanning model sizes, group sizes, and Byzantine fractions, we find that valid agreement is not reliable even in benign settings and degrades as group size grows. Introducing a small number of Byzantine agents further reduces success. Failures are dominated by loss of liveness, such as timeouts and stalled convergence, rather than subtle value corruption. Overall, the results suggest that reliable agreement is not yet a dependable emergent capability of current LLM-agent groups even in no-stake settings, raising caution for deployments that rely on robust coordination.
Enhancing Heterogeneous Multi-Agent Cooperation in Decentralized MARL via GNN-driven Intrinsic Rewards AAMAS 2025
Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for various sequential decision-making and control tasks. Unlike their single-agent counterparts, multi-agent systems necessitate successful cooperation among the agents. The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals. These challenges become more pronounced under partial observability and the lack of prior knowledge about agent heterogeneity. While notable studies use intrinsic motivation (IM) to address reward sparsity or cooperation in decentralized settings, those dealing with heterogeneity typically assume centralized training, parameter sharing, and agent indexing. To overcome these limitations, we propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies in decentralized settings, under the challenges of partial observability and reward sparsity. Evaluation of CoHet in the Multi-agent Particle Environment (MPE) and Vectorized Multi-Agent Simulator (VMAS) benchmarks demonstrates superior performance compared to the state-of-the-art in a range of cooperative multi-agent scenarios. Our research is supplemented by an analysis of the impact of the agent dynamics model on the intrinsic motivation module, insights into the performance of different CoHet variants, and its robustness to an increasing number of heterogeneous agents.
comment: Full paper version for AAMAS 2025 (https://ifaamas.org/Proceedings/aamas2025/pdfs/p2681.pdf), 9 pages, 5 figures
Partially Observable Multi-Agent Reinforcement Learning with Information Sharing ICML 2023
We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential information-sharing among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communication. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-polynomial time and sample single-agent RL with partial observations, for tractably solving POSGs. Inspired by the inefficiency of planning in the ground-truth model, we then propose to further approximate the shared common information to construct an approximate model of the POSG, in which an approximate equilibrium (of the original POSG) can be found in quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm whose time and sample complexities are both quasi-polynomial. Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the team-optimal solution in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a more challenging goal. We establish concrete computational and sample complexities under several structural assumptions of the model. We hope our study could open up the possibilities of leveraging and even designing different information structures, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL.
comment: Final journal version of the ICML 2023 conference paper, accepted to SIAM Journal on Control and Optimization (SICON)
Human-AI Governance (HAIG): A Trust-Utility Approach
This paper introduces the Human-AI Governance (HAIG) framework, contributing to the AI governance (AIG) field by foregrounding the relational dynamics between human and AI actors rather than treating AI systems as objects of governance alone. Current categorical frameworks (e.g., human-in-the-loop models) inadequately capture how AI systems evolve from tools to partners, particularly as foundation models demonstrate emergent capabilities and multi-agent systems exhibit autonomous goal-setting behaviours. As systems are deployed across contexts, agency redistributes in complex patterns that are better represented as positions along continua rather than discrete categories. The HAIG framework operates across three levels: dimensions (Decision Authority, Process Autonomy, and Accountability Configuration), continua (continuous positional spectra along each dimension), and thresholds (critical points along the continua where governance requirements shift qualitatively). The framework's dimensional architecture is level-agnostic, applicable from individual deployment decisions and organisational governance through to sectorial comparison and national and international regulatory design. Unlike risk-based or principle-based approaches that treat governance primarily as a constraint on AI deployment, HAIG adopts a trust-utility orientation - reframing governance as the condition under which human-AI collaboration can realise its potential, calibrating oversight to specific relational contexts rather than predetermined categories. Case studies in healthcare and European regulation demonstrate how HAIG complements existing frameworks while offering a foundation for adaptive regulatory design that anticipates governance challenges before they emerge.
comment: 35 pages including references and appendix, 28 pages core text, 3 figures, 3 tables
Epistemic diversity across language models mitigates knowledge collapse
As artificial intelligence (AI) becomes more widely used, concerns are growing that model collapse could lead to knowledge collapse, i.e. a degradation to a narrow and inaccurate set of ideas. Prior work has demonstrated single-model collapse, defined as performance decay in an AI model trained on its own outputs. Inspired by ecology, we ask whether increasing AI ecosystem diversity (i.e., the number of distinct models) can mitigate such collapse. To study the effect of diversity on model performance, we extend the single-model approach by segmenting the training data across an increasing number of language models and evaluating the resulting ecosystems of models over ten self-training iterations. We find that training a single model on the entire dataset improves performance only in the short term but amplifies collapse over longer horizons. Specifically, we observe that the optimal diversity level (i.e., the level that maximizes performance) increases monotonically with the number of self-training iterations. The observed effect is robust across various experimental settings, including different model families, parameter sizes, mixing human- and model-generated data, and temperature sampling methods, demonstrating the significance of ecosystem diversity for mitigating collapse. Moreover, our experiments with increased model and dataset sizes indicate that scaling up the system can amplify collapse in highly homogeneous ecosystems, thereby increasing the diversity benefits. In the presence of AI monoculture, our results suggest the need to monitor (dis)agreement among AI systems and to incentivize more domain- and community-specific models to ensure successful knowledge production in the long run.
comment: 30 pages, 21 figures. v2 changelog: added experimental variations, updated theory, writing revisions, updated metadata
Systems and Control (EESS)
Maximum-Entropy Random Walks on Hypergraphs
Random walks are fundamental tools for analyzing complex networked systems, including social networks, biological systems, and communication infrastructures. While classical random walks focus on pairwise interactions, many real-world systems exhibit higher-order interactions naturally modeled by hypergraphs. Existing random walk models on hypergraphs often focus on undirected structures or do not incorporate entropy-based inference, limiting their ability to capture directional flows, uncertainty, or information diffusion in complex systems. In this article, we develop a maximum-entropy random walk framework on directed hypergraphs with two interaction mechanisms: broadcasting where a pivot node activates multiple receiver nodes and merging where multiple pivot nodes jointly influence a receiver node. We infer a transition kernel via a Kullback--Leibler divergence projection onto constraints enforcing stochasticity and stationarity. The resulting optimality conditions yield a multiplicative scaling form, implemented using Sinkhorn--Schrödinger-type iterations with tensor contractions. We further analyze ergodicity, including projected linear kernels for broadcasting and tensor spectral criteria for polynomial dynamics in merging. The effectiveness of our framework is demonstrated with both synthetic and real-world examples.
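To make the "multiplicative scaling form" concrete, the following is a minimal sketch of the ordinary pairwise-graph analogue only. It does not implement the directed-hypergraph tensor version from the abstract. It assumes a strictly positive prior kernel `Q` and a target stationary distribution `pi` (both hypothetical inputs), and uses classical Sinkhorn matrix scaling to compute the KL projection of the prior edge flow diag(pi) Q onto flows whose row and column sums both equal `pi`.

```python
def sinkhorn_kernel(Q, pi, iters=500):
    """Scale a prior kernel Q so that the result is row-stochastic and has pi
    as a stationary distribution (pairwise analogue, plain Sinkhorn scaling).
    """
    n = len(pi)
    # Prior edge flow K_ij = pi_i * Q_ij.
    K = [[pi[i] * Q[i][j] for j in range(n)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(iters):
        # Row scaling: enforce sum_j F_ij = pi_i (stochasticity of P).
        u = [pi[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Column scaling: enforce sum_i F_ij = pi_j (stationarity of pi).
        v = [pi[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    # Flow F = diag(u) K diag(v); transition kernel P_ij = F_ij / pi_i.
    return [[u[i] * K[i][j] * v[j] / pi[i] for j in range(n)] for i in range(n)]


# Hypothetical 3-node example with a strictly positive prior.
Q = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]
pi = [0.5, 0.3, 0.2]
P = sinkhorn_kernel(Q, pi)
row_sums = [sum(row) for row in P]
stationary = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
```

The resulting kernel has the form P_ij = u_i Q_ij v_j, i.e. the prior multiplied by one scaling factor per source node and one per target node, which is the matrix counterpart of the scaling structure the abstract refers to.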
Decentralized Cooperative Localization for Multi-Robot Systems with Asynchronous Sensor Fusion
Decentralized cooperative localization (DCL) is a promising approach for nonholonomic mobile robots operating in GPS-denied environments with limited communication infrastructure. This paper presents a DCL framework in which each robot performs localization locally using an Extended Kalman Filter, while sharing measurement information during update stages only when communication links are available and companion robots are successfully detected by LiDAR. The framework preserves cross-correlation consistency among robot state estimates while handling asynchronous sensor data with heterogeneous sampling rates and accommodating accelerations during dynamic maneuvers. Unlike methods that require pre-aligned coordinate systems, the proposed approach allows robots to initialize with arbitrary reference-frame orientations and achieves automatic alignment through transformation matrices in both the prediction and update stages. To improve robustness in feature-sparse environments, we introduce a dual-landmark evaluation framework that exploits both static environmental features and mobile robots as dynamic landmarks. The proposed framework enables reliable detection and feature extraction during sharp turns, while prediction accuracy is improved through information sharing from mutual observations. Experimental results in both Gazebo simulation and real-world basement environments show that DCL outperforms centralized cooperative localization (CCL), achieving a 34% reduction in RMSE, while the dual-landmark variant yields an improvement of 56%. These results demonstrate the applicability of DCL to challenging domains such as enclosed spaces, underwater environments, and feature-sparse terrains where conventional localization methods are ineffective.
comment: Presented at the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)
Numerical benchmark for damage identification in Structural Health Monitoring
The availability of a dataset for validation and verification purposes of novel data-driven strategies and/or hybrid physics-data approaches is currently one of the most pressing challenges in the engineering field. Data ownership, security, access and metadata handiness are currently hindering advances across many fields, particularly in Structural Health Monitoring (SHM) applications. This paper presents a simulated SHM dataset, comprised of dynamic and static measurements (i.e., acceleration and displacement), and includes the conceptual framework designed to generate it. The simulated measurements were generated to incorporate the effects of Environmental and Operational Variations (EOVs), different types of damage, measurement noise and sensor faults and malfunctions, in order to account for scenarios that may occur during real acquisitions. A fixed-fixed steel beam structure was chosen as reference for the numerical benchmark. The simulated monitoring was operated under the assumptions of a Single Degree of Freedom (SDOF) for generating acceleration records and of the Euler-Bernoulli beam for the simulated displacement measurements. The generation process involved the use of parallel computation, which is detailed within the provided open-source code. The generated data is also available open-source, thus ensuring reproducibility, repeatability and accessibility for further research. The comprehensive description of data types, formats, and collection methodologies makes this dataset a valuable resource for researchers aiming to develop or refine SHM techniques, fostering advancements in the field through accessible, high-quality synthetic data.
comment: Submitted for peer review to Data Centric Engineering, Cambridge University Press
Flight through Narrow Gaps with Morphing-Wing Drones
The size of a narrow gap traversable by a fixed-wing drone is limited by its wingspan. Inspired by birds, here, we enable the traversal of a gap of sub-wingspan width and height using a morphing-wing drone capable of temporarily sweeping in its wings mid-flight. This maneuver poses control challenges due to sudden lift loss during gap-passage at low flight speeds and the need for precisely timed wing-sweep actuation ahead of the gap. To address these challenges, we first develop an aerodynamic model for general wing-sweep morphing drone flight including low flight speeds and post-stall angles of attack. We integrate longitudinal drone dynamics into an optimal reference trajectory generation and Nonlinear Model Predictive Control framework with runtime adaptive costs and constraints. Validated on a 130 g wing-sweep-morphing drone, our method achieves an average altitude error of 5 cm during narrow-gap passage at forward speeds between 5 and 7 m/s, whilst enforcing fully swept wings near the gap across variable threshold distances. Trajectory analysis shows that the drone can compensate for lift loss during gap-passage by accelerating and pitching upwards ahead of the gap to an extent that differs between reference trajectory optimization objectives. We show that our strategy also allows for accurate gap passage on hardware whilst maintaining a constant forward flight speed reference and near-constant altitude.
Approximate Reduced Lindblad Dynamics via Algebraic and Adiabatic Methods
We present an algebraic framework for approximate model reduction of Markovian open quantum dynamics that guarantees complete positivity and trace preservation by construction. First, we show that projecting a Lindblad generator on its center manifold -- the space spanned by eigenoperators with purely imaginary eigenvalues -- yields an asymptotically exact reduced quantum dynamical semigroup whose dynamics is unitary, with exponentially decaying transient error controlled by the generator's spectral gap. Second, for analytic perturbations of a Lindblad generator with a tractable center manifold, we propose a perturbative reduction that keeps the reduced space fixed at the unperturbed center manifold. The resulting generator is shown to remain a valid Lindbladian for arbitrary perturbation strengths, and explicit finite-time error bounds that quantify leakage from the unperturbed center sector are provided. We further clarify the connection to adiabatic elimination methods, both by showing how the algebraic reduction can be directly related to a first-order adiabatic elimination and by providing sufficient conditions under which the latter method can be applied while preserving complete positivity. We showcase the usefulness of our techniques in dissipative many-body quantum systems exhibiting non-stationary long-time dynamics.
Robust Parametric Microgrid Dispatch Under Endogenous Uncertainty of Operation- and Temperature-Dependent Battery Degradation
Batteries play a critical role in microgrid energy management by ensuring power balance, enhancing renewable utilization, and reducing operational costs. However, battery degradation poses a significant challenge, particularly under extreme temperatures. This paper investigates the optimal trade-off between battery degradation and operational costs in microgrid dispatch to find a robust, cost-effective strategy from a full life-cycle perspective. A key challenge arises from the endogenous uncertainty (or decision-dependent uncertainty, DDU) of battery degradation: dispatch decisions influence the probability distribution of battery degradation, while degradation in turn changes the battery operation model and thus affects dispatch. In this paper, we first develop an XGBoost-based probabilistic degradation model trained on experimental data across varying temperature conditions. We then formulate a parametric model predictive control (MPC) framework for microgrid dispatch, where the weight parameters of the battery degradation penalty terms are tuned through long-term simulation of degradation and dispatch interactions. Case studies validate the effectiveness of the proposed approach.
comment: 8 pages, 4 figures
Emergency-Aware and Frequency-Constrained HVDC Planning for A Multi-Area Asynchronously Interconnected Grid
High-voltage direct current (HVDC) technology has played a crucial role in long-distance transmission of renewable power generation. However, the integration of large-capacity HVDC lines introduces significant frequency security challenges during HVDC fault emergencies. This paper proposes an emergency-aware and frequency-constrained HVDC planning method to optimize the capacity of inter-area HVDC tie-lines in a multi-area asynchronously interconnected grid. First, a coordinated emergency frequency control scheme is proposed to allocate the emergency control resources during HVDC faults. Then, an enhanced system frequency response model integrating event-driven emergency frequency control is developed and a weighted oblique decision tree approach is employed to extract frequency nadir security constraints. The proposed planning model considers all potential HVDC fault emergencies while treating candidate HVDC capacities as decision variables. Simulation results demonstrate superior performance in balancing economic efficiency with frequency security requirements, providing a practical solution for inter-area HVDC planning.
Risk-Based Dynamic Thermal Rating in Distribution Transformers via Probabilistic Forecasting
Low voltage (LV) distribution transformers face accelerating demand growth while replacement lead times and costs continue to rise, making improved utilisation of existing assets essential. Static and conservative protection devices (PDs) in distribution transformers are inflexible and limit the available headroom of the transformer. This paper presents a probabilistic framework for dynamically forecasting optimal thermal protection settings. The proposed approach directly predicts the day-ahead scale factor which maximises the dynamic thermal rating of the transformer from historical load, temperature, and metadata using clustered quantile regression models trained on 644 UK LV transformers. Probabilistic forecasting quantifies overheating risk directly through the prediction percentile, enabling risk-informed operational decisions. Results show a 10-12% additional capacity gain compared to static settings, with hotspot temperature risk matching the selected percentile, including under realistic temperature forecast errors. These results demonstrate a practical approach for distribution network operators to take advantage of PDs with adaptive settings to maximise capacity and manage risk on operational time scales.
comment: Submitted to 24th Power Systems Computation Conference (PSCC 2026). 8 pages, 8 figures
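Quantile regression models of the kind mentioned in this abstract are typically fitted by minimising the pinball (quantile) loss, which is what lets a chosen percentile be read as a risk level. The sketch below shows that loss in plain Python. The scale-factor values are made up for illustration, and the paper's actual models, features, and clustering step are not reproduced here.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile level q (0 < q < 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true, q, candidates):
    """Pick the constant forecast with the lowest pinball loss."""
    return min(
        candidates,
        key=lambda c: pinball_loss(y_true, [c] * len(y_true), q),
    )


# Hypothetical observed scale factors (illustrative values only).
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
chosen = best_constant(observed, 0.85, observed)
```

The asymmetric weighting is the point of the design: a higher `q` penalises under-prediction more heavily, so the fitted value moves toward the upper tail of the observations, and about a fraction `q` of outcomes are expected to fall at or below it.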
Exploiting Parallelism in a QPALM-based Solver for Optimal Control
We discuss the opportunities for parallelization in the recently proposed QPALM-OCP algorithm, a solver tailored to quadratic programs arising in optimal control. A significant part of the computational work can be carried out independently for the different stages in the optimal control problem. We exploit this specific structure to apply parallelization and vectorization techniques in an optimized C++ implementation of the method. Results for optimal control benchmark problems and comparisons to the original QPALM method are provided.
comment: Presented at Robotics: Science and Systems 2024 Workshop: Frontiers of optimization for robotics (RSS 2024), Delft, The Netherlands, July 2024
Rotatable Antenna Enabled Covert Communication
Unlike conventional fixed-antenna architectures, the rotatable antenna (RA) has shown great potential in enhancing wireless communication performance by exploiting additional spatial degrees of freedom (DoFs) in a cost-effective manner. In this letter, we propose a novel RA-enabled covert communication system, where an RA array-based transmitter (Alice) sends covert information to a legitimate user (Bob) in the presence of multiple wardens (Willies). To maximize the covert rate, we optimize the transmit beamforming vector and the rotational angles of individual RAs, subject to the constraints on covertness, transmit power, and antenna rotational range. To address the formulated non-convex problem, we decompose it into two subproblems and propose an efficient alternating optimization (AO) algorithm to solve the two subproblems iteratively, where the second-order cone programming (SOCP) method and successive convex approximation (SCA) approach are applied separately. Simulation results demonstrate that the proposed RA-enabled covert communication system can provide significantly superior covertness performance to other benchmark schemes.
Multi-Agent Reinforcement Learning for UAV-Based Chemical Plume Source Localization
Undocumented orphaned wells pose significant health and environmental risks to nearby communities by releasing toxic gases and contaminating water sources, with methane emissions being a primary concern. Traditional survey methods such as magnetometry often fail to detect older wells effectively. In contrast, aerial in-situ sensing using unmanned aerial vehicles (UAVs) offers a promising alternative for methane emission detection and source localization. This study presents a robust and efficient framework based on a multi-agent deep reinforcement learning (MARL) algorithm for the chemical plume source localization (CPSL) problem. The proposed approach leverages virtual anchor nodes to coordinate UAV navigation, enabling collaborative sensing of gas concentrations and wind velocities through onboard and shared measurements. Source identification is achieved by analyzing the historical trajectory of anchor node placements within the plume. Comparative evaluations against the fluxotaxis method demonstrate that the MARL framework achieves superior performance in both localization accuracy and operational efficiency.
Forward and Backward Reachability Analysis of Closed-loop Recurrent Neural Networks via Hybrid Zonotopes
Recurrent neural networks (RNNs) are widely employed to model complex dynamical systems due to their hidden-state structure, which inherently captures temporal dependencies. This work presents a hybrid zonotope-based approach for computing exact forward and backward reachable sets of closed-loop RNN systems with ReLU activation functions. The method formulates state-pair sets to compute reachable sets as hybrid zonotopes without requiring unrolling. To improve scalability, a tunable relaxation scheme is proposed that ranks unstable ReLU units across all layers using a triangle-area score and selectively applies convex relaxations within a fixed binary limit in the hybrid zonotopes. This scheme enables an explicit tradeoff between computational complexity and approximation accuracy, with exact reachability as a special case. In addition, a sufficient condition is derived to certify the safety of closed-loop RNN systems. Numerical examples demonstrate the effectiveness of the proposed approach.
comment: 8 pages. Accepted at the American Control Conference 2026
ISAC-Enabled Multi-UAV Collaborative Target Sensing for Low-Altitude Economy
Integrated sensing and communication (ISAC) has attracted growing research interest as a means to facilitate the large-scale development of the low-altitude economy (LAE). However, the high dynamics of low-altitude targets may overwhelm fixed ISAC systems, particularly at the edge of their coverage or in blind zones. Owing to its high flexibility, unmanned aerial vehicle (UAV)-assisted ISAC can provide more freedom of design to enhance communication and sensing abilities. In this paper, we propose an ISAC-enabled multi-UAV dynamic collaborative target sensing scheme, where UAVs can dynamically adjust their flight and resource allocation for cooperative sensing of a mobile target by communicating with the terrestrial cellular network via ISAC signals. To achieve precise sensing of the dynamic target, the posterior Cramér-Rao bound (PCRB) for the target state is derived. Subsequently, the PCRB minimization problem is formulated by jointly optimizing the UAV-BS association, UAVs' trajectories and bandwidth allocation, subject to the communication requirements for the UAVs. However, the problem is challenging since it involves a non-convex and implicit objective function with coupled optimization variables. For a fast implementation of sensing and tracking, we propose a low-complexity iterative algorithm that can efficiently obtain a sub-optimal solution to the problem. Specifically, the UAV-BS association is first determined by the communication-optimal solution. Then the UAVs' trajectories and bandwidth allocation are alternately optimized based on the descent direction search algorithm. Finally, numerical results are provided to validate the superiority of our proposed designs as compared to various benchmarks.
Slack More, Predict Better: Proximal Relaxation for Probabilistic Latent Variable Model-based Soft Sensors
Nonlinear Probabilistic Latent Variable Models (NPLVMs) are a cornerstone of soft sensor modeling due to their capacity for uncertainty delineation. However, conventional NPLVMs are trained using amortized variational inference, where neural networks parameterize the variational posterior. While this parameterization facilitates model implementation, it converts a distributional optimization problem in an infinite-dimensional function space into parameter optimization in a finite-dimensional parameter space, which introduces an approximation error gap and thereby degrades soft sensor modeling accuracy. To alleviate this issue, we introduce KProxNPLVM, a novel NPLVM that instead relaxes the objective itself to improve the NPLVM's performance. Specifically, we first prove that the conventional approach induces an approximation error. Based on this, we design the Wasserstein distance as the proximal operator to relax the learning objective, yielding a new variational inference strategy derived from solving this relaxed optimization problem. On this foundation, we provide a rigorous derivation of KProxNPLVM's optimization implementation, prove that the algorithm converges and thereby sidesteps the approximation error, and combine these results into the proposed KProxNPLVM. Finally, extensive experiments on synthetic and real-world industrial datasets are conducted to demonstrate the efficacy of the proposed KProxNPLVM.
comment: This paper has been provisionally accepted for publication in the "IEEE Transactions on Industrial Informatics"
SliceFed: Federated Constrained Multi-Agent DRL for Dynamic Spectrum Slicing in 6G
Dynamic spectrum slicing is a critical enabler for 6G Radio Access Networks (RANs), allowing the coexistence of heterogeneous services. However, optimizing resource allocation in dense, interference-limited deployments remains challenging due to non-stationary channel dynamics, strict Quality-of-Service (QoS) requirements, and the need for data privacy. In this paper, we propose SliceFed, a novel Federated Constrained Multi-Agent Deep Reinforcement Learning (F-MADRL) framework. SliceFed formulates the slicing problem as a Constrained Markov Decision Process (CMDP) where autonomous gNB agents maximize spectral efficiency while explicitly satisfying inter-cell interference budgets and hard ultra-reliable low-latency communication (URLLC) latency deadlines. We employ a Lagrangian primal-dual approach integrated with Proximal Policy Optimization (PPO) to enforce constraints, while Federated Averaging enables collaborative learning without exchanging raw local data. Extensive simulations in a dense multi-cell environment demonstrate that SliceFed converges to a stable, safety-aware policy. Unlike heuristic and unconstrained baselines, SliceFed achieves nearly 100% satisfaction of 1~ms URLLC latency deadlines and exhibits superior robustness to traffic load variations, verifying its potential for reliable and scalable 6G spectrum management.
comment: 4 figures, 3 algorithms charts
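The two mechanisms named in the SliceFed abstract, a Lagrangian primal-dual update and Federated Averaging, can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the scalar constraint cost, the step size, and the flat weight vectors are all hypothetical, and the PPO policy update itself is omitted.

```python
from typing import List, Sequence


def lagrangian_reward(reward: float, constraint_cost: float, lam: float) -> float:
    """Reward seen by the policy update: task reward minus the priced constraint cost."""
    return reward - lam * constraint_cost


def dual_ascent_step(
    lam: float, constraint_cost: float, budget: float, step_size: float
) -> float:
    """Raise the multiplier when the cost exceeds its budget, lower it otherwise.

    The projection max(0, .) keeps the multiplier non-negative.
    """
    return max(0.0, lam + step_size * (constraint_cost - budget))


def federated_average(
    local_weights: Sequence[Sequence[float]], num_samples: Sequence[int]
) -> List[float]:
    """Sample-weighted average of local parameter vectors (FedAvg aggregation).

    Only parameters are combined; raw local data never leaves an agent.
    """
    total = sum(num_samples)
    dim = len(local_weights[0])
    return [
        sum(w[i] * n for w, n in zip(local_weights, num_samples)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    lam = dual_ascent_step(lam=0.5, constraint_cost=2.0, budget=1.0, step_size=0.1)
    print(lam, lagrangian_reward(1.0, 2.0, lam))
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

In a constrained setting like the one described, each agent would hold one multiplier per constraint (for example, one for the interference budget and one for the latency deadline) and alternate policy updates with dual steps.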
Conformalized Data-Driven Reachability Analysis with PAC Guarantees
Data-driven reachability analysis computes over-approximations of reachable sets directly from noisy data. Existing deterministic methods require either known noise bounds or system-specific structural parameters such as Lipschitz constants. We propose Conformalized Data-Driven Reachability (CDDR), a framework that provides Probably Approximately Correct (PAC) coverage guarantees through the Learn Then Test (LTT) calibration procedure, requiring only that calibration trajectories be independently and identically distributed. CDDR is developed for three settings: linear time-invariant (LTI) systems with unknown process noise distributions, LTI systems with bounded measurement noise, and general nonlinear systems including non-Lipschitz dynamics. Experiments on a 5-dimensional LTI system under Gaussian and heavy-tailed Student-t noise and on a 2-dimensional non-Lipschitz system with fractional damping demonstrate that CDDR achieves valid coverage where deterministic methods do not provide formal guarantees. Under anisotropic noise, a normalized score function reduces the reachable set volume while preserving the PAC guarantee.
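A calibration step of the Learn Then Test kind can be sketched as follows. The sketch assumes each i.i.d. calibration trajectory has been reduced to one scalar nonconformity score (for example, how far it leaves a nominal reachable set) and that the reachable set is inflated by a radius chosen from a finite grid. It uses an exact binomial tail p-value with fixed-sequence testing, which is one standard LTT instantiation; the score definition, the grid, and the function names are assumptions, not details from the paper.

```python
from math import comb
from typing import Optional, Sequence


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def calibrate_radius(
    scores: Sequence[float],
    candidate_radii: Sequence[float],
    alpha: float,
    delta: float,
) -> Optional[float]:
    """Smallest radius certified to have miscoverage <= alpha w.p. >= 1 - delta.

    Radii are tested from largest (most conservative) to smallest. For each
    radius, the null hypothesis "miscoverage exceeds alpha" is rejected when the
    binomial tail p-value is at most delta; testing stops at the first failure.
    Returns None if even the largest radius cannot be certified.
    """
    n = len(scores)
    chosen = None
    for radius in sorted(candidate_radii, reverse=True):
        misses = sum(1 for s in scores if s > radius)
        p_value = binomial_cdf(misses, n, alpha)
        if p_value > delta:
            break
        chosen = radius
    return chosen


if __name__ == "__main__":
    demo_scores = [i / 100 for i in range(1, 101)]
    demo_grid = [j / 100 for j in range(1, 101)]
    print(calibrate_radius(demo_scores, demo_grid, alpha=0.1, delta=0.05))
```

Note that the certified radius is more conservative than the plain empirical (1 - alpha) quantile of the scores: the extra margin pays for the confidence level 1 - delta.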
Technology configurations for decarbonizing residential heat supply through district heating and implications for the electricity network
District heating networks (DHNs) have significant potential to decarbonize residential heating and accelerate the energy transition. However, designing carbon-neutral DHNs requires balancing several objectives, including economic costs, social acceptance, long-term uncertainties, and grid-integration challenges from electrification. By combining modeling-to-generate-alternatives with power flow simulation techniques, we develop a decision-support method for designing carbon-neutral DHNs that are cost-effective, socially acceptable, robust to future risks, and impose minimal impacts on the electricity grid. Applying our method to a Dutch case, we find substantial diversity in how carbon-neutral DHNs can be designed. The flexibility in technology choice, sizing, and location enables accommodating different real-world needs and achieving high electrification levels without increasing grid loading. For instance, intelligently located heat pumps and thermal storage can limit grid stress even when renewable baseload heat sources and green-fuel boilers are scarce. Using our method, planners can explore diverse carbon-neutral DHN designs and identify the design that best balances stakeholders' preferences.
Integrated Online Monitoring and Adaption of Process Model Predictive Controllers
This paper addresses the design of an event-triggered, data-based, and performance-oriented adaption method for model predictive control (MPC). The performance of such a strategy strongly depends on the accuracy of the prediction model, which may require online adaption to prevent performance degradation under changing operating conditions. Unlike existing methods that continuously update model and control parameters from data, potentially leading to catastrophic forgetting and unnecessary control modifications, we propose a novel approach based on statistical monitoring of closed-loop performance indicators. This framework enables the detection of performance degradation, and, when required, controller adaption is performed via reinforcement learning and identification techniques. The proposed strategy is validated on a high-fidelity simulation of a district heating system benchmark.
comment: 6 pages, 3 figures, submitted to IEEE L-CSS
Physics-Guided Inverse Design of Optical Waveforms for Nonlinear Electromagnetic Dynamics
Structured optical waveforms are emerging as powerful control fields for the next generation of complex photonic and electromagnetic systems, where the temporal structure of light can determine the ultimate performance of scientific instruments. However, identifying optimal optical drive fields in strongly nonlinear regimes remains challenging because the mapping between optical inputs and system response is high-dimensional and typically accessible only through computationally expensive simulations. Here, we present a physics-guided deep learning framework for the inverse design of optical temporal waveforms. By training a lightweight surrogate model on simulations, the method enables gradient-based synthesis of optical profiles that compensate nonlinear field distortions in driven particle-field systems. As a representative application, we apply the approach to the generation of electron beams used in advanced photon and particle sources. The learned optical waveform actively suppresses extrinsic emittance growth by more than 52% compared with conventional Gaussian operation and by approximately 9% relative to the theoretical flattop limit in simulation. We further demonstrate experimental feasibility by synthesizing the predicted waveform using a programmable pulse-shaping platform; incorporating the measured optical profile into beamline simulations yields a 31% reduction in the extrinsic emittance contribution. Beyond accelerator applications, this work establishes a general approach to physics-guided inverse design of optical control fields, enabling structured light to approach fundamental performance limits in nonlinear photonic and high-frequency electromagnetic systems.
comment: Under review
Compensation of Input/Output Delays for Retarded Systems by Sequential Predictors: A Lyapunov-Halanay Method
This paper presents a Lyapunov-Halanay method to study global asymptotic stabilization (GAS) of nonlinear retarded systems subject to large constant delays in input/output - a challenging problem due to their inherent destabilizing effects. Under the conditions of global Lipschitz continuity (GLC) and global exponential stabilizability (GES) of the retarded system without input delay, a state feedback controller is designed based on sequential predictors to make the closed-loop retarded system GAS. Moreover, if the retarded system with no output delay permits a global exponential observer, a dynamic output compensator is also constructed based on sequential predictors, achieving GAS of the corresponding closed-loop retarded system with input/output delays. The predictor based state and output feedback stabilization results are then extended to a broader class of nonlinear retarded systems with input/output delays, which may not be GES but satisfy global asymptotic stabilizability/observability and suitable ISS conditions. As an application, a pendulum system with delays in the state, input and output is used to illustrate the effectiveness of the proposed state and output feedback control strategies based on sequential predictors.
Ising-ReRAM: A Low Power Ising Machine ReRAM Crossbar for NP Problems ISCA
Computational workloads are growing exponentially, driving power consumption to unsustainable levels. Efficiently distributing large-scale networks is an NP-Complete problem equivalent to Boolean satisfiability (SAT), making it one of the core challenges in modern computation. To address this, physics- and device-inspired methods such as Ising systems have been explored for solving SAT more efficiently. In this work, we implement an Ising model equivalent of the 3-SAT problem using a ReRAM crossbar fabricated in the Skywater 130 nm CMOS process. Our ReRAM-based algorithm achieves 91.0% accuracy in matrix representation across iterative reprogramming cycles. Additionally, we establish a foundational energy profile by measuring the energy costs of small sub-matrix structures within the problem space, demonstrating a sublinear growth trajectory when combining sub-matrices into larger problems. These results demonstrate a promising platform for developing scalable architectures to accelerate NP-Complete problem solving.
comment: 4 pages + 1 page reference, 4 figures, 2 tables, targeting IEEE conference (e.g. ISCAS)
Push, Press, Slide: Mode-Aware Planar Contact Manipulation via Reduced-Order Models IROS 2026
Non-prehensile planar manipulation, including pushing and press-and-slide, is critical for diverse robotic tasks, but notoriously challenging due to hybrid contact mechanics, under-actuation, and asymmetric friction limits that traditionally necessitate computationally expensive iterative control. In this paper, we propose a mode-aware framework for planar manipulation with one or two robotic arms based on contact topology selection and reduced-order kinematic modeling. Our core insight is that complex wrench-twist limit surface mechanics can be abstracted into a discrete library of physically intuitive models. We systematically map various single-arm and bimanual contact topologies to simple non-holonomic formulations, e.g. unicycle for simplified press-and-slide motion. By anchoring trajectory generation to these reduced-order models, our framework computes the required object wrench and distributes feasible, friction-bounded contact forces via a direct algebraic allocator. We incorporate manipulator kinematics to ensure long-horizon feasibility and demonstrate our fast, optimization-free approach in simulation across diverse single-arm and bimanual manipulation tasks.
comment: 8 pages, 13 figures. Submitted to IEEE IROS 2026
Optimizing Task Completion Time Updates Using POMDPs
Managing announced task completion times is a fundamental control problem in project management. While extensive research exists on estimating task durations and task scheduling, the problem of when and how to update completion times communicated to stakeholders remains understudied. Organizations must balance announcement accuracy against the costs of frequent timeline updates, which can erode stakeholder trust and trigger costly replanning. Despite the prevalence of this problem, current approaches rely on static predictions or ad-hoc policies that fail to account for the sequential nature of announcement management. In this paper, we formulate the task announcement problem as a Partially Observable Markov Decision Process (POMDP) where the control policy must decide when to update announced completion times based on noisy observations of true task completion. Since most state variables (current time and previous announcements) are fully observable, we leverage the Mixed Observability MDP (MOMDP) framework to enable more efficient policy optimization. Our reward structure captures the dual costs of announcement errors and update frequency, enabling synthesis of optimal announcement control policies. Using off-the-shelf solvers, we generate policies that act as feedback controllers, adaptively managing announcements based on belief state evolution. Simulation results demonstrate significant improvements in both accuracy and announcement stability compared to baseline strategies, achieving up to 75% reduction in unnecessary updates while maintaining or improving prediction accuracy.
comment: 7 pages, 6 figures, submitted to American Control Conference 2026
Hybrid Energy-Aware Reward Shaping: A Unified Lightweight Physics-Guided Methodology for Policy Optimization
Deep reinforcement learning excels in continuous control but often requires extensive exploration, while physics-based models demand complete equations and suffer cubic complexity. This study proposes Hybrid Energy-Aware Reward Shaping (H-EARS), unifying potential-based reward shaping with energy-aware action regularization. H-EARS constrains action magnitude while balancing task-specific and energy-based potentials via functional decomposition, achieving linear complexity O(n) by capturing dominant energy components without full dynamics. We establish a theoretical foundation including: (1) functional independence for separate task/energy optimization; (2) energy-based convergence acceleration; (3) convergence guarantees under function approximation; and (4) approximate potential error bounds. Lyapunov stability connections are analyzed as heuristic guides. Experiments across baselines show improved convergence, stability, and energy efficiency. Vehicle simulations validate applicability in safety-critical domains under extreme conditions. Results confirm that integrating lightweight physics priors enhances model-free RL without complete system models, enabling transfer from lab research to industrial applications.
comment: 17 pages, 27 figures
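The combination of potential-based reward shaping with an action-magnitude penalty described above can be sketched in a few lines. The shaping term gamma * Phi(s') - Phi(s) is the standard potential-based form; the weighted sum of a task potential and an energy potential, the quadratic action penalty, and all parameter names and default values are assumptions for illustration rather than the H-EARS definitions.

```python
from typing import Sequence


def combined_potential(
    task_potential: float,
    energy_potential: float,
    task_weight: float = 1.0,
    energy_weight: float = 1.0,
) -> float:
    """Hypothetical decomposition: a weighted sum of two separate potentials."""
    return task_weight * task_potential + energy_weight * energy_potential


def shaped_reward(
    reward: float,
    phi_s: float,
    phi_next: float,
    action: Sequence[float],
    gamma: float = 0.99,
    action_weight: float = 0.01,
) -> float:
    """Environment reward + potential-based shaping - quadratic action penalty.

    The cost is linear in the action dimension: one pass over the action vector.
    """
    shaping = gamma * phi_next - phi_s
    action_penalty = action_weight * sum(a * a for a in action)
    return reward + shaping - action_penalty


if __name__ == "__main__":
    phi_now = combined_potential(task_potential=-2.0, energy_potential=-0.5)
    phi_after = combined_potential(task_potential=-1.5, energy_potential=-0.4)
    print(shaped_reward(1.0, phi_now, phi_after, action=[0.2, -0.1]))
```

The potential-based term alone is known to leave the optimal policy unchanged; the action penalty is a separate regularizer and does alter the objective, which is why the two are kept as distinct terms here.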
Linear viscoelastic rheological FrBD models
In [1], a new modeling paradigm for developing rate-and-state-dependent, control-oriented friction models was introduced. The framework, termed Friction with Bristle Dynamics (FrBD), combines nonlinear analytical expressions for the friction coefficient with constitutive equations for bristle-like elements. Within the FrBD framework, this letter introduces two novel formulations based on the two most general linear viscoelastic models for solids: the Generalized Maxwell (GM) and Generalized Kelvin-Voigt (GKV) elements. Both are analyzed in terms of boundedness and passivity, revealing that these properties are satisfied for any physically meaningful parametrization. An application of passivity for control design is also illustrated, considering an example from robotics. The findings of this letter systematically integrate rate-and-state dynamic friction models with linear viscoelasticity.
comment: 6 pages, 3 figures. Under review at IEEE LCSS
Identifying Network Structure of Nonlinear Dynamical Systems: Contraction and Kuramoto Oscillators
In this work, we study the identifiability of network structures (i.e., topologies) for networked nonlinear systems when partial measurements of the nodal dynamics are taken. We explore scenarios where different candidate structures can yield similar measurements, thus limiting identifiability. To do so, we apply the contraction theory framework to facilitate comparisons between different networks. We show that semicontraction in the observable space is a sufficient condition for two systems to become indistinguishable from one another based on partial measurements. We apply this framework to study networks of Kuramoto oscillators, and discuss scenarios in which different network structures (both connected and disconnected) become indistinguishable.
comment: To appear 2026 ACC
Identifying Network Structure of Linear Dynamical Systems: Observability and Edge Misclassification
This work studies the limitations of uniquely identifying the structure (i.e., topology) of a networked linear system from partial measurements of its nodal dynamics. In general, many networks can be consistent with these measurements; this is a consideration often neglected by standard network inference methods. We show that the networks in this space are related through the nullspace of the observability matrix for the true network. We establish relevant metrics to investigate this space, including an analytic characterization of the most structurally dissimilar network that can be inferred, as well as the possibility of mis-inferring the presence or absence of edges. In simulations, we find that when observing over 6% of nodes in random network models (e.g., Erdős-Rényi and Watts-Strogatz), approximately 99% of edges are correctly classified. Extending this discussion, we construct a family of networks that keep measurements $\epsilon$-close to each other, and connect the identifiability of these networks to the spectral properties of an augmented observability Gramian.
comment: To appear 2026 ACC
Online Slip Detection and Friction Coefficient Estimation for Autonomous Racing
Accurate knowledge of the tire-road friction coefficient (TRFC) is essential for vehicle safety, stability, and performance, especially in autonomous racing, where vehicles often operate at the friction limit. However, TRFC cannot be directly measured with standard sensors, and existing estimation methods either depend on vehicle or tire models with uncertain parameters or require large training datasets. In this paper, we present a lightweight approach for online slip detection and TRFC estimation. Our approach relies solely on IMU and LiDAR measurements and the control actions, without special dynamical or tire models, parameter identification, or training data. Slip events are detected in real time by comparing commanded and measured motions, and the TRFC is then estimated directly from observed accelerations under no-slip conditions. Experiments with a 1:10-scale autonomous racing car across different friction levels demonstrate that the proposed approach achieves accurate and consistent slip detections and friction coefficients, with results closely matching ground-truth measurements. These findings highlight the potential of our simple, deployable, and computationally efficient approach for real-time slip monitoring and friction coefficient estimation in autonomous driving.
comment: Equal contribution by the first three authors
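One way to read the two steps in the abstract above is sketched below. It assumes slip is flagged when the commanded and measured speeds disagree by more than a threshold, and that any acceleration achieved without slip gives a lower bound on the friction coefficient, mu >= sqrt(ax^2 + ay^2) / g, since the tires must have supplied that force. The threshold value, the sample layout, and the function names are hypothetical, and the paper's actual estimator may differ.

```python
import math
from typing import Iterable, Optional, Tuple

GRAVITY = 9.81  # m/s^2


def detect_slip(v_commanded: float, v_measured: float, threshold: float = 0.2) -> bool:
    """Flag slip when commanded and measured speed differ by more than `threshold` (m/s)."""
    return abs(v_commanded - v_measured) > threshold


def friction_lower_bound(
    samples: Iterable[Tuple[float, float, bool]],
) -> Optional[float]:
    """Largest |a| / g observed over no-slip samples.

    Each sample is (ax, ay, slipping) with planar accelerations in m/s^2.
    Returns None when no no-slip sample is available.
    """
    best = None
    for ax, ay, slipping in samples:
        if slipping:
            continue
        mu = math.hypot(ax, ay) / GRAVITY
        best = mu if best is None else max(best, mu)
    return best


if __name__ == "__main__":
    log = [(3.0, 4.0, False), (9.0, 0.0, True), (0.0, 1.0, False)]
    print(detect_slip(1.0, 0.5), friction_lower_bound(log))
```

The bound tightens as the vehicle is driven closer to the friction limit, which matches the racing setting where the car often operates near that limit.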
Efficient Interference Graph Estimation via Concurrent Flooding
Traditional wisdom for network management allocates network resources separately for the measurement and data transmission tasks. Heavy measurement tasks may take up resources for data transmission and significantly reduce network performance. It is therefore challenging for interference graphs, deemed as incurring heavy measurement overhead, to be used in practice in wireless networks. To address this challenge in wireless sensor networks, we propose to use power as a new dimension for interference graph estimation (IGE) and integrate IGE with concurrent flooding such that IGE can be done simultaneously with flooding using the same frequency-time resources. With controlled and real-world experiments, we show that it is feasible to efficiently achieve IGE via concurrent flooding on the commercial off-the-shelf (COTS) devices by controlling the transmit powers of nodes. We believe that efficient IGE would be a key enabler for the practical use of the existing scheduling algorithms assuming known interference graphs.
comment: Accepted by International Conference on Embedded Wireless Systems and Networking 2023 (EWSN'23), 7 pages with 9 figures, equal contribution by Haifeng Jia and Yichen Wei
The Epistemic Support-Point Filter: Jaynesian Maximum Entropy Meets Popperian Falsification
This paper proves that the Epistemic Support-Point Filter (ESPF) is the unique optimal recursive estimator within the class of epistemically admissible evidence-only filters. Where Bayesian filters minimize mean squared error and are driven toward an assumed truth, the ESPF minimizes maximum entropy and surfaces what has not been proven impossible -- a fundamentally different epistemic commitment with fundamentally different failure modes. Two results locate this theorem within the broader landscape of estimation theory. The first is a unification: the ESPF's optimality criterion is the log-geometric mean of the alpha-cut volume family in the Hölder mean hierarchy. The Popperian minimax bound and the Kalman MMSE criterion occupy the p=+inf and p=0 positions on the same curve. Possibility and probability are not competing frameworks: they are the same ignorance functional evaluated under different alpha-cut geometries. The Kalman filter is the Gaussian specialization of the ESPF's optimality criterion, not a separate invention. The second result is a diagnostic: numerical validation over a 2-day, 877-step Smolyak Level-3 orbital tracking run shows that possibilistic stress manifests through necessity saturation and surprisal escalation rather than MVEE sign change -- a direct consequence of the Hölder ordering, not an empirical observation. Three lemmas establish the result: the Possibilistic Entropy Lemma decomposes the ignorance functional; the Possibilistic Cramér-Rao Bound limits entropy reduction per measurement; the Evidence-Optimality Lemma proves minimum-q selection is the unique minimizer and that any rule incorporating prior possibility risks race-to-bottom bias.
Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming
Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic programming algorithms restricts the utilization of massively parallel computing architectures like GPUs. To bridge this gap, we introduce a fully GPU-native trajectory optimization framework that combines sequential convex programming with a consensus-based alternating direction method of multipliers. By applying a temporal splitting strategy, our algorithm decouples the optimization horizon into independent, per-node subproblems that execute massively in parallel. The entire process runs fully on the GPU, eliminating costly memory transfers and large-scale sparse factorizations. This architecture naturally scales to multi-trajectory optimization. We validate the solver on a quadrotor agile flight task and a Mars powered descent problem using an on-board edge computing platform. Benchmarks reveal a sustained 4x throughput speedup and a 51% reduction in energy consumption over a heavily optimized 12-core CPU baseline. Crucially, the framework saturates the hardware, maintaining over 96% active GPU utilization to achieve planning rates exceeding 100 Hz. Furthermore, we demonstrate the solver's extensibility to robust Model Predictive Control by jointly optimizing dynamically coupled scenarios under stochastic disturbances, enabling scalable and safe autonomy.
Operator Learning for Robust Stabilization of Linear Markov-Jumping Hyperbolic PDEs
This paper addresses the problem of robust stabilization for linear hyperbolic Partial Differential Equations (PDEs) with Markov-jumping parameter uncertainty. We consider a 2 x 2 heterogeneous hyperbolic PDE and propose a control law using operator learning and the backstepping method. Specifically, the backstepping kernels used to construct the control law are approximated with neural operators (NO) in order to improve computational efficiency. The key challenge lies in deriving the stability conditions with respect to the Markov-jumping parameter uncertainty and NO approximation errors. The mean-square exponential stability of the stochastic system is achieved through Lyapunov analysis, indicating that the system can be stabilized if the random parameters are sufficiently close to the nominal parameters on average, and NO approximation errors are small enough. The theoretical results are applied to freeway traffic control under stochastic upstream demands and then validated through numerical simulations.
When Semantics Connect the Swarm: LLM-Driven Fuzzy Control for Cooperative Multi-Robot Underwater Coverage
Underwater multi-robot cooperative coverage remains challenging due to partial observability, limited communication, environmental uncertainty, and the lack of access to global localization. To address these issues, this paper presents a semantics-guided fuzzy control framework that couples Large Language Models (LLMs) with interpretable control and lightweight coordination. Raw multimodal observations are compressed by the LLM into compact, human-interpretable semantic tokens that summarize obstacles, unexplored regions, and Objects Of Interest (OOIs) under uncertain perception. A fuzzy inference system with pre-defined membership functions then maps these tokens into smooth and stable steering and gait commands, enabling reliable navigation without relying on global positioning. Then, we further coordinate multiple robots by introducing semantic communication that shares intent and local context in linguistic form, enabling agreement on who explores where while avoiding redundant revisits. Extensive simulations in unknown reef-like environments show that, under limited sensing and communication, the proposed framework achieves robust OOI-oriented navigation and cooperative coverage with improved efficiency and adaptability, narrowing the gap between semantic cognition and distributed underwater control in GPS-denied, map-free conditions.
comment: Withdrawal for further improvement. The final version will be released in a few months
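The fuzzy inference stage described above can be illustrated with a minimal sketch. It assumes the semantic token has already been reduced to one normalized bearing toward the nearest unexplored region (-1 = hard left, +1 = hard right), uses triangular membership functions, and defuzzifies with a weighted average of rule outputs. The rule base, the membership breakpoints, and the output steering values are hypothetical and do not come from the paper.

```python
def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Triangular membership: 0 outside (left, right), 1 at `peak`, linear in between."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def steering_command(bearing: float) -> float:
    """Map a normalized bearing in [-1, 1] to a steering value in [-0.5, 0.5].

    Rules (assumed): target LEFT -> steer left, AHEAD -> go straight,
    RIGHT -> steer right. Output is the firing-strength-weighted average.
    """
    rules = [
        (triangular(bearing, -2.0, -1.0, 0.0), -0.5),  # LEFT
        (triangular(bearing, -1.0, 0.0, 1.0), 0.0),    # AHEAD
        (triangular(bearing, 0.0, 1.0, 2.0), 0.5),     # RIGHT
    ]
    total = sum(strength for strength, _ in rules)
    if total == 0.0:
        return 0.0  # no rule fires (input out of range): hold course
    return sum(strength * output for strength, output in rules) / total


if __name__ == "__main__":
    for b in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(b, steering_command(b))
```

Because neighboring membership functions overlap, the command varies continuously with the input, which is the property that gives smooth steering without a global position estimate.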
Multi-Period Sparse Optimization for Proactive Grid Blackout Diagnosis
Existing or planned power grids need to evaluate survivability under extreme events, such as a number of peak-load overloading conditions, which could cause system collapses (i.e., blackouts). For realistic extreme events that are correlated or share similar patterns, it is reasonable to expect that the dominant vulnerabilities or failure sources behind them share the same locations but differ in severity. Early warning diagnosis that proactively identifies the key vulnerabilities responsible for a number of system collapses of interest can significantly enhance resilience. This paper proposes a multi-period sparse optimization method, enabling the discovery of persistent failure sources across a sequence of collapsed systems with increasing system stress, such as rising demand or worsening contingencies. This work defines persistency and efficiently integrates persistency constraints to capture the "hidden" evolving vulnerabilities. Circuit-theory-based power flow formulations and circuit-inspired optimization heuristics are used to facilitate the scalability of the method. Experiments on benchmark systems show that the method reliably tracks persistent vulnerability locations under increasing load stress and scales to large systems (on average taking around 200 s per scenario on 2000+ bus systems).
SHIELD: A Host-Independent Framework for Ransomware Detection using Deep Filesystem Features
Ransomware's escalating sophistication necessitates tamper-resistant, off-host detection solutions that capture deep disk activity beyond the reach of a compromised operating system. Existing detection systems use host/kernel signals or rely on coarse block-I/O statistics, which are easy to evade and miss filesystem semantics. The filesystem layer itself remains underexplored as a source of robust indicators for storage-controller-level defense. To address this, we present SHIELD: a Secure Host-Independent Extensible Metric Logging Framework for Tamper-Proof Detection and Real-Time Mitigation of Ransomware Threats. SHIELD parses and logs filesystem-level features that cannot be evaded or obfuscated to expose deep disk activity for real-time ML-based detection and mitigation. We evaluate the efficacy of these metrics through experiments with both binary (benign vs. malicious behavior) and multiclass (ransomware strain identification) classifiers. In evaluations across diverse ransomware families, the best binary classifier achieves 97.29% accuracy in identifying malicious disk behavior. A hardware-only feature set that excludes all transport-layer metrics retains 95.97% accuracy, confirming feasibility for FPGA/ASIC deployment within the storage controller datapath. In a proof-of-concept closed-loop deployment, SHIELD halts disk operations within tens of disk actions, limiting targeted files affected to <0.4% for zero-shot strains at small action-windows, while maintaining low false-positive rates (<3.6%) on unseen benign applications. Results demonstrate that filesystem-aware, off-host telemetry enables accurate, resilient ransomware detection, including intermittent/partial encryption, and is practical for embedded integration in storage controllers or alongside other defense mechanisms.
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Linearizability of flows by embeddings
We consider the problem of determining the class of continuous-time dynamical systems that can be globally linearized in the sense of admitting an embedding into a linear system on a higher-dimensional Euclidean space. We solve this problem for dynamical systems on connected state spaces that are either compact or contain at least one nonempty compact attractor, obtaining necessary and sufficient conditions for the existence of linearizing $C^k$ embeddings for $k\in \mathbb{N}_{\geq 0}\cup \{\infty\}$. Corollaries include (i) several checkable necessary conditions for global linearizability and (ii) extensions of the Hartman-Grobman and Floquet normal form theorems beyond the classical settings. Our results open new perspectives on linearizability by establishing relationships to symmetry, topology, and invariant manifold theory.
comment: To appear in Selecta Mathematica
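As a reading aid for the entry above, here is a standard textbook-style illustration of a linearizing embedding. It is not taken from the paper, and the symbols $\mu$, $\lambda$, $F$ and $z$ are introduced only for this sketch. The planar nonlinear system on the left is carried by the smooth embedding $F:\mathbb{R}^2\to\mathbb{R}^3$ onto an invariant surface of the linear system on the right, since $\dot{z} = 2x\dot{x} = 2\mu z$. For $\mu,\lambda<0$ the origin is a globally attracting equilibrium, so the example has a nonempty compact attractor.

```latex
\begin{aligned}
\dot{x} &= \mu x, \\
\dot{y} &= \lambda\,(y - x^{2}),
\end{aligned}
\qquad
F(x, y) = (x,\; y,\; x^{2}),
\qquad
\frac{d}{dt}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
=
\begin{pmatrix}
\mu & 0 & 0 \\
0 & \lambda & -\lambda \\
0 & 0 & 2\mu
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix},
\quad z = x^{2}.
```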
Reference Architecture of a Quantum-Centric Supercomputer
Quantum computers have demonstrated utility in simulating quantum systems beyond brute-force classical approaches. As the community builds on these demonstrations to explore using quantum computing for applied research, algorithms and workflows have emerged that require leveraging both quantum computers and classical high-performance computing (HPC) systems to scale applications, especially in chemistry and materials, beyond what either system can simulate alone. Today, these disparate systems operate in isolation, forcing users to manually orchestrate workloads, coordinate job scheduling, and transfer data between systems -- a cumbersome process that hinders productivity and severely limits rapid algorithmic exploration. These challenges motivate the need for flexible and high-performance Quantum-Centric Supercomputing (QCSC) systems that integrate Quantum Processing Units (QPUs), Graphics Processing Units (GPUs), and Central Processing Units (CPUs) to accelerate discovery of such algorithms across applications. These systems will be co-designed across quantum and classical HPC infrastructure, middleware, and application layers to accelerate the adoption of quantum computing for solving critical computational problems. We envision QCSC evolution through three distinct phases: (1) quantum systems as specialized compute offload engines within existing HPC complexes; (2) heterogeneous quantum and classical HPC systems coupled through advanced middleware, enabling seamless execution of hybrid quantum-classical algorithms; and (3) fully co-designed heterogeneous quantum-HPC systems for hybrid computational workflows. This article presents a reference architecture and roadmap for these QCSC systems.
comment: 20 pages, 5 figures, minor fixes
A Variational Latent Equilibrium for Learning in Neuronal Circuits
Brains remain unrivaled in their ability to recognize and generate complex spatiotemporal patterns. While AI is able to reproduce some of these capabilities, deep learning algorithms remain largely at odds with our current understanding of brain circuitry and dynamics. This is prominently the case for backpropagation through time (BPTT), the go-to algorithm for learning complex temporal dependencies. In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner. Our approach builds on, unifies and extends several previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment based on principles of energy conservation and extremal action. Our starting point is a prospective energy function of neuronal states, from which we calculate real-time error dynamics for time-continuous neuronal networks. In the general case, this provides a simple and straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent to BPTT. With a few modifications, we can turn this into a fully local (in space and time) set of equations for neuron and synapse dynamics. Our theory provides a rigorous framework for spatiotemporal deep learning in the brain, while simultaneously suggesting a blueprint for physical circuits capable of carrying out these computations. These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
Robust Attitude Control of Nonlinear UAV Dynamics with LFT Models and $\mathcal{H}_\infty$ Performance
Attitude stabilization of unmanned aerial vehicles (UAVs) in uncertain environments presents significant challenges due to nonlinear dynamics, parameter variations, and sensor limitations. This paper presents a comparative study of $\mathcal{H}_\infty$ and classical PID controllers for multi-rotor attitude regulation in the presence of wind disturbances and gyroscope noise. The flight dynamics are modeled using a linear parameter-varying (LPV) framework, where nonlinearities and parameter variations are systematically represented as structured uncertainties within a linear fractional transformation (LFT) formulation. A robust controller based on an $\mathcal{H}_\infty$ formulation is designed using only gyroscope measurements to ensure guaranteed performance bounds. Nonlinear simulation results demonstrate the effectiveness of the robust controllers compared to classical PID control, showing significant improvement in attitude regulation under severe wind disturbances.
comment: 6 pages, 6 figures, 3 tables, submitted to ACC 2026
Safe Landing on Small Celestial Bodies with Gravitational Uncertainty Using Disturbance Estimation and Control Barrier Functions
Soft landing on small celestial bodies (SCBs) poses unique challenges, as gravitational models poorly characterize the higher-order gravitational effects of SCBs. Existing control approaches lack guarantees for safety under gravitational uncertainty. This paper proposes a three-stage control architecture that combines disturbance estimation, trajectory tracking, and safety enforcement. An extended high-gain observer estimates gravitational disturbances online, a feedback-linearizing controller tracks a reference trajectory, and a minimum-intervention quadratic program enforces state and input constraints while remaining close to the nominal control. The proposed approach enables aggressive yet safe maneuvers despite gravitational uncertainty. Numerical simulations demonstrate the effectiveness of the controller in achieving soft-landing on irregularly shaped SCBs, highlighting its potential for autonomous SCB missions.
comment: Accepted for the 2026 American Control Conference (ACC)
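The minimum-intervention quadratic program in the Safe Landing entry above can be illustrated with a minimal sketch, assuming a single affine control barrier constraint $a^\top u + b \ge 0$, for which the QP has a closed-form solution. The vertical-descent model, the barrier $B = v + k h$, the gains `k` and `alpha`, and all function names are assumptions made for illustration. The paper's controller also handles input constraints and a full spacecraft model, which this sketch omits.

```python
from typing import List, Sequence, Tuple


def cbf_safety_filter(u_nom: Sequence[float], a: Sequence[float], b: float) -> List[float]:
    """Solve min ||u - u_nom||^2 subject to a.u + b >= 0 in closed form."""
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        # Nominal control already satisfies the constraint: no intervention.
        return list(u_nom)
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint does not depend on u; the QP is infeasible")
    # Project u_nom onto the boundary of the half-space a.u + b >= 0.
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


def descent_constraint(h: float, v: float, g: float, d_hat: float,
                       k: float = 0.5, alpha: float = 2.0) -> Tuple[List[float], float]:
    """Affine CBF constraint for the hypothetical model h' = v, v' = u - g + d.

    Barrier B = v + k*h >= 0 limits descent speed in proportion to altitude.
    The condition B' + alpha*B >= 0 becomes a.u + b >= 0, with the unknown
    disturbance d replaced by an observer estimate d_hat.
    """
    a = [1.0]
    b = -g + d_hat + k * v + alpha * (v + k * h)
    return a, b


if __name__ == "__main__":
    a, b = descent_constraint(h=10.0, v=-4.0, g=0.01, d_hat=0.0)
    u_safe = cbf_safety_filter([-1.0], a, b)  # nominal command pushes downward
    print("filtered control:", u_safe)
```

The filter leaves the nominal command untouched whenever it is already safe, which is what keeps the intervention minimal.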
ExaModelsPower.jl: A GPU-Compatible Modeling Library for Nonlinear Power System Optimization
As GPU-accelerated mathematical programming techniques mature, there is growing interest in utilizing them to address the computational challenges of power system optimization. This paper introduces ExaModelsPower.jl, an open-source modeling library for creating GPU-compatible nonlinear AC optimal power flow models. Built on ExaModels.jl, ExaModelsPower.jl provides a high-level interface that automatically generates all necessary callback functions for GPU solvers. The library is designed for large-scale problem instances, which may include multiple time periods and security constraints. Using ExaModelsPower.jl, we benchmark GPU and CPU solvers on open-source test cases. Our results show that GPU solvers can deliver up to two orders of magnitude speedups compared to alternative tools on CPU for problems with more than 20,000 variables and a solution precision of up to $10^{-4}$, while performance for smaller instances or tighter tolerances may vary.
Robotics
A gripper for flap separation and opening of sealed bags ICRA2026
Separating thin, flexible layers that must be individually grasped is a common but challenging manipulation primitive for most off-the-shelf grippers. A prominent example arises in clinical settings: the opening of sterile flat pouches for the preparation of the operating room, where the first step is to separate and grasp the flaps. We present a novel gripper design and opening strategy that enables reliable flap separation and robust seal opening. This capability addresses a high-volume repetitive hospital procedure in which nurses manually open up to 240 bags per shift, a physically demanding task linked to musculoskeletal injuries. Our design combines an active dented-roller fingertip with compliant fingers that exploit environmental constraints to robustly grasp thin flexible flaps. Experiments demonstrate that the proposed gripper reliably grasps and separates sealed bag flaps and other thin-layered materials from the hospital, the most sensitive variable affecting performance being the normal force applied. When two copies of the gripper grasp both flaps, the system withstands the forces needed to open the seals robustly. To our knowledge, this is one of the first demonstrations of robotic assistance to automate this repetitive, low-value, but critical hospital task.
comment: 8 pages, Accepted at the 2026 IEEE International Conference on Robotics & Automation (ICRA2026)
RL-Augmented MPC for Non-Gaited Legged and Hybrid Locomotion
We propose a contact-explicit hierarchical architecture coupling Reinforcement Learning (RL) and Model Predictive Control (MPC), where a high-level RL agent provides gait and navigation commands to a low-level locomotion MPC. This offloads the combinatorial burden of contact timing from the MPC by learning acyclic gaits through trial and error in simulation. We show that only a minimal set of rewards and limited tuning are required to obtain effective policies. We validate the architecture in simulation across robotic platforms spanning 50 kg to 120 kg and different MPC implementations, observing the emergence of acyclic gaits and timing adaptations in flat-terrain legged and hybrid locomotion, and further demonstrating extensibility to non-flat terrains. Across all platforms, we achieve zero-shot sim-to-sim transfer without domain randomization, and we further demonstrate zero-shot sim-to-real transfer without domain randomization on Centauro, our 120 kg wheeled-legged humanoid robot. We make our software framework and evaluation results publicly available at https://github.com/AndrePatri/AugMPC.
FG-CLTP: Fine-Grained Contrastive Language Tactile Pretraining for Robotic Manipulation
Recent advancements in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, which are indispensable for fine-grained manipulation. To bridge this gap, we propose FG-CLTP, a fine-grained contrastive language tactile pretraining framework. We first introduce a novel dataset comprising over 100k tactile 3D point cloud-language pairs that explicitly capture multidimensional contact states from the sensor's perspective. We then implement a discretized numerical tokenization mechanism to achieve quantitative-semantic alignment, effectively injecting explicit physical metrics into the multimodal feature space. The proposed FG-CLTP model yields a 95.9% classification accuracy and reduces the regression error (MAE) by 52.6% compared to state-of-the-art methods. Furthermore, the integration of 3D point cloud representations establishes a sensor-agnostic foundation with a minimal sim-to-real gap of 3.5%. Building upon this fine-grained representation, we develop a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy to enable multimodal reasoning and control. Extensive experiments demonstrate that our framework significantly outperforms strong baselines in contact-rich manipulation tasks, providing a robust and generalizable foundation for tactile-language-action models.
comment: 9 pages, 6 figures
GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments ICRA 2026
Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation-fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross-representation studies more comparable and provides a means to advance multi-robot planning research and its translation to practice.
comment: ICRA 2026, code will be released soon
Semantic Landmark Particle Filter for Robot Localisation in Vineyards IROS 2026
Reliable localisation in vineyards is hindered by row-level perceptual aliasing: parallel crop rows produce nearly identical LiDAR observations, causing geometry-only and vision-based SLAM systems to converge towards incorrect corridors, particularly during headland transitions. We present a Semantic Landmark Particle Filter (SLPF) that integrates trunk and pole landmark detections with 2D LiDAR within a probabilistic localisation framework. Detected trunks are converted into semantic walls, forming structural row boundaries embedded in the measurement model to improve discrimination between adjacent rows. GNSS is incorporated as a lightweight prior that stabilises localisation when semantic observations are sparse. Field experiments in a 10-row vineyard demonstrate consistent improvements over geometry-only (AMCL), vision-based (RTAB-Map), and GNSS baselines. Compared to AMCL, SLPF reduces Absolute Pose Error by 22% and 65% across two traversal directions; relative to a NoisyGNSS baseline, APE decreases by 65% and 61%. Row correctness improves from 0.67 to 0.73, while mean cross-track error decreases from 1.40 m to 1.26 m. These results show that embedding row-level structural semantics within the measurement model enables robust localisation in highly repetitive outdoor agricultural environments.
comment: Submitted to IROS 2026
Sublinear-Time Reconfiguration of Programmable Matter with Joint Movements
We study centralized reconfiguration problems for geometric amoebot structures. A set of $n$ amoebots occupies nodes on the triangular grid and can reconfigure via expansion and contraction operations. We focus on the joint movement extension, where amoebots may expand and contract in parallel, enabling coordinated motion of larger substructures. Prior work introduced this extension and analyzed reconfiguration under additional assumptions such as metamodules. In contrast, we investigate the intrinsic dynamics of reconfiguration without such assumptions by restricting attention to centralized algorithms, leaving distributed solutions for future work. We study the reconfiguration problem between two classes of amoebot structures $A$ and $B$: For every structure $S\in A$, the goal is to compute a schedule that reconfigures $S$ into some structure $S'\in B$. Our focus is on sublinear-time algorithms. We answer in the affirmative the open problem posed by Padalkin et al. (Auton. Robots, 2025) of whether a within-the-model sublinear-time universal reconfiguration algorithm is possible, by proving that any structure can be reconfigured into a canonical line-segment structure in $O(\sqrt{n}\log n)$ rounds. Additionally, we give a constant-time algorithm for reconfiguring any spiral structure into a line segment. These results are enabled by new constant-time primitives that facilitate efficient parallel movement. Our findings demonstrate that the joint movement model supports sublinear reconfiguration without auxiliary assumptions. A central open question is whether universal reconfiguration within this model can be achieved in polylogarithmic or even constant time.
ASTER: Attitude-aware Suspended-payload Quadrotor Traversal via Efficient Reinforcement Learning
Agile maneuvering of the quadrotor cable-suspended system is significantly hindered by its non-smooth hybrid dynamics. While model-free Reinforcement Learning (RL) circumvents explicit differentiation of complex models, achieving attitude-constrained or inverted flight remains an open challenge due to the extreme reward sparsity under strict orientation requirements. This paper presents ASTER, a robust RL framework that achieves, to our knowledge, the first successful autonomous inverted flight for the cable-suspended system. We propose hybrid-dynamics-informed state seeding (HDSS), an initialization strategy that back-propagates target configurations through physics-consistent kinematic inversions across both taut and slack cable phases. HDSS enables the policy to discover aggressive maneuvers that are unreachable via standard exploration. Extensive simulations and real-world experiments demonstrate remarkable agility, precise attitude alignment, and robust zero-shot sim-to-real transfer across complex trajectories.
MAVEN: A Meta-Reinforcement Learning Framework for Varying-Dynamics Expertise in Agile Quadrotor Maneuvers
Reinforcement learning (RL) has emerged as a powerful paradigm for achieving online agile navigation with quadrotors. Despite this success, policies trained via standard RL typically fail to generalize across significant dynamic variations, exhibiting a critical lack of adaptability. This work introduces MAVEN, a meta-RL framework that enables a single policy to achieve robust end-to-end navigation across a wide range of quadrotor dynamics. Our approach features a novel predictive context encoder, which learns to infer a latent representation of the system dynamics from interaction history. We demonstrate our method in agile waypoint traversal tasks under two challenging scenarios: large variations in quadrotor mass and severe single-rotor thrust loss. We leverage a GPU-vectorized simulator to distribute tasks across thousands of parallel environments, overcoming the long training times of meta-RL to converge in less than an hour. Through extensive experiments in both simulation and the real world, we validate that MAVEN achieves superior adaptation and agility. The policy successfully executes zero-shot sim-to-real transfer, demonstrating robust online adaptation by performing high-speed maneuvers despite mass variations of up to 66.7% and single-rotor thrust losses as severe as 70%.
FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model
Predictive foresight is important to intelligent embodied agents. Since the motor execution of a robot is intrinsically constrained by its visual perception of environmental geometry, effectively anticipating the future requires capturing this tightly coupled visuomotor interplay. While recent vision-language-action models attempt to incorporate future guidance, they struggle with this joint modeling. Existing explicit methods divert capacity to task-irrelevant visual details, whereas implicit methods relying on sparse frame pairs disrupt temporal continuity. By heavily relying on visual reconstruction, these methods become visually dominated, entangling static scene context with dynamic action intent. We argue that effective joint visuomotor predictive modeling requires both temporal continuity and visually-conditioned supervision decoupling. To this end, we propose FutureVLA, featuring a novel Joint Visuomotor Predictive Architecture. FutureVLA is designed to extract joint visuomotor embeddings by first decoupling visual and motor information, and then jointly encoding generalized physical priors. Specifically, in the pretraining stage, we leverage heterogeneous manipulation datasets and introduce a Joint Visuomotor Gating mechanism to structurally separate visual state preservation from temporal action modeling. It allows the motor stream to focus on continuous physical dynamics while explicitly querying visual tokens for environmental constraints, yielding highly generalizable joint visuomotor embeddings. Subsequently, in the post-training stage, we employ a latent embeddings alignment strategy, enabling diverse downstream VLA models to internalize these temporal priors without modifying their inference architectures. Extensive experiments demonstrate that FutureVLA consistently improves VLA frameworks.
Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming
Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic programming algorithms restricts the utilization of massively parallel computing architectures like GPUs. To bridge this gap, we introduce a fully GPU-native trajectory optimization framework that combines sequential convex programming with a consensus-based alternating direction method of multipliers. By applying a temporal splitting strategy, our algorithm decouples the optimization horizon into independent, per-node subproblems that execute massively in parallel. The entire process runs fully on the GPU, eliminating costly memory transfers and large-scale sparse factorizations. This architecture naturally scales to multi-trajectory optimization. We validate the solver on a quadrotor agile flight task and a Mars powered descent problem using an on-board edge computing platform. Benchmarks reveal a sustained 4x throughput speedup and a 51% reduction in energy consumption over a heavily optimized 12-core CPU baseline. Crucially, the framework saturates the hardware, maintaining over 96% active GPU utilization to achieve planning rates exceeding 100 Hz. Furthermore, we demonstrate the solver's extensibility to robust Model Predictive Control by jointly optimizing dynamically coupled scenarios under stochastic disturbances, enabling scalable and safe autonomy.
MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction
Autonomous vehicles rely on map information to understand the world around them. However, the creation and maintenance of offline high-definition (HD) maps remains costly. A more scalable alternative lies in online HD map construction, which only requires map annotations at training time. Self-supervised training offers a way to further reduce the amount of annotated training data required. This work focuses on improving the latent bird's-eye-view (BEV) feature grid representation within a vectorized online HD map construction model by enforcing geospatial consistency between overlapping BEV feature grids as part of a contrastive loss function. To ensure geospatial overlap for contrastive pairs, we introduce an approach to analyze the overlap between traversals within a given dataset and generate subsidiary dataset splits following adjustable multi-traversal requirements. We train the same model supervised using a reduced set of single-traversal labeled data and self-supervised on a broader unlabeled set of data following our multi-traversal requirements, effectively implementing a semi-supervised approach. Our approach outperforms the supervised baseline across the board, both quantitatively in terms of the downstream task's vectorized map perception performance and qualitatively in terms of segmentation in the principal component analysis (PCA) visualization of the BEV feature space.
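The abstract above does not specify the exact contrastive loss, so the following is only a sketch using the widely used InfoNCE form as a stand-in. It assumes that BEV cells from two traversals have already been matched by geographic position, so that `anchors[i]` and `positives[i]` describe the same map location. The function names, the temperature value, and the toy feature vectors are illustrative assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_u * norm_v)


def info_nce(anchors: List[Sequence[float]],
             positives: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss: cell i in one traversal should match cell i in the other.

    All other cells in `positives` act as negatives for anchor i.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    cells_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print("aligned loss:   ", info_nce(cells_a, cells_a))
    print("misaligned loss:", info_nce(cells_a, cells_a[1:] + cells_a[:1]))
```

The loss is low when features at the same geographic cell agree across traversals and high when they do not, which is the consistency property the abstract describes.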
OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency
Aerial vision-language navigation (AVLN) enables UAVs to follow natural-language instructions in complex 3D environments. However, existing zero-shot AVLN methods often suffer from unstable single-stream Vision-Language Model decision-making, unreliable long-horizon progress monitoring, and a trade-off between safety and efficiency. We propose OnFly, a fully onboard, real-time framework for zero-shot AVLN. OnFly adopts a shared-perception dual-agent architecture that decouples high-frequency target generation from low-frequency progress monitoring, thereby stabilizing decision-making. It further employs a hybrid keyframe-recent-frame memory to preserve global trajectory context while maintaining KV-cache prefix stability, enabling reliable long-horizon monitoring with termination and recovery signals. In addition, a semantic-geometric verifier refines VLM-predicted targets for instruction consistency and geometric safety using VLM features and depth cues, while a receding-horizon planner generates optimized collision-free trajectories under geometric safety constraints, improving both safety and efficiency. In simulation, OnFly improves task success from 26.4% to 67.8%, compared with the strongest state-of-the-art baseline, while fully onboard real-world flights validate its feasibility for real-time deployment. The code will be released at https://github.com/Robotics-STAR-Lab/OnFly
Cybo-Waiter: A Physical Agentic Framework for Humanoid Whole-Body Locomotion-Manipulation
Robots are increasingly expected to execute open-ended natural language requests in human environments, which demands reliable long-horizon execution under partial observability. This is especially challenging for humanoids because locomotion and manipulation are tightly coupled through stance, reachability, and balance. We present a humanoid agent framework that turns VLM plans into verifiable task programs and closes the loop with multi-object 3D geometric supervision. A VLM planner compiles each instruction into a typed JSON sequence of subtasks with explicit predicate-based preconditions and success conditions. Using SAM3 and RGB-D, we ground all task-relevant entities in 3D, estimate object centroids and extents, and evaluate predicates over stable frames to obtain condition-level diagnostics. The supervisor uses these diagnostics to verify subtask completion and to provide condition-level feedback for progression and replanning. We execute each subtask by coordinating humanoid locomotion and whole-body manipulation, selecting feasible motion primitives under reachability and balance constraints. Experiments on tabletop manipulation and long-horizon humanoid loco-manipulation tasks show improved robustness from multi-object grounding, temporal stability, and recovery-driven replanning.
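To make the idea of a typed JSON task program with predicate-based conditions concrete, here is a minimal sketch. The JSON schema, the predicate names `near` and `above`, their thresholds, and the three-frame stability window are all hypothetical, since the abstract does not give the actual format. Object centroids are assumed to be `(x, y, z)` tuples in metres, as a 3D grounding stage might supply them.

```python
import json
from typing import Dict, List, Tuple

Centroids = Dict[str, Tuple[float, float, float]]

# Hypothetical plan format: one subtask with a precondition and a success condition.
PLAN_JSON = """
[
  {"subtask": "pick", "target": "cup",
   "preconditions": [{"pred": "near", "args": ["robot", "cup"], "max_dist": 0.8}],
   "success": [{"pred": "above", "args": ["cup", "table"], "min_gap": 0.10}]}
]
"""


def eval_predicate(cond: dict, centroids: Centroids) -> bool:
    """Evaluate one geometric predicate on a single frame of centroids."""
    a = centroids[cond["args"][0]]
    b = centroids[cond["args"][1]]
    if cond["pred"] == "near":
        # Horizontal (x, y) distance only.
        dist = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
        return dist <= cond["max_dist"]
    if cond["pred"] == "above":
        return a[2] - b[2] >= cond["min_gap"]
    raise ValueError(f"unknown predicate: {cond['pred']}")


def check_stable(conds: List[dict], frames: List[Centroids], min_frames: int = 3) -> List[bool]:
    """Per-condition diagnostics: True only if the condition holds in each of
    the last `min_frames` frames, which filters out single-frame flicker."""
    if len(frames) < min_frames:
        return [False] * len(conds)
    recent = frames[-min_frames:]
    return [all(eval_predicate(c, f) for f in recent) for c in conds]


if __name__ == "__main__":
    plan = json.loads(PLAN_JSON)
    lifted = {"robot": (0.0, 0.0, 0.0), "cup": (0.5, 0.0, 0.95), "table": (0.5, 0.0, 0.75)}
    print(check_stable(plan[0]["success"], [lifted, lifted, lifted]))
```

Returning one boolean per condition, rather than a single pass/fail flag, is what would let a supervisor report which specific condition blocked progress.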
Dynamic Modeling and Attitude Control of a Reaction-Wheel-Based Low-Gravity Bipedal Hopper
Planetary bodies characterized by low gravitational acceleration, such as the Moon and near-Earth asteroids, impose unique locomotion constraints due to diminished contact forces and extended airborne intervals. Among traversal strategies, hopping locomotion offers high energy efficiency but is prone to mid-flight attitude instability caused by asymmetric thrust generation and uneven terrain interactions. This paper presents an underactuated bipedal hopping robot that employs an internal reaction wheel to regulate body posture during the ballistic flight phase. The system is modeled as a gyrostat, enabling analysis of the dynamic coupling between torso rotation and reaction wheel momentum. The locomotion cycle comprises three phases: a leg-driven propulsive jump, mid-air attitude stabilization via an active momentum exchange controller, and a shock-absorbing landing. A reduced-order model is developed to capture the critical coupling between torso rotation and reaction wheel dynamics. The proposed framework is evaluated in MuJoCo-based simulations under lunar gravity conditions (g = 1.625 m/s^2). Results demonstrate that activation of the reaction wheel controller reduces peak mid-air angular deviation by more than 65% and constrains landing attitude error to within 3.5 degrees at touchdown. Additionally, actuator saturation per hop cycle is reduced, ensuring sufficient control authority. Overall, the approach significantly mitigates in-flight attitude excursions and enables consistent upright landings, providing a practical and control-efficient solution for locomotion on irregular extraterrestrial terrains.
comment: Preprint. Under review
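The momentum exchange described in the hopper entry above can be sketched with a single-axis toy model. This is not the paper's reduced-order model or its MuJoCo setup. It assumes a planar torso and wheel that share one rotation axis, no external torque during flight, and a PD law, and the inertias, gains, and names are illustrative values chosen for this sketch.

```python
from typing import Tuple


def simulate_flight(theta0: float, omega0: float,
                    i_body: float = 0.5, i_wheel: float = 0.01,
                    kp: float = 20.0, kd: float = 4.0,
                    dt: float = 0.001, t_end: float = 2.0) -> Tuple[float, float, float, float]:
    """Integrate torso attitude during ballistic flight with a reaction wheel.

    theta0, omega0: initial torso pitch error (rad) and pitch rate (rad/s).
    Returns final torso angle, torso rate, wheel rate (inertial frame), and
    the drift in total angular momentum, which should stay near zero.
    """
    theta, omega_body, omega_wheel = theta0, omega0, 0.0
    momentum_start = i_body * omega_body + i_wheel * omega_wheel
    for _ in range(int(round(t_end / dt))):
        # PD torque acting on the torso; the motor applies the opposite torque to the wheel.
        tau_body = -kp * theta - kd * omega_body
        omega_body += tau_body / i_body * dt
        omega_wheel += -tau_body / i_wheel * dt
        theta += omega_body * dt
    momentum_end = i_body * omega_body + i_wheel * omega_wheel
    return theta, omega_body, omega_wheel, momentum_end - momentum_start


if __name__ == "__main__":
    theta, omega_body, omega_wheel, drift = simulate_flight(theta0=0.3, omega0=0.5)
    print(f"final torso angle: {theta:.5f} rad, wheel rate: {omega_wheel:.2f} rad/s")
```

Because the torques are internal, the torso can only be stopped by transferring its momentum to the wheel, so the wheel ends up spinning. This also indicates why actuator saturation is a relevant quantity for such a design.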
STM32-Based Smart Waste Bin for Hygienic Disposal Using Embedded Sensing and Automated Control
The increasing demand for hygienic and contactless solutions in public and private environments has encouraged the development of automated systems for everyday applications. This paper presents the design and implementation of a motion-sensing automatic waste bin using an STM32 microcontroller, ultrasonic sensors, and a servo motor. The system detects user presence through ultrasonic sensing and automatically opens the bin lid using a servo motor controlled by the microcontroller. An additional ultrasonic sensor is used to monitor the internal waste level of the bin, while an OLED display provides real-time feedback regarding system status. The proposed system offers a low-cost, reliable, and easily deployable solution for touch-free waste disposal. Experimental evaluation demonstrates fast response time, stable sensing performance, and smooth mechanical operation. The system can be effectively deployed in homes, educational institutions, hospitals, and public facilities to improve hygiene and user convenience.
comment: This paper consists of 6 pages, with 3 figures, 3 tables, and 1 algorithm
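The sensing arithmetic behind the waste bin entry above can be sketched as follows. The real firmware would run on the STM32 and is presumably written in C, so this Python version only illustrates the calculations. The 30 cm trigger distance, the 40 cm bin depth used in the example, and all function names are assumptions and do not come from the paper.

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # about 343 m/s in air at roughly 20 degrees C


def echo_to_distance_cm(echo_us: float) -> float:
    """Convert a round-trip ultrasonic echo time (microseconds) to distance (cm)."""
    # Divide by two because the pulse travels to the object and back.
    return echo_us * SPEED_OF_SOUND_CM_PER_US / 2.0


def lid_should_open(user_distance_cm: float, threshold_cm: float = 30.0) -> bool:
    """Open the lid when something is detected within the trigger distance."""
    return user_distance_cm <= threshold_cm


def fill_percent(distance_to_waste_cm: float, bin_depth_cm: float) -> float:
    """Estimate fill level from the lid-mounted sensor's distance to the waste surface."""
    level = (bin_depth_cm - distance_to_waste_cm) / bin_depth_cm * 100.0
    # Clamp to guard against noisy readings outside the physical range.
    return max(0.0, min(100.0, level))


if __name__ == "__main__":
    distance = echo_to_distance_cm(1000.0)
    print(f"user at {distance:.2f} cm, open lid: {lid_should_open(distance)}")
    print(f"fill level: {fill_percent(10.0, 40.0):.0f}%")
```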
Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions
Task and Motion Planning combines high-level task sequencing (what to do) with low-level motion planning (how to do it) to generate feasible, collision-free execution plans. However, in many real-world domains, such as automated warehouses, tasks are predefined, shifting the challenge to if, when, and how to execute them safely and efficiently under resource, time and motion constraints. In this paper, we formalize this as the Scheduling and Motion Planning problem for multi-object navigation in shared workspaces. We propose a novel solution framework that interleaves off-the-shelf schedulers and motion planners in an incremental learning loop. The scheduler generates candidate plans, while the motion planner checks feasibility and returns symbolic feedback, i.e., spatial conflicts and timing adjustments, to guide the scheduler towards motion-feasible solutions. We validate our proposal on logistics and job-shop scheduling benchmarks augmented with motion tasks, using state-of-the-art schedulers and sampling-based motion planners. Our results show the effectiveness of our framework in generating valid plans under complex temporal and spatial constraints, where synchronized motion is critical.
AdaClearGrasp: Learning Adaptive Clearing for Zero-Shot Robust Dexterous Grasping in Densely Cluttered Environments
In densely cluttered environments, physical interference, visual occlusions, and unstable contacts often cause direct dexterous grasping to fail, while aggressive singulation strategies may compromise safety. Enabling robots to adaptively decide whether to clear surrounding objects or directly grasp the target is therefore crucial for robust manipulation. We propose AdaClearGrasp, a closed-loop decision-execution framework for adaptive clearing and zero-shot dexterous grasping in densely cluttered environments. The framework formulates manipulation as a controllable high-level decision process that determines whether to directly grasp the target or first clear surrounding objects. A pretrained vision-language model (VLM) interprets visual observations and language task descriptions to reason about grasp interference and generate a high-level planning skeleton, which invokes structured atomic skills through a unified action interface. For dexterous grasping, we train a reinforcement learning policy with a relative hand-object distance representation, enabling zero-shot generalization across diverse object geometries and physical properties. During execution, visual feedback monitors outcomes and triggers replanning upon failures, forming a closed-loop correction mechanism. To evaluate language-conditioned dexterous grasping in clutter, we introduce Clutter-Bench, the first simulation benchmark with graded clutter complexity. It includes seven target objects across three clutter levels, yielding 210 task scenarios. We further perform sim-to-real experiments on three objects under three clutter levels (18 scenarios). Results demonstrate that AdaClearGrasp significantly improves grasp success rates in densely cluttered environments. For more videos and code, please visit our project website: https://chenzixuan99.github.io/adaclear-grasp.github.io/.
comment: 12 pages. Under review
Learning Bimanual Cloth Manipulation with Vision-based Tactile Sensing via Single Robotic Arm
Robotic cloth manipulation remains challenging due to the high-dimensional state space of fabrics, their deformable nature, and frequent occlusions that limit vision-based sensing. Although dual-arm systems can mitigate some of these issues, they increase hardware and control complexity. This paper presents Touch G.O.G., a compact vision-based tactile gripper and perception/control framework for single-arm bimanual cloth manipulation. The proposed framework combines three key components: (1) a novel gripper design and control strategy for in-gripper cloth sliding with a single robot arm, (2) a Vision Foundation Model-backboned Vision Transformer pipeline for cloth part classification (PC-Net) and edge pose estimation (PE-Net) using real and synthetic tactile images, and (3) an encoder-decoder synthetic data generator (SD-Net) that reduces manual annotation by producing high-fidelity tactile images. Experiments show 96% accuracy in distinguishing edges, corners, interior regions, and grasp failures, together with sub-millimeter edge localization and 4.5° orientation error. Real-world results demonstrate reliable cloth unfolding, even for crumpled fabrics, using only a single robotic arm. These results highlight Touch G.O.G. as a compact and cost-effective solution for deformable object manipulation.
comment: 11 pages, 13 figures
Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction CVPR 2026
Trajectory prediction is critical for autonomous driving, enabling safe and efficient planning in dense, dynamic traffic. Most existing methods optimize prediction accuracy under fixed-length observations. However, real-world driving often yields variable-length, incomplete observations, posing a challenge to these methods. A common strategy is to directly map features from incomplete observations to those from complete ones. This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a Progressive Retrospective Framework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. Each unit consists of a Retrospective Distillation Module (RDM) and a Retrospective Prediction Module (RPM), where RDM distills features and RPM recovers previous timesteps using the distilled features. Moreover, we propose a Rolling-Start Training Strategy (RSTS) that enhances data efficiency during PRF training. PRF is plug-and-play with existing methods. Extensive experiments on the Argoverse 2 and Argoverse 1 datasets demonstrate the effectiveness of PRF. Code is available at https://github.com/zhouhao94/PRF.
comment: Paper is accepted by CVPR 2026
Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion
We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints. Marigold-SSD achieves significantly faster inference with a training cost of only 4.5 GPU days. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels. Page: https://dtu-pas.github.io/marigold-ssd/
Safety-critical Control Under Partial Observability: Reach-Avoid POMDP meets Belief Space Control
Partially Observable Markov Decision Processes (POMDPs) provide a principled framework for robot decision-making under uncertainty. Solving reach-avoid POMDPs, however, requires coordinating three distinct behaviors: goal reaching, safety, and active information gathering to reduce uncertainty. Existing online POMDP solvers attempt to address all three within a single belief tree search, but this unified approach struggles with the conflicting time scales inherent to these objectives. We propose a layered, certificate-based control architecture that operates directly in belief space, decoupling goal reaching, information gathering, and safety into modular components. We introduce Belief Control Lyapunov Functions (BCLFs) that formalize information gathering as a Lyapunov convergence problem in belief space, and show how they can be learned via reinforcement learning. For safety, we develop Belief Control Barrier Functions (BCBFs) that leverage conformal prediction to provide probabilistic safety guarantees over finite horizons. The resulting control synthesis reduces to lightweight quadratic programs solvable in real time, even for non-Gaussian belief representations with dimension $>10^4$. Experiments in simulation and on a space-robotics platform demonstrate real-time performance and improved safety and task success compared to state-of-the-art constrained POMDP solvers.
TacLoc: Global Tactile Localization on Objects from a Registration Perspective
Pose estimation is essential for robotic manipulation, particularly when visual perception is occluded during gripper-object interactions. Existing tactile-based methods generally rely on tactile simulation or pre-trained models, which limits their generalizability and efficiency. In this study, we propose TacLoc, a novel tactile localization framework that formulates the problem as a one-shot point cloud registration task. TacLoc introduces a graph-theoretic partial-to-full registration method, leveraging dense point clouds and surface normals from tactile sensing for efficient and accurate pose estimation. Without requiring rendered data or pre-trained models, TacLoc achieves improved performance through normal-guided graph pruning and a hypothesis-and-verification pipeline. TacLoc is evaluated extensively on the YCB dataset. We further demonstrate TacLoc on real-world objects across two different visual-tactile sensors.
comment: 8 pages, 12 figures
BinWalker: Development and Field Evaluation of a Quadruped Manipulator Platform for Sustainable Litter Collection
Litter pollution represents a growing environmental problem affecting natural and urban ecosystems worldwide. Waste discarded in public spaces often accumulates in areas that are difficult to access, such as uneven terrains, coastal environments, parks, and roadside vegetation. Over time, these materials degrade and release harmful substances, including toxic chemicals and microplastics, which can contaminate soil and water and pose serious threats to wildlife and human health. Despite increasing awareness of the problem, litter collection is still largely performed manually by human operators, making large-scale cleanup operations labor-intensive, time-consuming, and costly. Robotic solutions have the potential to support and partially automate environmental cleanup tasks. In this work, we present a quadruped robotic system designed for autonomous litter collection in challenging outdoor scenarios. The robot combines the mobility advantages of legged locomotion with a manipulation system consisting of a robotic arm and an onboard litter container. This configuration enables the robot to detect, grasp, and store litter items while navigating through uneven terrains. The proposed system aims to demonstrate the feasibility of integrating perception, locomotion, and manipulation on a legged robotic platform for environmental cleanup tasks. Experimental evaluations conducted in outdoor scenarios highlight the effectiveness of the approach and its potential for assisting large-scale litter removal operations in environments that are difficult to reach with traditional robotic platforms. The code associated with this work can be found at: https://github.com/iit-DLSLab/trash-collection-isaaclab.
Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation
Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7-1.8 m/s and on $\pm$ 6$^{\circ}$ grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.
comment: 12 pages, 5 figures
DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference
Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on end-effector dynamics. The framework requires no model modification, generalizing across diverse VLA architectures. On the LIBERO benchmark, DepthCache achieves up to 1.28x inference speedup with less than 1% average success rate degradation across three VLA models (pi_0.5, OpenVLA, GR00T), whereas pruning and merging baselines incur 4--24% degradation at comparable compression. Real-world experiments on a physical manipulator demonstrate that DepthCache enables faster task throughput and more responsive closed-loop control in latency-sensitive scenarios.
comment: 8 pages, 6 figures
SUBTA: A Framework for Supported User-Guided Bimanual Teleoperation in Structured Assembly ICRA 2026
In human-robot collaboration, shared autonomy enhances human performance through precise, intuitive support. Effective robotic assistance requires accurately inferring human intentions and understanding task structures to determine optimal support timing and methods. In this paper, we present SUBTA, a supported teleoperation system for bimanual assembly that couples learned intention estimation, scene-graph task planning, and context-dependent motion assists. We validate our approach through a user study (N=12) comparing standard teleoperation, motion-support only, and SUBTA. Linear mixed-effects analysis revealed that SUBTA significantly outperformed standard teleoperation in position accuracy (p<0.001, d=1.18) and orientation accuracy (p<0.001, d=1.75), while reducing mental demand (p=0.002, d=1.34). Post-experiment ratings indicate clearer, more trustworthy visual feedback and predictable interventions in SUBTA. The results demonstrate that SUBTA greatly improves both effectiveness and user experience in teleoperation.
comment: 8 pages, 7 figures, accepted at ICRA 2026
FAR-Dex: Few-shot Data Augmentation and Adaptive Residual Policy Refinement for Dexterous Manipulation ICRA
Achieving human-like dexterous manipulation through the collaboration of multi-fingered hands with robotic arms remains a longstanding challenge in robotics, primarily due to the scarcity of high-quality demonstrations and the complexity of high-dimensional action spaces. To address these challenges, we propose FAR-Dex, a hierarchical framework that integrates few-shot data augmentation with adaptive residual refinement to enable robust and precise arm-hand coordination in dexterous tasks. First, FAR-DexGen leverages the IsaacLab simulator to generate diverse and physically constrained trajectories from a few demonstrations, providing a data foundation for policy training. Second, FAR-DexRes introduces an adaptive residual module that refines policies by combining multi-step trajectory segments with observation features, thereby enhancing accuracy and robustness in manipulation scenarios. Experiments in both simulation and the real world demonstrate that FAR-Dex improves data quality by 13.4% and task success rates by 7% over state-of-the-art methods. It further achieves over 80% success in real-world tasks, enabling fine-grained dexterous manipulation with strong positional generalization.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control
Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. However, their potential has not been fully explored in the literature. To bridge the gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at https://dit4dit.github.io/.
comment: https://dit4dit.github.io/
KnowDiffuser: A Knowledge-Guided Diffusion Planner with LM Reasoning and Prior-Informed Trajectory Initialization
Recent advancements in Language Models (LMs) have demonstrated strong semantic reasoning capabilities, enabling their application in high-level decision-making for autonomous driving (AD). However, LMs operate over discrete token spaces and lack the ability to generate continuous, physically feasible trajectories required for motion planning. Meanwhile, diffusion models have proven effective at generating reliable and dynamically consistent trajectories, but often lack semantic interpretability and alignment with scene-level understanding. To address these limitations, we propose KnowDiffuser, a knowledge-guided motion planning framework that tightly integrates the semantic understanding of language models with the generative power of diffusion models. The framework employs a language model to infer context-aware meta-actions from structured scene representations, which are then mapped to prior trajectories that anchor the subsequent denoising process. A two-stage truncated denoising mechanism refines these trajectories efficiently, preserving both semantic alignment and physical feasibility. Experiments on the nuPlan benchmark demonstrate that KnowDiffuser significantly outperforms existing planners in both open-loop and closed-loop evaluations, establishing a robust and interpretable framework that effectively bridges the semantic-to-physical gap in AD systems.
comment: 10 pages, 1 figure
AsyncMDE: Real-Time Monocular Depth Estimation via Asynchronous Spatial Memory
Foundation-model-based monocular depth estimation offers a viable alternative to active sensors for robot perception, yet its computational cost often prohibits deployment on edge platforms. Existing methods perform independent per-frame inference, ignoring the substantial computational redundancy between adjacent viewpoints in continuous robot operation. This paper presents AsyncMDE, an asynchronous depth perception system consisting of a foundation model and a lightweight model that amortizes the foundation model's computational cost over time. The foundation model produces high-quality spatial features in the background, while the lightweight model runs asynchronously in the foreground, fusing cached memory with current observations through complementary fusion, outputting depth estimates, and autoregressively updating the memory. This enables cross-frame feature reuse with bounded accuracy degradation. At a mere 3.83M parameters, it operates at 237 FPS on an RTX 4090, recovering 77% of the accuracy gap to the foundation model while achieving a 25x parameter reduction. Validated across indoor static, dynamic, and synthetic extreme-motion benchmarks, AsyncMDE degrades gracefully between refreshes and achieves 161 FPS on a Jetson AGX Orin with TensorRT, clearly demonstrating its feasibility for real-time edge deployment.
comment: 8 pages, 5 figures, 5 tables
COHORT: Hybrid RL for Collaborative Large DNN Inference on Multi-Robot Systems Under Real-Time Constraints
Large deep neural networks (DNNs), especially transformer-based and multimodal architectures, are computationally demanding and challenging to deploy on resource-constrained edge platforms like field robots. These challenges intensify in mission-critical scenarios (e.g., disaster response), where robots must collaborate under tight constraints on bandwidth, latency, and battery life, often without infrastructure or server support. To address these limitations, we present COHORT, a collaborative DNN inference and task-execution framework for multi-robot systems built on the Robot Operating System (ROS). COHORT employs a hybrid offline-online reinforcement learning (RL) strategy to dynamically schedule and distribute DNN module execution across robots. Our key contributions are threefold: (a) offline RL policy learning combined with Advantage-Weighted Regression (AWR), trained on auction-based task allocation data from heterogeneous DNN workloads across distributed robots, (b) online policy adaptation via Multi-Agent PPO (MAPPO), initialized from the offline policy and fine-tuned in real time, and (c) comprehensive evaluation of COHORT on vision-language model (VLM) inference tasks such as CLIP and SAM, analyzing scalability with increasing robot/workload and robustness under . We benchmark COHORT against genetic algorithms and multiple RL baselines. Experimental results demonstrate that COHORT reduces battery consumption by 15.4% and increases GPU utilization by 51.67%, while satisfying frame-rate and deadline constraints 2.55 times of the time.
comment: Recently accepted at the 27th IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (IEEE WoWMoM 2026)
Rethinking Gaussian Trajectory Predictors: Calibrated Uncertainty for Safe Planning
Accurate trajectory prediction is critical for safe autonomous navigation in crowded environments. While many trajectory predictors output Gaussian distributions to represent the multi-modal distribution over future pedestrian positions, the reliability of their confidence levels often remains unaddressed. This limitation can lead to unsafe or overly conservative motion planning when the predictor is integrated with an uncertainty-aware planner. Existing Gaussian trajectory predictors primarily rely on the Negative Log-Likelihood loss, which is prone to producing over- or under-confident distributions and may compromise downstream planner safety. This paper introduces a novel loss function for calibrating prediction uncertainty, which leverages Kernel Density Estimation to estimate the empirical distribution of confidence levels. The proposed formulation enforces consistency with the properties of a Gaussian assumption by explicitly matching the estimated empirical distribution to the Chi-squared distribution. To ensure accurate mean prediction, a Mean Squared Error term is also incorporated in the final loss formulation. Experimental results on real-world trajectory datasets show that our method significantly improves the reliability of confidence levels predicted by different state-of-the-art Gaussian trajectory predictors. We also demonstrate the importance of providing planners with reliable probabilistic insights (i.e., calibrated confidence levels) for collision-free navigation in complex scenarios. For this purpose, we integrate Gaussian trajectory predictors trained with our loss function with an uncertainty-aware Model Predictive Control on scenarios extracted from real-world datasets, achieving improved planning performance through calibrated confidence levels.
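The calibration target mentioned in this abstract rests on a standard fact: if a 2D Gaussian prediction is well calibrated, the squared Mahalanobis distance between the observed position and the predicted mean follows a Chi-squared distribution with 2 degrees of freedom, whose CDF is 1 - exp(-x/2). The Python sketch below only checks empirical coverage against that CDF on synthetic data. It is not the paper's KDE-based loss, and all function names are hypothetical.

```python
import math
import random


def mahalanobis_sq(err, cov):
    """Squared Mahalanobis distance of a 2D error under a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    ex, ey = err
    # err^T * inverse(cov) * err, with the 2x2 inverse written out by hand.
    return (ex * (d * ex - b * ey) + ey * (a * ey - c * ex)) / det


def chi2_cdf_2dof(x):
    """CDF of the Chi-squared distribution with 2 degrees of freedom."""
    return 1.0 - math.exp(-x / 2.0)


def empirical_coverage(errors, covs, level):
    """Fraction of samples that fall inside the predicted `level` region."""
    inside = sum(
        chi2_cdf_2dof(mahalanobis_sq(e, c)) <= level
        for e, c in zip(errors, covs)
    )
    return inside / len(errors)


def demo(n=20000, seed=0):
    """Synthetic check (assumed data): true errors are unit-variance Gaussian."""
    rng = random.Random(seed)
    errors = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    calibrated = [((1.0, 0.0), (0.0, 1.0))] * n       # matches the true spread
    overconfident = [((0.25, 0.0), (0.0, 0.25))] * n  # predicted spread too small
    return (
        empirical_coverage(errors, calibrated, 0.9),
        empirical_coverage(errors, overconfident, 0.9),
    )
```

For a calibrated predictor the coverage at the 90% level should land near 0.9, while the over-confident covariance yields a much lower value. That gap is the kind of mismatch a calibration loss is meant to penalize.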
Shape Control of a Planar Hyper-Redundant Robot via Hybrid Kinematics-Informed and Learning-based Approach
Hyper-redundant robots offer high dexterity, making them good at operating in confined and unstructured environments. To extend the reachable workspace, we built a multi-segment flexible rack actuated planar robot. However, the compliance of the flexible mechanism introduces instability, rendering it sensitive to external and internal uncertainties. To address these limitations, we propose a hybrid kinematics-informed and learning-based shape control method, named SpatioCoupledNet. The neural network adopts a hierarchical design that explicitly captures bidirectional spatial coupling between segments while modeling local disturbance along the robot body. A confidence-gating mechanism integrates prior kinematic knowledge, allowing the controller to adaptively balance model-based and learned components for improved convergence and fidelity. The framework is validated on a five-segment planar hyper-redundant robot under three representative shape configurations. Experimental results demonstrate that the proposed method consistently outperforms both analytical and purely neural controllers. In complex scenarios, it reduces steady-state error by up to 75.5% against the analytical model, and accelerates convergence by up to 20.5% compared to the data-driven baseline. Furthermore, gating analysis reveals a state-dependent authority fusion, shifting toward data-driven predictions in unstable states, while relying on physical priors in the remaining cases. Finally, we demonstrate robust performance in a dynamic task where the robot maintains a fixed end-effector position while avoiding moving obstacles, achieving a precise tip-positioning accuracy with a mean error of 10.47 mm.
Safe Probabilistic Planning for Human-Robot Interaction using Conformal Risk Control
In this paper, we present a novel probabilistic safe control framework for human-robot interaction that combines control barrier functions (CBFs) with conformal risk control to provide formal safety guarantees while considering complex human behavior. The approach uses conformal risk control to quantify and control the prediction errors in CBF safety values and establishes formal guarantees on the probability of constraint satisfaction during interaction. We introduce an algorithm that dynamically adjusts the safety margins produced by conformal risk control based on the current interaction context. Through experiments on human-robot navigation scenarios, we demonstrate that our approach significantly reduces collision rates and safety violations as compared to baseline methods while maintaining high success rates in goal-reaching tasks and efficient control. The code, simulations, and other supplementary material can be found on the project website: https://jakeagonzales.github.io/crc-cbf-website/.
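As a rough illustration of how a calibrated margin can tighten a barrier constraint, the Python sketch below uses the split conformal prediction quantile, a simpler relative of the conformal risk control named in the abstract. The score definition (absolute error between predicted and observed CBF values), the function names, and the way the margin is subtracted are all assumptions made for illustration, not the authors' algorithm.

```python
import math


def conformal_quantile(scores, alpha):
    """Split conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the
    requested miscoverage level alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def tightened_cbf_value(h_pred, margin):
    """Shrink the predicted barrier value by the calibrated margin.

    Enforcing tightened_cbf_value(...) >= 0 is more conservative than
    enforcing h_pred >= 0 directly.
    """
    return h_pred - margin


# Hypothetical calibration scores: |h_observed - h_predicted| on held-out data.
calibration_scores = [0.05, 0.2, 0.1, 0.4, 0.3, 0.15, 0.25]
margin = conformal_quantile(calibration_scores, alpha=0.25)
```

A larger calibration set or a larger `alpha` gives a smaller margin. The context-dependent margin adjustment described in the abstract would sit on top of a quantity like this one.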
ScanDP: Generalizable 3D Scanning with Diffusion Policy
Learning-based 3D scanning plays a crucial role in enabling efficient and accurate scanning of target objects. However, recent reinforcement learning-based methods often require large-scale training data and still struggle to generalize to unseen object categories. In this work, we propose a data-efficient 3D scanning framework that uses Diffusion Policy to imitate human-like scanning strategies. To enhance robustness and generalization, we adopt Occupancy Grid Mapping instead of direct point cloud processing, offering improved noise resilience and handling of diverse object geometries. We also introduce a hybrid approach combining a sphere-based space representation with a path optimization procedure that ensures path safety and scanning efficiency. This approach addresses limitations in conventional imitation learning, such as redundant or unpredictable behavior. We evaluate our method on diverse objects unseen in both shape and scale. Our method achieves higher coverage and shorter paths than baselines, while remaining robust to sensor noise. We further confirm practical feasibility and stable operation in real-world execution.
comment: 8 pages, 7 figures, 5 tables. Project Page: https://treeitsuki.github.io/ScanDP/
Few-Shot Adaptation to Non-Stationary Environments via Latent Trend Embedding for Robotics
Robotic systems operating in real-world environments often suffer from concept shift, where the input-output relationship changes due to latent environmental factors that are not directly observable. Conventional adaptation methods update model parameters, which may cause catastrophic forgetting and incur high computational cost. This paper proposes a latent Trend ID-based framework for few-shot adaptation in non-stationary environments. Instead of modifying model weights, a low-dimensional environmental state, referred to as the Trend ID, is estimated via backpropagation while the model parameters remain fixed. To prevent overfitting caused by per-sample latent variables, we introduce temporal regularization and a state transition model that enforces smooth evolution of the latent space. Experiments on a quantitative food grasping task demonstrate that the learned Trend IDs are distributed across distinct regions of the latent space with temporally consistent trajectories, and that few-shot adaptation to unseen environments is achieved without modifying model parameters. The proposed framework provides a scalable and interpretable solution for robotics applications operating across diverse and evolving environments.
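The core mechanism described here, estimating a latent state by gradient descent while the model weights stay frozen, can be sketched with a toy scalar model. Everything in the Python example below is an assumption for illustration: the linear model, the scalar latent `z`, the quadratic penalty toward the previous latent (a stand-in for the abstract's temporal regularization and state transition model), and all names.

```python
def predict(x, z, w=2.0, b=0.5, u=1.0):
    """Frozen toy model: w, b and u are never updated, only the latent z varies."""
    return w * x + b + u * z


def adapt_latent(samples, z_prev, lam=0.0, lr=0.1, steps=200, u=1.0):
    """Fit the latent z to a few (x, y) samples by plain gradient descent.

    Loss: mean squared prediction error + lam * (z - z_prev) ** 2.
    The second term discourages abrupt jumps away from the previous latent.
    """
    z = z_prev
    n = len(samples)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to z.
        grad = sum(2.0 * (predict(x, z, u=u) - y) * u for x, y in samples) / n
        # Gradient of the temporal penalty.
        grad += 2.0 * lam * (z - z_prev)
        z -= lr * grad
    return z


# Three samples from an "unseen environment" whose true latent is 2.0.
few_shot = [(x, predict(x, 2.0)) for x in (0.0, 1.0, 2.0)]
z_free = adapt_latent(few_shot, z_prev=0.0)             # no regularization
z_smooth = adapt_latent(few_shot, z_prev=0.0, lam=1.0)  # pulled toward z_prev
```

Without the penalty the estimate recovers the true latent. With `lam=1.0` it settles halfway between the previous latent and the data-driven value, which shows the trade-off between fitting new samples and temporal smoothness.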
Adaptive Manipulation Potential and Haptic Estimation for Tool-Mediated Interaction
Achieving human-level dexterity in contact-rich, tool-mediated manipulation remains a significant challenge due to visual occlusion and the underdetermined nature of haptic sensing. This paper introduces a parameterized Equilibrium Manifold (EM) as a unified representation for tool-mediated interaction, and develops a closed-loop framework that integrates haptic estimation, online planning, and adaptive stiffness control. We establish a physical-geometric duality using an adaptive manipulation potential incorporating a differentiable contact model, which induces the manifold's geometric structure and ensures that complex physical interactions are encapsulated as continuous operations on the EM. Within this framework, we reformulate haptic estimation as a manifold parameter estimation problem. Specifically, a hybrid inference strategy (haptic SLAM) is employed in which discrete object shapes are classified via particle filtering, while the continuous object pose is estimated using analytical gradients for efficient optimization. By continuously updating the parameters of the manipulation potential, the framework dynamically reshapes the induced EM to guide online trajectory replanning and implement uncertainty-aware impedance control, thereby closing the perception-action loop. The system is validated through simulation and over 260 real-world screw-loosening trials. Experimental results demonstrate robust identification and manipulation success in standard scenarios while maintaining accurate tracking. Furthermore, ablation studies confirm that haptic SLAM and uncertainty-aware stiffness modulation outperform fixed impedance baselines, effectively preventing jamming during tight tolerance interactions.
Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation
Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process--combining cross-validation and spatial disambiguation--to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.
comment: 7 pages, 4 figures, 3 tables
PC-Diffuser: Path-Consistent Capsule CBF Safety Filtering for Diffusion-Based Trajectory Planner
Autonomous driving in complex traffic requires planners that generalize beyond hand-crafted rules, motivating data-driven approaches that learn behavior from expert demonstrations. Diffusion-based trajectory planners have recently shown strong closed-loop performance by iteratively denoising a full-horizon plan, but they remain difficult to certify and can fail catastrophically in rare or out-of-distribution scenarios. To address this challenge, we present PC-Diffuser, a safety augmentation framework that embeds a certifiable, path-consistent barrier-function structure directly into the denoising loop of diffusion planning. The key idea is to make safety an intrinsic part of trajectory generation rather than a post-hoc fix: we enforce forward invariance along the rollout while preserving the diffusion model's intended path geometry. Specifically, PC-Diffuser (i) evaluates collision risk using a capsule-distance barrier function that better reflects vehicle geometry and reduces unnecessary conservativeness, (ii) converts denoised waypoints into dynamically feasible motion under a kinematic bicycle model, and (iii) applies a path-consistent safety filter that eliminates residual constraint violations without geometric distortion, so the corrected plan remains close to the learned distribution. By injecting these safety-consistent corrections at every denoising step and feeding the refined trajectory back into the diffusion process, PC-Diffuser enables iterative, context-aware safeguarding instead of post-hoc repair...
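Two building blocks named in this abstract, the kinematic bicycle model and a capsule-shaped distance measure, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's formulation: the obstacle is a disc rather than another capsule, the capsule is centred on the vehicle reference point, integration is forward Euler, and all names and parameters are hypothetical.

```python
import math


def bicycle_step(x, y, yaw, v, steer, wheelbase, dt):
    """One forward-Euler step of the kinematic bicycle model."""
    x_next = x + v * math.cos(yaw) * dt
    y_next = y + v * math.sin(yaw) * dt
    yaw_next = yaw + v / wheelbase * math.tan(steer) * dt
    return x_next, y_next, yaw_next


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment between a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment's line and clamp to the segment ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def capsule_barrier(x, y, yaw, half_len, radius, obstacle, obstacle_radius):
    """Signed clearance between a vehicle capsule and a disc obstacle.

    The capsule is a segment of length 2 * half_len along the heading,
    inflated by `radius`. A positive value means no overlap.
    """
    cx, cy = math.cos(yaw), math.sin(yaw)
    rear = (x - half_len * cx, y - half_len * cy)
    front = (x + half_len * cx, y + half_len * cy)
    gap = point_segment_distance(obstacle, rear, front)
    return gap - radius - obstacle_radius
```

Compared with a single bounding circle, the capsule hugs an elongated vehicle more closely, so an obstacle beside the vehicle registers more clearance. This is one way to read the abstract's remark about reduced conservativeness.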
SteadyTray: Learning Object Balancing Tasks in Humanoid Tray Transport via Residual Reinforcement Learning
Stabilizing unsecured payloads against the inherent oscillations of dynamic bipedal locomotion remains a critical engineering bottleneck for humanoids in unstructured environments. To solve this, we introduce ReST-RL, a hierarchical reinforcement learning architecture that explicitly decouples locomotion from payload stabilization, evaluated via the SteadyTray benchmark. Rather than relying on monolithic end-to-end learning, our framework integrates a robust base locomotion policy with a dynamic residual module engineered to actively cancel gait-induced perturbations at the end-effector. This architectural separation ensures steady tray transport without degrading the underlying bipedal stability. In simulation, the residual design significantly outperforms end-to-end baselines in gait smoothness and orientation accuracy, achieving a 96.9% success rate in variable velocity tracking and 74.5% robustness against external force disturbances. Successfully deployed on the Unitree G1 humanoid hardware, this modular approach demonstrates highly reliable zero-shot sim-to-real generalization across various objects and external force disturbances.
comment: Project website: https://steadytray.github.io/
DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
comment: 18 pages, 10 figures
PPGuide: Steering Diffusion Policies with Performance Predictive Guidance ICRA'26
Diffusion policies have been shown to be very efficient at learning complex, multi-modal behaviors for robotic manipulation. However, errors in generated action sequences can compound over time, potentially leading to failure. Some approaches mitigate this by augmenting datasets with expert demonstrations or by learning predictive world models, which can be computationally expensive. We introduce Performance Predictive Guidance (PPGuide), a lightweight, classifier-based framework that steers a pre-trained diffusion policy away from failure modes at inference time. PPGuide makes use of a novel self-supervised process: it uses attention-based multiple instance learning to automatically estimate which observation-action chunks from the policy's rollouts are relevant to success or failure. We then train a performance predictor on this self-labeled data. During inference, this predictor provides a real-time gradient to guide the policy toward more robust actions. We validated our proposed PPGuide across a diverse set of tasks from the Robomimic and MimicGen benchmarks, demonstrating consistent improvements in performance.
comment: Accepted by ICRA'26
Learning Adaptive Force Control for Contact-Rich Sample Scraping with Heterogeneous Materials IROS
The increasing demand for accelerated scientific discovery, driven by global challenges, highlights the need for advanced AI-driven robotics. Deploying robotic chemists in human-centric labs is key for the next horizon of autonomous discovery, as complex tasks still demand the dexterity of human scientists. Robotic manipulation in this context is uniquely challenged by handling diverse chemicals (granular, powdery, or viscous liquids), under varying lab conditions. For example, humans use spatulas for scraping materials from vial walls. Automating this process is challenging because it goes beyond simple robotic insertion tasks and traditional lab automation, requiring the execution of fine-granular movements within a constrained environment (the sample vial). Our work proposes an adaptive control framework to address this, relying on a low-level Cartesian impedance controller for stable and compliant physical interaction and a high-level reinforcement learning agent that learns to dynamically adjust interaction forces at the end-effector. The agent is guided by perception feedback, which provides the material's location. We first created a task-representative simulation environment with a Franka Research 3 robot, a scraping tool, and a sample vial containing heterogeneous materials. To facilitate the learning of an adaptive policy and model diverse characteristics, the sample is modelled as a collection of spheres, where each sphere is assigned a unique dislodgement force threshold, which is procedurally generated using Perlin noise. We train an agent to autonomously learn and adapt the optimal contact wrench for a sample scraping task in simulation and then successfully transfer this policy to a real robotic setup. Our method was evaluated across five different material setups, outperforming a fixed-wrench baseline by an average of 10.9%.
comment: 8 pages, 6 figures, 4 tables; Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026
Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation
Deep reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to define an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. Project page: https://contact-coverage-guided-exploration.github.io.
comment: 16 pages
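The contact counter described in the CCGE abstract can be pictured with a small count-based bonus. This is a minimal sketch under stated assumptions: the abstract does not give the reward formula, so the common `1 / sqrt(n)` visitation bonus is used here, and `state_code` is a hypothetical stand-in for the learned hash code of the discretized object state.

```python
import math
from collections import defaultdict

class ContactCounter:
    """Counts (object-state code, finger, object region) contacts and returns a novelty bonus."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reward(self, state_code, finger, region):
        # Assumed bonus form: 1 / sqrt(visit count), so repeated contacts pay less.
        key = (state_code, finger, region)
        self.counts[key] += 1
        return 1.0 / math.sqrt(self.counts[key])
```

A first contact of the index finger with a given region yields a bonus of 1.0, and repeating the same contact in the same object state yields progressively smaller bonuses. This pushes the policy toward finger-region pairs it has not yet tried.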
STADA: Specification-based Testing for Autonomous Driving Agents
Simulation-based testing has become a standard approach to validating autonomous driving agents prior to real-world deployment. A high-quality validation campaign will exercise an agent in diverse contexts comprised of varying static environments, e.g., lanes, intersections, signage, and dynamic elements, e.g., vehicles and pedestrians. To achieve this, existing test generation techniques rely on template-based, manually constructed, or random scenario generation. When applied to validate formally specified safety requirements, such methods either require significant human effort or run the risk of missing important behavior related to the requirement. To address this gap, we present STADA, a Specification-based Test generation framework for Autonomous Driving Agents that systematically generates the space of scenarios defined by a formal specification expressed in temporal logic (LTLf). Given a specification, STADA constructs all distinct initial scenes, a diverse space of continuations of those scenes, and simulations that reflect the behaviors of the specification. Evaluation of STADA on a variety of LTLf specifications formalized in SCENEFLOW using three complementary coverage criteria demonstrates that STADA yields more than 2x higher coverage than the best baseline on the finest criterion and a 75% increase on the coarsest criterion. Moreover, it matches the coverage of the best baseline with 6 times fewer simulations. While set in the context of autonomous driving, the approach is applicable to other domains with rich simulation environments.
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art on the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics
Teleoperation of low-cost robotic manipulators remains challenging due to the complexity of mapping human hand articulations to robot joint commands. We present an offline hand-shadowing and retargeting pipeline from a single egocentric RGB-D camera mounted on 3D-printed glasses. The pipeline detects 21 hand landmarks per hand using MediaPipe Hands, deprojects them into 3D via depth sensing, transforms them into the robot coordinate frame, and solves a damped-least-squares inverse kinematics problem in PyBullet to produce joint commands for the 6-DOF SO-ARM101 robot. A gripper controller maps thumb-index finger geometry to grasp aperture with a four-level fallback hierarchy. Actions are first previewed in a physics simulation before replay on the physical robot through the LeRobot framework. We evaluate the IK retargeting pipeline on a structured pick-and-place benchmark (5-tile grid, 10 grasps per tile) achieving a 90% success rate, and compare it against four vision-language-action policies (ACT, SmolVLA, pi0.5, GR00T N1.5) trained on leader-follower teleoperation data. We also test the IK pipeline in unstructured real-world environments (grocery store, pharmacy), where hand occlusion by surrounding objects reduces success to 9.3% (N=75), highlighting both the promise and current limitations of marker-free analytical retargeting.
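To make the damped-least-squares step in the hand-shadowing abstract concrete, here is a minimal standalone sketch on a hypothetical two-link planar arm. It does not use PyBullet or the SO-ARM101 kinematics. The link lengths, damping value, and function names are assumptions. The update is the standard DLS rule `dq = J^T (J J^T + lambda^2 I)^-1 e`.

```python
import math

L1, L2 = 1.0, 1.0  # illustrative link lengths, not the real robot's

def fk(q):
    """End-effector position of a 2-link planar arm."""
    return (L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1]),
            L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1]))

def jacobian(q):
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]

def dls_step(q, target, damping=0.1):
    """One damped-least-squares update toward target."""
    x, y = fk(q)
    ex, ey = target[0] - x, target[1] - y
    J = jacobian(q)
    # A = J J^T + damping^2 I (symmetric 2x2, always invertible for damping > 0)
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    # Solve A w = e, then map back to joint space with J^T
    wx = (d * ex - b * ey) / det
    wy = (-b * ex + a * ey) / det
    return (q[0] + J[0][0] * wx + J[1][0] * wy,
            q[1] + J[0][1] * wx + J[1][1] * wy)

def solve_ik(q, target, iterations=50, damping=0.1):
    for _ in range(iterations):
        q = dls_step(q, target, damping)
    return q
```

Starting from `q = (0.2, 1.2)`, `solve_ik(q, (1.0, 1.0))` drives the end-effector to the target. The damping term keeps the step bounded near singular configurations, at the cost of slightly slower convergence.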
D-SLAMSpoof: An Environment-Agnostic LiDAR Spoofing Attack using Dynamic Point Cloud Injection
In this work, we introduce Dynamic SLAMSpoof (D-SLAMSpoof), a novel attack that compromises LiDAR SLAM even in feature-rich environments. The attack leverages LiDAR spoofing, which injects spurious measurements into LiDAR scans through external laser interference. By designing both spatial injection shapes and temporally coordinated dynamic injection patterns guided by scan-matching principles, D-SLAMSpoof significantly improves attack success rates in real-world, feature-rich environments such as urban areas and indoor spaces, where conventional LiDAR spoofing methods often fail. Furthermore, we propose a practical defense method, ISD-SLAM, that relies solely on inertial dead reckoning signals commonly available in autonomous systems. We demonstrate that ISD-SLAM accurately detects LiDAR spoofing attacks, including D-SLAMSpoof, and effectively mitigates the resulting position drift. Our findings expose inherent vulnerabilities in LiDAR-based SLAM and introduce the first practical defense against LiDAR-based SLAM spoofing using only standard onboard sensors, providing critical insights for improving the security and reliability of autonomous systems.
MirrorDrift: Actuated Mirror-Based Attacks on LiDAR SLAM
LiDAR SLAM provides high-accuracy localization but is fragile to point-cloud corruption because scan matching assumes geometric consistency. Prior physical attacks on LiDAR SLAM largely rely on LiDAR spoofing via external signal injection, which requires sensor-specific timing knowledge and is increasingly mitigated by modern defense mechanisms such as timing obfuscation and injection rejection. In this work, we show that specular reflection offers an injection-free alternative and demonstrate an attack, MirrorDrift, that uses an actuated planar mirror to cause ghost points in LiDAR scans and systematically bias scan-matching correspondences. MirrorDrift optimizes mirror placement, alignment, and actuation. In simulation, it increases the average pose error (APE) by 6.1x over random placement, degrading three SLAM systems to 2.29-3.31 m mean APE. In real-world experiments on a modern LiDAR with state-of-the-art interference mitigation, it induces localization errors of up to 6.03 m. To the best of our knowledge, this is the first successful SLAM-targeted attack against production-grade secure LiDARs.
Novelty Adaptation Through Hybrid Large Language Model (LLM)-Symbolic Planning and LLM-guided Reinforcement Learning
In dynamic open-world environments, autonomous agents often encounter novelties that hinder their ability to find plans to achieve their goals. Specifically, traditional symbolic planners fail to generate plans when the robot's planning domain lacks the operators that enable it to interact appropriately with novel objects in the environment. We propose a neuro-symbolic architecture that integrates symbolic planning, reinforcement learning, and a large language model (LLM) to learn how to handle novel objects. In particular, we leverage the common sense reasoning capability of the LLM to identify missing operators, generate plans with the symbolic AI planner, and write reward functions to guide the reinforcement learning agent in learning control policies for newly identified operators. Our method outperforms the state-of-the-art methods in operator discovery as well as operator learning in continuous robotic domains.
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning CVPR 2026
Humanoid robotics has strong potential to transform daily service and caregiving applications. Although recent advances in general motion tracking within physics engines (GMT) have enabled virtual characters and humanoid robots to reproduce a broad range of human motions, these behaviors are primarily limited to contact-less social interactions or isolated movements. Assistive scenarios, by contrast, require continuous awareness of a human partner and rapid adaptation to their evolving posture and dynamics. In this paper, we formulate the imitation of closely interacting, force-exchanging human-human motion sequences as a multi-agent reinforcement learning problem. We jointly train partner-aware policies for both the supporter (assistant) agent and the recipient agent in a physics simulator to track assistive motion references. To make this problem tractable, we introduce a partner policies initialization scheme that transfers priors from single-human motion-tracking controllers, greatly improving exploration. We further propose dynamic reference retargeting and contact-promoting reward, which adapt the assistant's reference motion to the recipient's real-time pose and encourage physically meaningful support. We show that AssistMimic is the first method capable of successfully tracking assistive interaction motions on established benchmarks, demonstrating the benefits of a multi-agent RL formulation for physically grounded and socially aware humanoid control.
comment: Accepted at CVPR 2026 (main). Project page: https://yutoshibata07.github.io/AssistMimic-projectpage/
ADMM-based Continuous Trajectory Optimization in Graphs of Convex Sets
This paper presents a numerical solver for computing continuous trajectories in non-convex environments. Our approach relies on a customized implementation of the Alternating Direction Method of Multipliers (ADMM) built upon two key components: First, we parameterize trajectories as polynomials, allowing the primal update to be computed in closed form as a minimum-control-effort problem. Second, we introduce the concept of a spatio-temporal allocation graph based on a mixed-integer formulation and pose the slack update as a shortest-path search. The combination of these ingredients results in a solver with several distinct advantages over the state of the art. By jointly optimizing over both discrete spatial and continuous temporal domains, our method accesses a larger search space than existing decoupled approaches, enabling the discovery of superior trajectories. Additionally, the solver's structural robustness ensures reliable convergence from naive initializations, removing the bottleneck of complex warm starting in non-convex environments.
Distributed Kalman--Consensus Filtering with Adaptive Uncertainty Weighting for Multi-Object Tracking in Mobile Robot Networks
This paper presents an implementation and evaluation of a Distributed Kalman--Consensus Filter (DKCF) for Multi-Object Tracking (MOT) in mobile robot networks operating under partial observability and heterogeneous localization uncertainty. A key challenge in such systems is the fusion of information from agents with differing localization quality, where frame misalignment can lead to inconsistent estimates, track duplication, and ghost tracks. To address this issue, we build upon the MOTLEE framework and retain its frame-alignment methodology, which uses consistently tracked dynamic objects as transient landmarks to improve relative pose estimates between robots. On top of this framework, we propose an uncertainty-aware adaptive consensus weighting mechanism that dynamically adjusts the influence of neighbor information based on the covariance of the transmitted estimates, thereby reducing the impact of unreliable data during distributed fusion. Local tracking is performed using a Kalman Filter (KF) with a Constant Velocity Model (CVM) and Global Nearest Neighbor (GNN) data association. Simulation results demonstrate that adaptive weighting effectively protects local estimates from inconsistent data, yielding a MOTA improvement of 0.09 for agents suffering from localization drift, although system performance remains constrained by communication latency.
comment: Presented at ICARA 2026. To appear in the IEEE conference proceedings
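One way to picture the uncertainty-aware consensus weighting in the DKCF abstract is to down-weight neighbors whose transmitted covariance is large. The abstract does not state the weighting rule, so the inverse-trace weights, the `gain` parameter, and the function names below are assumptions made purely for illustration.

```python
def adaptive_weights(cov_traces):
    """Normalized weights inversely proportional to each neighbor's covariance trace."""
    inverse = [1.0 / t for t in cov_traces]
    total = sum(inverse)
    return [w / total for w in inverse]

def consensus_update(x_local, neighbor_estimates, cov_traces, gain=0.5):
    """Pull the local estimate toward a confidence-weighted mean of neighbor estimates."""
    weights = adaptive_weights(cov_traces)
    correction = [
        sum(w * (xj[k] - x_local[k]) for w, xj in zip(weights, neighbor_estimates))
        for k in range(len(x_local))
    ]
    return [x + gain * c for x, c in zip(x_local, correction)]
```

With covariance traces `[1.0, 4.0]`, the weights are 0.8 and 0.2. An estimate from a drifting neighbor therefore moves the local track far less than one from a well-localized neighbor.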
A Causal Approach to Predicting and Improving Human Perceptions of Social Navigation Robots
As mobile robots are increasingly deployed in human environments, enabling them to predict how people perceive them is critical for socially adaptable navigation. Predicting perceptions is challenging for two main reasons: (1) HRI prediction models must learn from limited data, and (2) the obtained models must be interpretable to enable safe and effective interactions. Interpretability is particularly important when a robot is perceived as incompetent (e.g., when the robot suddenly stops or rotates away from the goal), as it allows the robot to explain its reasoning and identify controllable factors to improve performance, requiring causal rather than associative reasoning. To address these challenges, we propose a Causal Bayesian Network designed to predict how people perceive a mobile robot's competence and how they interpret its intent during navigation. Additionally, we introduce a novel method to improve perceived robot competence employing a combinatorial search, guided by the proposed causal model, to identify better navigation behaviors. Our method enhances interpretability and generates counterfactual robot motions while achieving comparable or superior predictive performance to state-of-the-art methods, reaching F1-scores of 0.78 and 0.75 for competence and intention, respectively, on a binary scale. To further assess our method's ability to improve the perceived robot competence, we conducted an online evaluation in which users rated robot behaviors on a 5-point Likert scale. Our method statistically significantly increased the perceived competence of low-competence robot behavior by 83%.
comment: 8 pages, to be submitted to RA-L
Multi-Robot Multitask Gaussian Process Estimation and Coverage
Coverage control is essential for the optimal deployment of agents to monitor or cover areas with sensory demands. While traditional coverage involves single-task robots, increasing autonomy now enables multitask operations. This paper introduces a novel multitask coverage problem and addresses it for both the cases of known and unknown sensory demands. For known demands, we design a federated multitask coverage algorithm and establish its convergence properties. For unknown demands, we employ a multitask Gaussian Process (GP) framework to learn sensory demand functions and integrate it with the multitask coverage algorithm to develop an adaptive algorithm. We introduce a novel notion of multitask coverage regret that compares the performance of the adaptive algorithm against an oracle with prior knowledge of the demand functions. We establish that our algorithm achieves sublinear cumulative regret, and numerically illustrate its performance.
Robust Co-design Optimisation for Agile Fixed-Wing UAVs
Co-design optimisation of autonomous systems has emerged as a powerful alternative to sequential approaches by jointly optimising physical design and control strategies. However, existing frameworks often neglect the robustness required for autonomous systems navigating unstructured, real-world environments. For agile Unmanned Aerial Vehicles (UAVs) operating at the edge of the flight envelope, this lack of robustness yields designs that are sensitive to perturbations and model mismatch. To address this, we propose a robust co-design framework for agile fixed-wing UAVs that integrates parametric uncertainty and wind disturbances directly into the concurrent optimisation process. Our bi-level approach optimises physical design in a high-level loop while discovering nominal solutions via a constrained trajectory planner and evaluating performance across a stochastic Monte Carlo ensemble using feedback LQR control. Validated across three agile flight missions, our strategy consistently outperforms deterministic baselines. The results demonstrate that our robust co-design strategy inherently tailors aerodynamic features, such as wing placement and aspect ratio, to achieve an optimal trade-off between mission performance and disturbance rejection.
ResWM: Residual-Action World Model for Visual RL KDD2026
Learning predictive world models from raw visual observations is a central challenge in reinforcement learning (RL), especially for robotics and continuous control. Conventional model-based RL frameworks directly condition future predictions on absolute actions, which makes optimization unstable: the optimal action distributions are task-dependent, unknown a priori, and often lead to oscillatory or inefficient control. To address this, we introduce the Residual-Action World Model (ResWM), a new framework that reformulates the control variable from absolute actions to residual actions -- incremental adjustments relative to the previous step. This design aligns with the inherent smoothness of real-world control, reduces the effective search space, and stabilizes long-horizon planning. To further strengthen the representation, we propose an Observation Difference Encoder that explicitly models the changes between adjacent frames, yielding compact latent dynamics that are naturally coupled with residual actions. ResWM is integrated into a Dreamer-style latent dynamics model with minimal modifications and no extra hyperparameters. Both imagination rollouts and policy optimization are conducted in the residual-action space, enabling smoother exploration, lower control variance, and more reliable planning. Empirical results on the DeepMind Control Suite demonstrate that ResWM achieves consistent improvements in sample efficiency, asymptotic returns, and control smoothness, significantly surpassing strong baselines such as Dreamer and TD-MPC. Beyond performance, ResWM produces more stable and energy-efficient action trajectories, a property critical for robotic systems deployed in real-world environments. These findings suggest that residual action modeling provides a simple yet powerful principle for bridging algorithmic advances in RL with the practical requirements of robotics.
comment: Submitted to KDD 2026
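The residual-action reformulation in the ResWM abstract can be sketched as follows: the policy outputs an increment, and the executed action is the previous action plus that increment. The clipping bounds and function names are assumptions, since the abstract does not say how actions are bounded.

```python
def apply_residual(prev_action, residual, low=-1.0, high=1.0):
    """Executed action = previous action + residual, clipped to assumed actuator limits."""
    return [min(high, max(low, a + d)) for a, d in zip(prev_action, residual)]

def rollout_actions(initial_action, residuals, low=-1.0, high=1.0):
    """Accumulate a sequence of residuals into the absolute actions actually executed."""
    actions = []
    action = list(initial_action)
    for residual in residuals:
        action = apply_residual(action, residual, low, high)
        actions.append(action)
    return actions
```

Because each step only nudges the previous command, small residuals produce smooth action sequences by construction. This is the smoothness property the abstract attributes to the residual-action space.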
RC-NF: Robot-Conditioned Normalizing Flow for Real-Time Anomaly Detection in Robotic Manipulation CVPR
Recent advances in Vision-Language-Action (VLA) models have enabled robots to execute increasingly complex tasks. However, VLA models trained through imitation learning struggle to operate reliably in dynamic environments and often fail under Out-of-Distribution (OOD) conditions. To address this issue, we propose Robot-Conditioned Normalizing Flow (RC-NF), a real-time monitoring model for robotic anomaly detection and intervention that ensures the robot's state and the object's motion trajectory align with the task. RC-NF decouples the processing of task-aware robot and object states within the normalizing flow. It requires only positive samples for unsupervised training and calculates accurate robotic anomaly scores during inference through the probability density function. We further present LIBERO-Anomaly-10, a benchmark comprising three categories of robotic anomalies for simulation evaluation. RC-NF achieves state-of-the-art performance across all anomaly types compared to previous methods in monitoring robotic tasks. Real-world experiments demonstrate that RC-NF operates as a plug-and-play module for VLA models (e.g., pi0), providing a real-time OOD signal that enables state-level rollback or task-level replanning when necessary, with a response latency under 100 ms. These results demonstrate that RC-NF noticeably enhances the robustness and adaptability of VLA-based robotic systems in dynamic environments.
comment: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure
Embodied intelligence is a key step towards Artificial General Intelligence (AGI), yet its development faces multiple challenges including data, frameworks, infrastructure, and evaluation systems. To address these issues, we have, for the first time in the industry, launched a cloud-based, thousand-GPU distributed training platform for embodied intelligence, built upon the widely adopted LeRobot framework, and have systematically overcome bottlenecks across the entire pipeline. At the data layer, we have restructured the data pipeline to optimize the flow of embodied training data. In terms of training, for the GR00T-N1.5 model, utilizing thousand-GPU clusters and data at the scale of hundreds of millions, the single-round training time has been reduced from 15 hours to just 22 minutes, achieving a 40-fold speedup. At the model layer, by combining variable-length FlashAttention and Data Packing, we have moved from sample redundancy to sequence integration, resulting in a 188% speed increase; π-0.5 attention optimization has accelerated training by 165%; and FP8 quantization has delivered a 140% speedup. On the infrastructure side, relying on high-performance storage, a 3.2T RDMA network, and a Ray-driven elastic AI data lake, we have achieved deep synergy among data, storage, communication, and computation. We have also built an end-to-end evaluation system, creating a closed loop from training to simulation to assessment. This framework has already been fully validated on thousand-GPU clusters, laying a crucial technical foundation for the development and application of next-generation autonomous intelligent robots, and is expected to accelerate the arrival of the era of human-machine integration.
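The headline speedup in the abstract above can be checked with simple arithmetic. The sketch below recomputes the 15-hour to 22-minute ratio. It also converts a "percent increase" into a multiplicative factor under one assumed reading: a 188% increase means 2.88x the original speed. The abstract does not say whether its percentages are increases over baseline or ratios to baseline, so the second function illustrates only one interpretation.

```python
def speedup(before_minutes, after_minutes):
    """Ratio of old to new wall-clock time."""
    return before_minutes / after_minutes

def increase_to_factor(percent_increase):
    """Multiplicative factor if the percentage is read as an increase over baseline (assumption)."""
    return 1.0 + percent_increase / 100.0

ratio = speedup(15 * 60, 22)        # 900 / 22, roughly 40.9
factor = increase_to_factor(188.0)  # 2.88 under the assumed reading
```

The computed ratio of about 40.9 is consistent with the "40-fold" figure quoted in the abstract.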
A Survey of Reasoning in Autonomous Driving Systems: Open Challenges and Emerging Paradigms
The development of high-level autonomous driving (AD) is shifting from perception-centric limitations to a more fundamental bottleneck, namely, a deficit in robust and generalizable reasoning. Although current AD systems manage structured environments, they consistently falter in long-tail scenarios and complex social interactions that require human-like judgment. Meanwhile, the advent of large language and multimodal models (LLMs and MLLMs) presents a transformative opportunity to integrate a powerful cognitive engine into AD systems, moving beyond pattern matching toward genuine comprehension. However, a systematic framework to guide this integration is critically lacking. To bridge this gap, we provide a comprehensive review of this emerging field and argue that reasoning should be elevated from a modular component to the system's cognitive core. Specifically, we first propose a novel Cognitive Hierarchy to decompose the monolithic driving task according to its cognitive and interactive complexity. Building on this, we further derive and systematize seven core reasoning challenges, such as the responsiveness-reasoning trade-off and social-game reasoning. Furthermore, we conduct a dual-perspective review of the state-of-the-art, analyzing both system-centric approaches to architecting intelligent agents and evaluation-centric practices for their validation. Our analysis reveals a clear trend toward holistic and interpretable "glass-box" agents. In conclusion, we identify a fundamental and unresolved tension between the high-latency, deliberative nature of LLM-based reasoning and the millisecond-scale, safety-critical demands of vehicle control. For future work, a primary objective is to bridge the symbolic-to-physical gap by developing verifiable neuro-symbolic architectures, robust reasoning under uncertainty, and scalable models for implicit social negotiation.
comment: Published in TMLR (March 2026) | OpenReview: https://openreview.net/forum?id=XwQ7dc4bqn
Edge-Assisted Multi-Robot Visual-Inertial SLAM with Efficient Communication
The integration of cloud computing and edge computing is an effective way to achieve globally consistent, real-time multi-robot Simultaneous Localization and Mapping (SLAM). Cloud computing effectively addresses the limited computing, communication, and storage capacity of terminal equipment. However, limited bandwidth and extremely long communication links between terminal devices and the cloud seriously degrade the performance of multi-robot SLAM systems. To reduce the computational cost of feature tracking and improve the robot's real-time performance, we propose a lightweight SLAM method that performs optical flow tracking based on pyramid IMU prediction. On this basis, we propose a centralized multi-robot SLAM system built on a robot-edge-cloud layered architecture to realize real-time collaborative SLAM, which avoids the limited on-board computing resources and low execution efficiency of a single robot. In this framework, only feature points and keyframe descriptors are transmitted, and they are losslessly encoded and compressed to enable real-time remote transmission under limited bandwidth. This design reduces the bandwidth actually occupied during data transmission without the loss of SLAM accuracy that data compression would otherwise cause. Experiments on the EuRoC dataset show that, compared with the most advanced current local feature compression method, our method transmits features with a lower data volume, and compared with the current advanced centralized multi-robot SLAM scheme, it achieves the same or better positioning accuracy under low computational load.
comment: 13 pages, 18 figures
Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments ICLR 2026
Group symmetries provide a powerful inductive bias for reinforcement learning (RL), enabling efficient generalization across symmetric states and actions via group-invariant Markov Decision Processes (MDPs). However, real-world environments almost never realize fully group-invariant MDPs; dynamics, actuation limits, and reward design usually break symmetries, often only locally. Under group-invariant Bellman backups for such cases, local symmetry-breaking introduces errors that propagate across the entire state-action space, resulting in global value estimation errors. To address this, we introduce Partially group-Invariant MDP (PI-MDP), which selectively applies group-invariant or standard Bellman backups depending on where symmetry holds. This framework mitigates error propagation from locally broken symmetries while maintaining the benefits of equivariance, thereby enhancing sample efficiency and generalizability. Building on this framework, we present practical RL algorithms -- Partially Equivariant (PE)-DQN for discrete control and PE-SAC for continuous control -- that combine the benefits of equivariance with robustness to symmetry-breaking. Experiments across Grid-World, locomotion, and manipulation benchmarks demonstrate that PE-DQN and PE-SAC significantly outperform baselines, highlighting the importance of selective symmetry exploitation for robust and sample-efficient RL. Project page: https://pranaboy72.github.io/perl_page/
comment: ICLR 2026
RACAS: Controlling Diverse Robots With a Single Agentic System
Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.
comment: 7 pages in main text + 1 page of appendices + 1 page of references, 5 figures in main text + 1 figure in appendices, 2 tables in main text; source code available at https://github.com/janprz11/robot-agnostic-control
vS-Graphs: Tightly Coupling Visual SLAM and 3D Scene Graphs Exploiting Hierarchical Scene Understanding
Current Visual Simultaneous Localization and Mapping (VSLAM) systems often struggle to create maps that are both semantically rich and easily interpretable. While incorporating semantic scene knowledge aids in building richer maps with contextual associations among mapped objects, representing them in structured formats, such as scene graphs, has not been widely addressed, resulting in complex map comprehension and limited scalability. This paper introduces vS-Graphs, a novel real-time VSLAM framework that integrates vision-based scene understanding with map reconstruction and comprehensible graph-based representation. The framework infers structural elements (i.e., rooms and floors) from detected building components (i.e., walls and ground surfaces) and incorporates them into optimizable 3D scene graphs. This solution enhances the reconstructed map's semantic richness, comprehensibility, and localization accuracy. Extensive experiments on standard benchmarks and real-world datasets demonstrate that vS-Graphs achieves an average of 15.22% accuracy gain across all tested datasets compared to state-of-the-art VSLAM methods. Furthermore, the proposed framework achieves environment-driven semantic entity detection accuracy comparable to that of precise LiDAR-based frameworks, using only visual features. The code is publicly available at https://github.com/snt-arg/visual_sgraphs and is actively being improved. Moreover, a web page containing more media and evaluation outcomes is available at https://snt-arg.github.io/vsgraphs-results/.
comment: 20 pages, 10 figures, 5 tables
A Chain-Driven, Sandwich-Legged Quadruped Robot: Design and Experimental Analysis
This paper introduces a chain-driven, sandwich-legged mid-size quadruped robot designed as an accessible research platform. The design prioritizes enhanced locomotion, improved actuation reliability and safety, and simplified, cost-effective manufacturing. Locomotion performance is improved through a sandwiched leg architecture and dual-motor configuration, reducing leg inertia for agile motion. Reliability and safety are enhanced using robust cable strain reliefs, motor heat sinks for thermal management, and mechanical limits to restrict leg motion. The design incorporates quasi-direct-drive (QDD) actuators and low-cost fabrication methods such as laser cutting and 3D printing for rapid prototyping. The $25\,\mathrm{kg}$ robot is built for under \$8000, providing an affordable quadruped research platform. Experiments demonstrate trot and crawl gaits on flat terrain and slopes. We also open-source the mechanical designs. VIDEO: https://youtu.be/ygSMCPcFnP8?feature=shared CADs: https://github.com/singhaman1750/stoch3-design.git
comment: 6 pages, 9 figures
CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents
While current navigation benchmarks prioritize task success in simplified settings, they neglect the multidimensional economic constraints essential for the real-world commercialization of autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents through comprehensive economic cost-revenue analysis aligned with real-world business operations. By integrating industry-standard data--such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports--with Isaac Sim's detailed collision and cargo dynamics, CostNav transcends simple task completion to accurately evaluate business value in complex, real-world scenarios. To our knowledge, CostNav is the first physics-grounded economic benchmark that uses industry-standard regulatory and financial data to quantitatively expose the gap between navigation research metrics and commercial viability, revealing that optimizing for task success on a simplified task fundamentally differs from optimizing for real-world economic deployment. Evaluating seven baselines--two rule-based and five imitation learning--we find that no current method is economically viable, all yielding negative contribution margins. The best-performing method, CANVAS (-27.36\$/run), equipped with only an RGB camera and GPS, outperforms LiDAR-equipped Nav2 w/ GPS (-35.46\$/run). We challenge the community to develop navigation policies that achieve economic viability on CostNav. We remain method-agnostic, evaluating success solely on cost rather than the underlying architecture. All resources are available at https://github.com/worv-ai/CostNav.
Cross-embodied Co-design for Dexterous Hands
Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework will be open-sourced and available on our website: https://an-axolotl.github.io/HouseofDextra/ .
CompassNav: Steering From Path Imitation To Decision Understanding In Navigation
The dominant paradigm for training Large Vision-Language Models (LVLMs) in navigation relies on imitating expert trajectories. This approach reduces the complex navigation task to a sequence-to-sequence replication of a single correct path, fundamentally limiting the agent's ability to explore and generalize. In this work, we argue for and introduce a new paradigm: a shift from Path Imitation to Decision Understanding. The goal of this paradigm is to build agents that do not just follow, but truly understand how to navigate. We materialize this through two core contributions: first, we introduce Compass-Data-22k, a novel 22k-trajectory dataset. Its Reinforcement Fine-Tuning (RFT) subset provides a panoramic view of the decision landscape by annotating all feasible actions with A* geodesic distances. Second, we design a novel gap-aware hybrid reward function that dynamically adapts its feedback to decision certainty, shifting between decisive signals for optimal actions and nuanced scores to encourage exploration. Integrated into an SFT-then-RFT recipe, our CompassNav agent is trained not to memorize static routes, but to develop an internal compass that constantly intuits the direction to the goal by evaluating the relative quality of all possible moves. This approach enables our 7B agent to set a new state-of-the-art on Goal navigation benchmarks, outperforming even larger proprietary models, and achieve robust real-world goal navigation on a physical robot.
SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space
Offline-to-online reinforcement learning (RL) offers a promising paradigm for robotics by pre-training policies on safe, offline demonstrations and fine-tuning them via online interaction. However, a fundamental challenge remains: how to safely explore online without deviating from the behavioral support of the offline data? While recent methods leverage conditional variational autoencoders (CVAEs) to bound exploration within a latent space, they inherently suffer from an exploitation gap -- a performance ceiling imposed by the decoder's reconstruction loss. We introduce SPAARS, a curriculum learning framework that initially constrains exploration to the low-dimensional latent manifold for sample-efficient, safe behavioral improvement, then seamlessly transfers control to the raw action space, bypassing the decoder bottleneck. SPAARS has two instantiations: the CVAE-based variant requires only unordered (s,a) pairs and no trajectory segmentation; SPAARS-SUPE pairs SPAARS with OPAL temporal skill pretraining for stronger exploration structure at the cost of requiring trajectory chunks. We prove an upper bound on the exploitation gap using the Performance Difference Lemma, establish that latent-space policy gradients achieve provable variance reduction over raw-space exploration, and show that concurrent behavioral cloning during the latent phase directly controls curriculum transition stability. Empirically, SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 versus 0.75 for SUPE, with 5x better sample efficiency; standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2 respectively, surpassing IQL baselines of 66.3 and 78.3 respectively, confirming the utility of the unordered-pair CVAE instantiation.
comment: 9 pages
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation CVPR 2026
Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present \textit{Context-Nav} that elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers -- guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
comment: Camera-ready version. Accepted to CVPR 2026
Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Training resource-constrained autonomous agents on multiple tasks simultaneously is crucial for adapting to diverse real-world environments. Recent works employ a reinforcement learning (RL) approach, but they still suffer from sub-optimal multi-task performance due to task interference. State-of-the-art works employ Spiking Neural Networks (SNNs) to improve RL-based multi-task learning and enable low-power/energy operations through network enhancements and spike-driven data stream processing. However, they rely on fixed task-switching intervals during training, which limits their performance and scalability. To address this, we propose SwitchMT, a novel methodology that employs adaptive task-switching for effective, scalable, and simultaneous multi-task learning. SwitchMT employs the following key ideas: (1) leveraging a Deep Spiking Q-Network with active dendrites and a dueling structure that utilizes task-specific context signals to create specialized sub-networks; and (2) devising an adaptive task-switching policy that leverages both rewards and internal dynamics of the network parameters. Experimental results demonstrate that SwitchMT achieves competitive scores in multiple Atari games (i.e., Pong: -8.8, Breakout: 5.6, and Enduro: 355.2) and longer game episodes as compared to the state-of-the-art. These results also highlight the effectiveness of the SwitchMT methodology in addressing task interference without increasing the network complexity, enabling intelligent autonomous agents with scalable multi-task learning capabilities.
comment: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC), July 26-29, 2026 in Long Beach, CA, USA Codes: https://github.com/rachmadvwp/SwitchMT
Robust Cooperative Localization in Featureless Environments: A Comparative Study of DCL, StCL, CCL, CI, and Standard-CL
Cooperative localization (CL) enables accurate position estimation in multi-robot systems operating in GPS-denied environments. This paper presents a comparative study of five CL approaches: Centralized Cooperative Localization (CCL), Decentralized Cooperative Localization (DCL), Sequential Cooperative Localization (StCL), Covariance Intersection (CI), and Standard Cooperative Localization (Standard-CL). All methods are implemented in ROS and evaluated through Monte Carlo simulations under two conditions: weak data association and robust detection. Our analysis reveals fundamental trade-offs among the methods. StCL and Standard-CL achieve the lowest position errors but exhibit severe filter inconsistency, making them unsuitable for safety-critical applications. DCL demonstrates remarkable stability under challenging conditions due to its measurement stride mechanism, which provides implicit regularization against outliers. CI emerges as the most balanced approach, achieving near-optimal consistency while maintaining competitive accuracy. CCL provides theoretically optimal estimation but shows sensitivity to measurement outliers. These findings offer practical guidance for selecting CL algorithms based on application requirements.
comment: Presented at the 2026 12th International Conference on Automation, Robotics and Applications (ICARA); to be published in IEEE conference proceedings
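Of the five approaches compared above, Covariance Intersection has a compact closed form: two estimates with unknown cross-correlation are fused by a convex combination of their information matrices. The sketch below illustrates that standard rule for 2D position estimates. It is not the paper's ROS implementation; the helper names, the grid search over the weight, and the trace-minimization criterion are all assumptions made for illustration.

```python
# Minimal sketch of Covariance Intersection (CI) for 2D estimates.
# Assumptions (not from the paper): pure-Python 2x2 helpers, weight w
# chosen by grid search to minimize the trace of the fused covariance.

def inv2(m):
    """Invert a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("singular matrix")
    return ((d / det, -b / det), (-c / det, a / det))

def add2(m, n):
    """Element-wise sum of two 2x2 matrices."""
    return tuple(tuple(m[i][j] + n[i][j] for j in range(2)) for i in range(2))

def scale2(s, m):
    """Multiply a 2x2 matrix by a scalar."""
    return tuple(tuple(s * m[i][j] for j in range(2)) for i in range(2))

def matvec2(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return tuple(m[i][0] * v[0] + m[i][1] * v[1] for i in range(2))

def covariance_intersection(xa, Pa, xb, Pb, steps=100):
    """Fuse (xa, Pa) and (xb, Pb) with the CI rule.

    P^-1 = w * Pa^-1 + (1 - w) * Pb^-1
    x    = P * (w * Pa^-1 * xa + (1 - w) * Pb^-1 * xb)
    Returns the fused mean, fused covariance, and the chosen weight.
    """
    Ia, Ib = inv2(Pa), inv2(Pb)
    best = None
    for k in range(steps + 1):
        w = k / steps
        P = inv2(add2(scale2(w, Ia), scale2(1.0 - w, Ib)))
        trace = P[0][0] + P[1][1]
        if best is None or trace < best[0]:
            va = matvec2(scale2(w, Ia), xa)
            vb = matvec2(scale2(1.0 - w, Ib), xb)
            x = matvec2(P, (va[0] + vb[0], va[1] + vb[1]))
            best = (trace, x, P, w)
    _, x, P, w = best
    return x, P, w

if __name__ == "__main__":
    # Robot A is confident in x, robot B is confident in y.
    x, P, w = covariance_intersection(
        (0.0, 0.0), ((1.0, 0.0), (0.0, 4.0)),
        (2.0, 2.0), ((4.0, 0.0), (0.0, 1.0)),
    )
    print(x, P, w)
```

Because the fused information matrix is a convex combination rather than a sum, the result stays conservative even when the two estimates share information.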
Cosmos-H-Surgical: Learning Surgical Robot Policies from Videos via World Modeling
Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large-scale vision-language-action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video-action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from Cosmos-H-Surgical, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action descriptions specifically for surgical robots. Then we built Cosmos-H-Surgical based on the most advanced physical AI world model and SATA. It can generate diverse, generalizable and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudokinematics from synthetic surgical videos, producing synthetic paired video-action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data-efficient surgical robot policies.
SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Reactive Long-Horizon Manipulation ICRA 2026
Multi-step manipulation in dynamic environments remains challenging. Imitation learning (IL) is reactive but lacks compositional generalization, since monolithic policies do not decide which skill to reuse when scenes change. Classical task-and-motion planning (TAMP) offers compositionality, but its high planning latency prevents real-time failure recovery. We introduce SymSkill, a unified framework that jointly learns predicates, operators, and skills from unlabeled, unsegmented demonstrations, combining compositional generalization with real-time recovery. Offline, SymSkill learns symbolic abstractions and goal-oriented skills directly from demonstrations. Online, given a conjunction of learned predicates, it uses a symbolic planner to compose and reorder skills to achieve symbolic goals while recovering from failures at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill supports safe execution under human and environmental disturbances. In RoboCasa simulation, SymSkill executes 12 single-step tasks with 85% success and composes them into multi-step plans without additional data. On a real Franka robot, it learns from 5 minutes of play data and performs 12-step tasks from goal specifications. Code and additional analysis are available at https://sites.google.com/view/symskill.
comment: ICRA 2026; CoRL 2025 Learning Effective Abstractions for Planning (LEAP) Workshop Best Paper Award (https://sites.google.com/view/symskill)
Open-World Task and Motion Planning via Vision-Language Model Generated Constraints
Foundation models like Vision-Language Models (VLMs) excel at common sense vision and language tasks such as visual question answering. However, they cannot yet directly solve complex, long-horizon robot manipulation problems requiring precise continuous reasoning. Task and Motion Planning (TAMP) systems can handle long-horizon reasoning through discrete-continuous hybrid search over parameterized skills, but rely on detailed environment models and cannot interpret novel human objectives, such as arbitrary natural language goals. We propose integrating VLMs into TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable open-world reasoning. Specifically, we use VLMs to generate discrete action ordering constraints that constrain TAMP search over action sequences, and continuous constraints in the form of code that augments traditional TAMP manipulation constraints. Experiments show that our approach, OWL-TAMP, outperforms baselines relying solely on TAMP or VLMs across several long-horizon manipulation tasks specified directly in natural language. We additionally demonstrate that OWL-TAMP can be deployed with an off-the-shelf TAMP system to solve challenging manipulation tasks on real-world hardware.
comment: A version of this paper appears in IEEE Robotics and Automation Letters (RA-L) Volume 11, Issue 3
PlayWorld: Learning Robot World Models from Autonomous Play
Action-conditioned video models offer a promising path to building general-purpose robot simulators that can improve directly from data. Yet, despite training on large-scale robot datasets, current state-of-the-art video models still struggle to predict physically consistent robot-object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high-fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success-biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self-play, enabling naturally scalable data collection while capturing complex, long-tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high-quality, physically consistent predictions for contact-rich interactions that are not captured by world models trained on human-collected data. We further demonstrate the versatility of PlayWorld in enabling fine-grained failure prediction and policy evaluation, with up to 40% improvements over human-collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.
comment: https://robot-playworld.github.io/
PvP: Data-Efficient Humanoid Robot Learning with Proprioceptive-Privileged Contrastive Representations
Achieving efficient and robust whole-body control (WBC) is essential for enabling humanoid robots to perform complex tasks in dynamic environments. Despite the success of reinforcement learning (RL) in this domain, its sample inefficiency remains a significant challenge due to the intricate dynamics and partial observability of humanoid robots. To address this limitation, we propose PvP, a Proprioceptive-Privileged contrastive learning framework that leverages the intrinsic complementarity between proprioceptive and privileged states. PvP learns compact and task-relevant latent representations without requiring hand-crafted data augmentations, enabling faster and more stable policy learning. To support systematic evaluation, we develop SRL4Humanoid, the first unified and modular framework that provides high-quality implementations of representative state representation learning (SRL) methods for humanoid robot learning. Extensive experiments on the LimX Oli robot across velocity tracking and motion imitation tasks demonstrate that PvP significantly improves sample efficiency and final performance compared to baseline SRL methods. Our study further provides practical insights into integrating SRL with RL for humanoid WBC, offering valuable guidance for data-efficient humanoid robot learning.
comment: 15 pages, 17 figures
Self-Improving Loops for Visual Robotic Planning ICLR 2026
Video generative models trained on expert demonstrations have been utilized as performant text-conditioned visual planners for solving robotic tasks. However, generalization to unseen tasks remains a challenge. Whereas improved generalization may be facilitated by leveraging learned prior knowledge from additional pre-collected offline data sources, such as web-scale video datasets, in the era of experience we aim to design agents that can continuously improve in an online manner from self-collected behaviors. In this work we thus propose the Self-Improving Loops for Visual Robotic Planning (SILVR), where an in-domain video model iteratively updates itself on self-produced trajectories, and steadily improves its performance for a specified task of interest. We apply SILVR to a diverse suite of MetaWorld tasks, as well as two manipulation tasks on a real robot arm, and find that performance improvements continuously emerge over multiple iterations for novel tasks unseen during initial in-domain video model training. We demonstrate that SILVR is robust in the absence of human-provided ground-truth reward functions or expert-quality demonstrations, and is preferable to alternate approaches that utilize online experience in terms of performance and sample efficiency.
comment: ICLR 2026. Project Page: https://diffusion-supervision.github.io/silvr/
MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent CVPR 2026
Recent Vision-Language-Action (VLA) models reformulate vision-language models by tuning them with millions of robotic demonstrations. While they perform well when fine-tuned for a single embodiment or task family, extending them to multi-skill settings remains challenging: directly merging VLA experts trained on different tasks results in near-zero success rates. This raises a fundamental question: what prevents VLAs from mastering multiple skills within one model? With an empirical decomposition of learnable parameters during VLA fine-tuning, we identify two key sources of non-mergeability: (1) Finetuning drives LoRA adapters in the VLM backbone toward divergent, task-specific directions beyond the capacity of existing merging methods to unify. (2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and preventing modular recombination. To address these challenges, we present MergeVLA, a merging-oriented VLA architecture that preserves mergeability by design. MergeVLA introduces sparsely activated LoRA adapters via task masks to retain consistent parameters and reduce irreconcilable conflicts in the VLM. Its action expert replaces self-attention with cross-attention-only blocks to keep specialization localized and composable. When the task is unknown, it uses a test-time task router to adaptively select the appropriate task mask and expert head from the initial observation, enabling unsupervised task inference. Across LIBERO, LIBERO-Plus, RoboTwin, and multi-task experiments on the real SO101 robotic arm, MergeVLA achieves performance comparable to or even exceeding individually finetuned experts, demonstrating robust generalization across tasks, embodiments, and environments. Project page: https://mergevla.github.io/
comment: Accepted to CVPR 2026
Moving On, Even When You're Broken: Fail-Active Trajectory Generation via Diffusion Policies Conditioned on Embodiment and Task
Robot failure is detrimental and disruptive, often requiring human intervention to recover. Our vision is 'fail-active' operation, allowing robots to safely complete their tasks even when damaged. Focusing on 'actuation failures', we introduce DEFT, a diffusion-based trajectory generator conditioned on the robot's current embodiment and task constraints. DEFT generalizes across failure types, supports constrained and unconstrained motions, and enables task completion under arbitrary failure. We evaluate DEFT in both simulation and real-world scenarios using a 7-DoF robotic arm. DEFT outperforms its baselines over thousands of failure conditions, achieving a 99.5% success rate for unconstrained motions versus RRT's 42.4%, and 46.4% for constrained motions versus differential IK's 30.9%. Furthermore, DEFT demonstrates robust zero-shot generalization by maintaining performance on failure conditions unseen during training. Finally, we perform real-world evaluations on two multi-step tasks, drawer manipulation and whiteboard erasing. These experiments demonstrate DEFT succeeding on tasks where classical methods fail. Our results show that DEFT achieves fail-active manipulation across arbitrary failure configurations and real-world deployments.
comment: To be published in the 2026 IEEE International Conference on Robotics & Automation
Pixel Motion Diffusion is What We Need for Robot Control CVPR 2026
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/
comment: Accepted to CVPR 2026. Project page: https://eronguyen.github.io/DAWN
UniFField: A Generalizable Unified Neural Feature Field for Visual, Semantic, and Spatial Uncertainties in Any Scene ICRA 2026
Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, it is necessary for the robot to evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero-shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimation. Our evaluation shows that the uncertainty estimates accurately describe the model prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage our feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.
comment: ICRA 2026 Project website: https://sites.google.com/view/uniffield
Time as a Control Dimension in Robot Learning
Temporal awareness plays a central role in intelligent behavior by shaping how actions are paced, coordinated, and adapted to changing goals and environments. In contrast, most robot learning algorithms treat time only as a fixed episode horizon or scheduling constraint. Here we introduce time-aware policy learning, a reinforcement learning framework that treats time as a control dimension of robot behavior. The approach augments policies with two temporal signals, the remaining time and a time ratio that modulates the policy's internal progression of time, allowing a single policy to regulate its execution strategy across temporal regimes. Across diverse manipulation tasks including long-horizon manipulation, granular-media pouring, articulated-object interaction, and multi-agent coordination, the resulting policies adapt their behavior continuously from dynamic execution under tight schedules to stable and deliberate interaction when more time is available. This temporal awareness improves efficiency, robustness under sim-to-real mismatch and disturbances, and controllability through human input without retraining. Treating time as a controllable variable provides a new framework for adaptive and human-aligned robot autonomy.
POrTAL: Plan-Orchestrated Tree Assembly for Lookahead IROS 26
Robots tasked in partially observable environments must plan efficiently and robustly to achieve task goals under uncertainty. Although many probabilistic planning algorithms exist for this purpose, these algorithms can be inefficient if executed with the robot's limited computational resources, or may produce policies that take more steps than expected to achieve the goal. We therefore created a new, lightweight, probabilistic planning algorithm, Plan-Orchestrated Tree Assembly for Lookahead (POrTAL), that combines the strengths of two baseline planning algorithms, FF-Replan and POMCP. We demonstrate that POrTAL is an anytime algorithm that generally outperforms these baselines in terms of the final executed plan length given bounded computation time, especially for problems with only moderate levels of uncertainty.
comment: Submitted to IROS 26
Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling
We present Generative Predictive Control (GPC), an inference-time method for improving pretrained behavior-cloning policies without retraining. GPC augments a frozen diffusion policy at deployment with an action-conditioned world model trained on expert demonstrations and random exploration rollouts. The world model predicts the consequences of action proposals generated by the diffusion policy and enables lightweight online planning that ranks and refines these proposals through model-based look-ahead. By combining a generative prior with predictive foresight, GPC enables test-time adaptation while keeping the original policy fixed. Across diverse robotic manipulation tasks, including state- and vision-based settings in both simulation and real-world experiments, GPC consistently outperforms standard behavior cloning and compares favorably with other inference-time adaptation baselines.
comment: Accepted to IEEE Robotics and Automation Letters. Website: https://computationalrobotics.seas.harvard.edu/GPC
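The proposal-ranking step in the GPC abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_world_model`, `rollout`, and `rank_proposals` are hypothetical names, the state is a single number, and the proposals are hard-coded where a frozen diffusion policy would sample them.

```python
from typing import Callable, List, Sequence


def toy_world_model(state: float, action: float) -> float:
    """Hypothetical one-step dynamics: the action simply shifts the state."""
    return state + action


def rollout(state: float, actions: Sequence[float],
            model: Callable[[float, float], float]) -> float:
    """Predict the final state reached by applying the actions in order."""
    for action in actions:
        state = model(state, action)
    return state


def rank_proposals(state: float, proposals: List[List[float]],
                   model: Callable[[float, float], float],
                   goal: float) -> List[List[float]]:
    """Sort proposals from best to worst by predicted distance to the goal."""
    return sorted(proposals,
                  key=lambda acts: abs(goal - rollout(state, acts, model)))


# Stand-ins for action sequences sampled from a frozen policy (assumption).
proposals = [[0.5, 0.5], [1.0, 1.0], [-0.2, 0.1]]
ranked = rank_proposals(0.0, proposals, toy_world_model, goal=2.0)
best = ranked[0]  # the proposal whose predicted outcome is closest to the goal
```

The policy itself is never modified here; only its proposals are scored, which is the sense in which the abstract describes the original policy as fixed.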
Human-Aware Robot Behaviour in Self-Driving Labs
Self-driving laboratories (SDLs) are rapidly transforming research in chemistry and materials science to accelerate new discoveries. Mobile robot chemists (MRCs) play a pivotal role by autonomously navigating the lab to transport samples, effectively connecting synthesis, analysis, and characterisation equipment. The instruments within an SDL are typically designed or retrofitted to be accessed by both human and robotic chemists, ensuring operational flexibility and integration between manual and automated workflows. In many scenarios, human and robotic chemists may need to use the same equipment simultaneously. Currently, MRCs rely on simple LiDAR-based obstruction detection, which forces the robot to passively wait if a human is present. This lack of situational awareness leads to unnecessary delays and inefficient coordination in time-critical automated workflows in human-robot shared labs. To address this, we present an initial study of an embodied, AI-driven perception method that facilitates proactive human-robot interaction in shared-access scenarios. Our method features a hierarchical human intention prediction model that allows the robot to distinguish between preparatory actions (waiting) and transient interactions (accessing the instrument). Our results demonstrate that the proposed approach enhances efficiency by enabling proactive human-robot interaction, streamlining coordination, and potentially increasing the efficiency of autonomous scientific labs.
Multiagent Systems
GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments ICRA 2026
Advancing Multi-Agent Pathfinding (MAPF) and Multi-Robot Motion Planning (MRMP) requires platforms that enable transparent, reproducible comparisons across modeling choices. Existing tools either scale under simplifying assumptions (grids, homogeneous agents) or offer higher fidelity with less comparable instrumentation. We present GRACE, a unified 2D simulator+benchmark that instantiates the same task at multiple abstraction levels (grid, roadmap, continuous) via explicit, reproducible operators and a common evaluation protocol. Our empirical results on public maps and representative planners enable commensurate comparisons on a shared instance set. Furthermore, we quantify the expected representation-fidelity trade-offs (MRMP solves instances at higher fidelity but lower speed, while grid/roadmap planners scale farther). By consolidating representation, execution, and evaluation, GRACE thereby aims to make cross-representation studies more comparable and provides a means to advance multi-robot planning research and its translation to practice.
comment: ICRA 2026, code will be released soon
COMIC: Agentic Sketch Comedy Generation
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.
comment: Project page: https://susunghong.github.io/COMIC/
LLMGreenRec: LLM-Based Multi-Agent Recommender System for Sustainable E-Commerce
Rising environmental awareness in e-commerce necessitates recommender systems that not only guide users to sustainable products but also minimize their own digital carbon footprints. Traditional session-based systems, optimized for short-term conversions, often fail to capture nuanced user intents for eco-friendly choices, perpetuating a gap between green intentions and actions. To tackle this, we introduce LLMGreenRec, a novel multi-agent framework that leverages Large Language Models (LLMs) to promote sustainable consumption. Through collaborative analysis of user interactions and iterative prompt refinement, LLMGreenRec's specialized agents deduce green-oriented user intents and prioritize eco-friendly product recommendations. Notably, this intent-driven approach also reduces unnecessary interactions and energy consumption. Extensive experiments on benchmark datasets validate LLMGreenRec's effectiveness in recommending sustainable products, demonstrating a robust solution that fosters a responsible digital economy.
comment: Accepted to the Proceedings of the Conference on Digital Economy and Fintech Innovation (DEFI 2025). To appear in IEEE Xplore
Resolving Java Code Repository Issues with iSWE Agent
Resolving issues on code repositories is an important part of software engineering. Various recent systems automatically resolve issues using large language models and agents, often with impressive performance. Unfortunately, most of these models and agents focus primarily on Python, and their performance on other programming languages is lower. In particular, a lot of enterprise software is written in Java, yet automated issue resolution for Java is under-explored. This paper introduces iSWE Agent, an automated issue resolver with an emphasis on Java. It consists of two sub-agents, one for localization and the other for editing. Both have access to novel tools based on rule-based Java static analysis and transformation. Using this approach, iSWE achieves state-of-the-art issue resolution rates across the Java splits of both Multi-SWE-bench and SWE-PolyBench. More generally, we hope that by combining the best of rule-based and model-based techniques, this paper contributes towards improving enterprise software development.
Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion ICASSP
Aligning large language models (LLMs) with human values is a central challenge for ensuring trustworthy and safe deployment. While existing methods such as Reinforcement Learning from Human Feedback (RLHF) and its variants have improved alignment, they often rely on a single evaluator or narrowly defined reward signals, limiting their ability to capture ethical pluralism. In this work, we propose the Value Alignment System using Combinatorial Fusion Analysis (VAS-CFA), a framework that operationalizes multi-agent fusion alignment. It instantiates multiple moral agents, each fine-tuned to represent a distinct normative perspective, and fuses their outputs using CFA with both rank- and score-based aggregation. This design leverages cognitive diversity between agents to mitigate conflicts and redundancies among them, producing responses that better reflect human values. Empirical evaluation demonstrates that VAS-CFA outperforms both single-agent baselines and prior aggregation approaches on standard metrics, showing that multi-agent fusion provides a robust and effective mechanism for advancing value alignment in LLMs.
comment: 5 pages, 3 figures, accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Edge-Assisted Multi-Robot Visual-Inertial SLAM with Efficient Communication
The integration of cloud computing and edge computing is an effective way to achieve globally consistent, real-time multi-robot Simultaneous Localization and Mapping (SLAM). Cloud computing effectively addresses the limited computing, communication, and storage capacity of terminal equipment. However, limited bandwidth and extremely long communication links between terminal devices and the cloud seriously degrade the performance of multi-robot SLAM systems. To reduce the computational cost of feature tracking and improve the real-time performance of the robot, we propose a lightweight SLAM method that uses optical flow tracking based on pyramid IMU prediction. On this basis, we propose a centralized multi-robot SLAM system with a robot-edge-cloud layered architecture to realize real-time collaborative SLAM. It avoids the problems of limited on-board computing resources and the low execution efficiency of a single robot. In this framework, only the feature points and keyframe descriptors are transmitted, and they are losslessly encoded and compressed to realize real-time remote information transmission with limited bandwidth resources. This design reduces the bandwidth occupied during data transmission without the loss of SLAM accuracy that lossy data compression would cause. In experiments on the EuRoC dataset, our method transmits a lower volume of feature data than the current most advanced local feature compression method, and it achieves the same or better positioning accuracy than a current advanced centralized multi-robot SLAM scheme under low computational load.
comment: 13 pages, 18 figures
The Yokai Learning Environment: Tracking Beliefs Over Space and Time IJCAI 2025
The ability to cooperate with unknown partners is a central challenge in cooperative AI and widely studied in the form of zero-shot coordination (ZSC), which evaluates an algorithm by measuring the performance of independently trained agents when paired. The Hanabi Learning Environment (HLE) has become the dominant benchmark for ZSC, but recent work has achieved near-perfect inter-seed cross-play performance, limiting its ability to track algorithmic progress. We introduce the Yokai Learning Environment (YLE) - an open-source multi-agent RL benchmark in which effective collaboration requires building common ground by tracking and updating beliefs over moving cards, reasoning under ambiguous hints, and deciding when to terminate the game based on inferred shared knowledge - features absent in the HLE, where beliefs are tied to hand slots and hints are truthful by rule. We evaluate the leading ZSC methods, including High-Entropy IPPO, Other-Play, and Off-Belief Learning, which achieve near-perfect inter-seed cross-play in the HLE, and show that in the YLE they exhibit persistent SP-XP gaps, degraded early-ending calibration, and weaker belief representations in cross-play, indicating failure to maintain consistent internal models with unseen partners. Methods that perform best in the HLE do not perform best in the YLE, indicating that progress measured on a single benchmark may not generalise. Together, these results establish YLE as a challenging new ZSC benchmark.
comment: A previous version was given as an oral presentation at the ToM IJCAI 2025 Workshop
RACAS: Controlling Diverse Robots With a Single Agentic System
Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.
comment: 7 pages in main text + 1 page of appendices + 1 page of references, 5 figures in main text + 1 figure in appendices, 2 tables in main text; source code available at https://github.com/janprz11/robot-agnostic-control
Mindstorms in Natural Language-Based Societies of Mind
Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents -- all communicating through the same universal symbolic language -- are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents, some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions.
comment: published in Computational Visual Media Journal (CVMJ); 9 pages in main text + 7 pages of references + 38 pages of appendices, 14 figures in main text + 13 in appendices, 7 tables in appendices
Systems and Control (EESS)
Towards Polynomial Immersion of Port-Hamiltonian Systems
Port-Hamiltonian (pH) systems offer a highly structured and energy-based modular framework for control systems. Many pH systems exhibit non-polynomial non-linearities. We consider the problem of immersing such systems into a higher-dimensional polynomial representation. We prove that, along system trajectories, important features of the non-polynomial pH system are preserved such as the internal interconnection geometry, the energy balance relation with passivity supply rate, as well as energy dissipation. We illustrate how the lifted system enables the design of stabilizing feedback laws by combining sum-of-squares optimization with concepts from passivity-based control. We draw upon several examples to illustrate our findings.
A gripper for flap separation and opening of sealed bags ICRA2026
Separating thin, flexible layers that must be individually grasped is a common but challenging manipulation primitive for most off-the-shelf grippers. A prominent example arises in clinical settings: the opening of sterile flat pouches for the preparation of the operating room, where the first step is to separate and grasp the flaps. We present a novel gripper design and opening strategy that enables reliable flap separation and robust seal opening. This capability addresses a high-volume repetitive hospital procedure in which nurses manually open up to 240 bags per shift, a physically demanding task linked to musculoskeletal injuries. Our design combines an active dented-roller fingertip with compliant fingers that exploit environmental constraints to robustly grasp thin flexible flaps. Experiments demonstrate that the proposed gripper reliably grasps and separates sealed bag flaps and other thin-layered materials from the hospital, the most sensitive variable affecting performance being the normal force applied. When two copies of the gripper grasp both flaps, the system withstands the forces needed to open the seals robustly. To our knowledge, this is one of the first demonstrations of robotic assistance to automate this repetitive, low-value, but critical hospital task.
comment: 8 pages, Accepted at the 2026 IEEE International Conference on Robotics & Automation (ICRA2026)
The potential and viability of V2G for California BEV drivers
Vehicle-to-Grid (V2G) adoption is hindered by uncertainties regarding its effects on battery lifetime and vehicle usability. These uncertainties are compounded by limited insight into real-world vehicle usage. Here, we leverage real-world Californian BEV usage data to design and evaluate a user-centric V2G strategy. We identified four clustered driver profiles for V2G assessment, ranging from "Daily Chargers" to "Public Chargers". We show that V2G participation is most feasible for "Daily Chargers," and that the effects on battery lifetime depend on calendar aging sensitivity. For batteries with low sensitivity, V2G participation increases capacity loss for all drivers. However, for batteries with high sensitivity, V2G participation can lead to negligible changes in capacity or even improved capacity retention, particularly for drivers who tend to keep their batteries at high states of charge. Our findings enable stakeholders to better assess the potential and viability of V2G adoption.
Distributed Safety Critical Control among Uncontrollable Agents using Reconstructed Control Barrier Functions
This paper investigates the distributed safety critical control for multi-agent systems (MASs) in the presence of uncontrollable agents with uncertain behaviors. To ensure system safety, the control barrier function (CBF) is employed in this paper. However, a key challenge is that the CBF constraints are coupled when MASs perform collaborative tasks, which depend on information from multiple agents and impede the design of a fully distributed safe control scheme. To overcome this, a novel reconstructed CBF approach is proposed. In this method, the coupled CBF is reconstructed by leveraging state estimates of other agents obtained from a distributed adaptive observer. Furthermore, a prescribed performance adaptive parameter is designed to modify this reconstruction, ensuring that satisfying the reconstructed CBF constraint is sufficient to meet the original coupled one. Based on the reconstructed CBF, we design a safety-critical quadratic programming (QP) controller and prove that the proposed distributed control scheme rigorously guarantees the safety of the MAS, even in the uncertain dynamic environments involving uncontrollable agents. The effectiveness of the proposed method is illustrated through simulations.
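For readers unfamiliar with CBF-based quadratic programs, the basic idea can be sketched for one controllable agent avoiding one circular obstacle. This is a generic textbook safety filter, not the paper's reconstructed CBF or its distributed observer; the single-integrator dynamics, the barrier h(x) = ||x - c||^2 - r^2, the gain `alpha`, and the function name `cbf_safety_filter` are all assumptions made for illustration. With a single affine constraint the QP has a closed-form solution, so no solver is needed.

```python
from typing import Tuple

Vec2 = Tuple[float, float]


def cbf_safety_filter(x: Vec2, u_nom: Vec2, center: Vec2,
                      radius: float, alpha: float = 1.0) -> Vec2:
    """Minimally modify u_nom so that grad_h(x) . u >= -alpha * h(x).

    Assumes single-integrator dynamics x_dot = u and the barrier
    h(x) = ||x - center||^2 - radius^2 (h >= 0 means safe).
    Solves min ||u - u_nom||^2 subject to that one constraint.
    """
    dx = (x[0] - center[0], x[1] - center[1])
    h = dx[0] ** 2 + dx[1] ** 2 - radius ** 2
    grad = (2.0 * dx[0], 2.0 * dx[1])          # gradient of h at x
    bound = -alpha * h                          # required lower bound
    slack = grad[0] * u_nom[0] + grad[1] * u_nom[1] - bound
    if slack >= 0.0:
        return (u_nom[0], u_nom[1])             # nominal input is already safe
    norm_sq = grad[0] ** 2 + grad[1] ** 2
    if norm_sq == 0.0:
        raise ValueError("gradient vanishes at the obstacle center")
    scale = -slack / norm_sq                    # projection onto the constraint
    return (u_nom[0] + scale * grad[0], u_nom[1] + scale * grad[1])


# Agent at (2, 0) driving straight at a unit-radius obstacle at the origin.
u_safe = cbf_safety_filter((2.0, 0.0), (-5.0, 0.0), (0.0, 0.0), 1.0)
# Agent moving away: the filter leaves the nominal input untouched.
u_free = cbf_safety_filter((2.0, 0.0), (1.0, 0.0), (0.0, 0.0), 1.0)
```

In the coupled multi-agent setting the abstract describes, the constraint would also depend on other agents' states, which is why estimates of those states are needed before a filter of this kind can run locally.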
Distributed Stability Certification and Control from Local Data
Most data-driven analysis and control methods rely on centralized access to system measurements. In contrast, we consider a setting in which the measurements are distributed across multiple agents and raw data are not shared. Each agent has access only to locally held samples, possibly as little as a single measurement, and agents exchange only locally computed signals. Consequently, no individual agent possesses sufficient information to identify the entire system or synthesize a controller independently. To address this limitation, we develop distributed dynamical algorithms that enable the agents to collectively compute global system certificates from local data. Two problems are addressed. First, for stable linear time-invariant (LTI) systems, the agents compute a Lyapunov certificate by solving the Lyapunov equation in a fully distributed manner. Second, for general LTI systems, they compute the stabilizing solution of the algebraic Riccati equation and hence the optimal linear-quadratic regulator (LQR). An initially proposed scheme guarantees practical convergence, while a subsequent augmented PI-type algorithm achieves exact convergence to the desired solution. We further establish robustness of the resulting LQR controller to uncertainty and measurement noise. The approach is illustrated through distributed Lyapunov certification of a quadruple-tank process and distributed LQR design for helicopter dynamics.
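For reference, the two certificates named in the abstract take the following standard forms. This assumes continuous-time dynamics x' = Ax + Bu; the symbols A, B, Q, R, P, and K are generic textbook notation and may differ from the paper's.

```latex
% Lyapunov equation: for Hurwitz A and Q = Q^T > 0, the solution
% P = P^T > 0 certifies stability through V(x) = x^T P x.
A^{\top} P + P A + Q = 0

% Algebraic Riccati equation: the stabilizing solution P >= 0
% (with R = R^T > 0) yields the optimal LQR feedback u = -K x.
A^{\top} P + P A - P B R^{-1} B^{\top} P + Q = 0,
\qquad K = R^{-1} B^{\top} P
```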
Towards Intelligent Spectrum Management: Spectrum Demand Estimation Using Graph Neural Networks
The growing demand for wireless connectivity, combined with limited spectrum resources, calls for more efficient spectrum management. Spectrum sharing is a promising approach; however, regulators need accurate methods to characterize demand dynamics and guide allocation decisions. This paper builds and validates a spectrum demand proxy from public deployment records and uses a graph attention network in a hierarchical, multi-resolution setup (HR-GAT) to estimate spectrum demand at fine spatial scales. The model captures both neighborhood effects and cross-scale patterns, reducing spatial autocorrelation and improving generalization. Evaluated across five Canadian cities and against eight competitive baselines, HR-GAT reduces median RMSE by roughly 21% relative to the best alternative and lowers residual spatial bias. The resulting demand maps are regulator-accessible and support spectrum sharing and spectrum allocation in wireless networks.
comment: 13 pages, 10 figures. Submitted to IEEE Transactions on Machine Learning in Communications and Networking
AI-Enhanced Spatial Cellular Traffic Demand Prediction with Contextual Clustering and Error Correction for 5G/6G Planning
Accurate spatial prediction of cellular traffic demand is essential for 5G NR capacity planning, network densification, and data-driven 6G planning. Although machine learning can fuse heterogeneous geospatial and socio-economic layers to estimate fine-grained demand maps, spatial autocorrelation can cause neighborhood leakage under naive train/test splits, inflating accuracy and weakening planning reliability. This paper presents an AI-driven framework that reduces leakage and improves spatial generalization via a context-aware two-stage splitting strategy with residual spatial error correction. Experiments using crowdsourced usage indicators across five major Canadian cities show consistent mean absolute error (MAE) reductions relative to location-only clustering, supporting more reliable bandwidth provisioning and evidence-based spectrum planning and sharing assessments.
comment: 5 pages, 8 figures. Submitted to IEEE Wireless Communications Letters
A Control-Theoretic Foundation for Agentic Systems
This paper develops a control-theoretic framework for analyzing agentic systems embedded within feedback control loops. In such systems, an AI agent may adapt controller parameters, select among control strategies, invoke tools, reconfigure decision architectures, or modify control objectives during operation. We formalize these capabilities by interpreting agency as hierarchical decision authority over the control architecture. A unified dynamical representation is introduced that incorporates memory, learning, tool activation, interaction signals, and goal descriptors within a single closed-loop structure. Based on this representation, we define a five-level hierarchy of agency ranging from reactive rule-based control to the synthesis of control objectives and controller architectures. The framework is presented in both nonlinear and linear settings, allowing agentic behaviors to be interpreted using standard control-theoretic constructs such as feedback gains, switching signals, parameter adaptation laws, and quadratic cost functions. The analysis shows that increasing agency introduces dynamical mechanisms including time-varying adaptation, endogenous switching, decision-induced delays, and structural reconfiguration of the control pipeline. This perspective provides a mathematical foundation for analyzing stability, safety, and performance of AI-enabled control systems.
Scaling and Trade-offs in Multi-agent Autonomous Systems
Designing autonomous drone swarms is hampered by a vast design space spanning platform, algorithmic, and numerical-strength choices. We perform large-scale agent-based simulations in three canonical scenarios: swarm-on-swarm battle, cooperative area search with attrition, and pursuit of scattering targets. We demonstrate that dimensional-analysis and data-scaling, established techniques in physical sciences, can be leveraged to collapse performance data onto scaling functions that are mathematically simple, yet counterintuitive and therefore difficult to predict a priori. These scaling laws reveal success-failure boundaries, including sharp break points. Additionally, we show how this technique can be used to quantify trade-offs between agent count and platform parameters such as velocity, sensing or weapon range, and attrition rate. Furthermore, we show the benefits of embedding an optimal path planning loop within this framework, which can qualitatively improve the scaling laws that govern the outcome. The methods we demonstrate are highly flexible and would enable rapid, budget-aware sizing and algorithm selection for large autonomous swarms.
comment: This work has been submitted to the IEEE for possible publication. 15 pages, 12 figures
Parallel-in-Time Nonlinear Optimal Control via GPU-native Sequential Convex Programming
Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic programming algorithms restricts the utilization of massively parallel computing architectures like GPUs. To bridge this gap, we introduce a fully GPU-native trajectory optimization framework that combines sequential convex programming with a consensus-based alternating direction method of multipliers. By applying a temporal splitting strategy, our algorithm decouples the optimization horizon into independent, per-node subproblems that execute massively in parallel. The entire process runs fully on the GPU, eliminating costly memory transfers and large-scale sparse factorizations. This architecture naturally scales to multi-trajectory optimization. We validate the solver on a quadrotor agile flight task and a Mars powered descent problem using an on-board edge computing platform. Benchmarks reveal a sustained 4x throughput speedup and a 51% reduction in energy consumption over a heavily optimized 12-core CPU baseline. Crucially, the framework saturates the hardware, maintaining over 96% active GPU utilization to achieve planning rates exceeding 100 Hz. Furthermore, we demonstrate the solver's extensibility to robust Model Predictive Control by jointly optimizing dynamically coupled scenarios under stochastic disturbances, enabling scalable and safe autonomy.
Dynamic Modeling and Attitude Control of a Reaction-Wheel-Based Low-Gravity Bipedal Hopper
Planetary bodies characterized by low gravitational acceleration, such as the Moon and near-Earth asteroids, impose unique locomotion constraints due to diminished contact forces and extended airborne intervals. Among traversal strategies, hopping locomotion offers high energy efficiency but is prone to mid-flight attitude instability caused by asymmetric thrust generation and uneven terrain interactions. This paper presents an underactuated bipedal hopping robot that employs an internal reaction wheel to regulate body posture during the ballistic flight phase. The system is modeled as a gyrostat, enabling analysis of the dynamic coupling between torso rotation and reaction wheel momentum. The locomotion cycle comprises three phases: a leg-driven propulsive jump, mid-air attitude stabilization via an active momentum exchange controller, and a shock-absorbing landing. A reduced-order model is developed to capture the critical coupling between torso rotation and reaction wheel dynamics. The proposed framework is evaluated in MuJoCo-based simulations under lunar gravity conditions (g = 1.625 m/s^2). Results demonstrate that activation of the reaction wheel controller reduces peak mid-air angular deviation by more than 65% and constrains landing attitude error to within 3.5 degrees at touchdown. Additionally, actuator saturation per hop cycle is reduced, ensuring sufficient control authority. Overall, the approach significantly mitigates in-flight attitude excursions and enables consistent upright landings, providing a practical and control-efficient solution for locomotion on irregular extraterrestrial terrains.
comment: Preprint. Under review
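The momentum exchange behind the flight-phase stabilization can be illustrated with a planar toy calculation. This is only a sketch under simplifying assumptions (single rotation axis, no external torque during flight, wheel speed measured in the inertial frame); the function names and the numerical values are hypothetical and are not taken from the paper. Only the lunar gravity value comes from the abstract.

```python
# Planar flight-phase sketch: with no external torque about the centre of mass,
# the total angular momentum H = I_body * w_body + I_wheel * w_wheel is constant.
LUNAR_G = 1.625  # m/s^2, value quoted in the abstract


def flight_time(v_takeoff: float, g: float = LUNAR_G) -> float:
    """Ballistic flight time for a vertical take-off speed on level ground."""
    return 2.0 * v_takeoff / g


def wheel_speed_to_null_body_rate(i_body, i_wheel, w_body0, w_wheel0=0.0):
    """Absolute wheel speed at which the body rate is zero, conserving H."""
    h_total = i_body * w_body0 + i_wheel * w_wheel0
    return h_total / i_wheel


# Hypothetical inertias (kg m^2) and rates (rad/s), for illustration only.
print(flight_time(1.0))
print(wheel_speed_to_null_body_rate(0.05, 0.001, 0.2))
```

The small wheel inertia relative to the body inertia is why the required wheel speed is large, which is one way to see why actuator saturation matters in such designs.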
Distributed State Estimation of Discrete-Time LTI Systems via Jordan Canonical Representation
In this paper, we address the problem of distributed state estimation for a discrete-time, linear time-invariant system. Building on the framework proposed in [2], we exploit the Jordan canonical form of the system matrix to develop a distributed estimation scheme that ensures the asymptotic convergence of the local state estimates to the true system state. The proposed approach relies on the idea that each node reconstructs the components of the system state that are detectable for it through a local Luenberger observer, while employing a consensus-based strategy to estimate the undetectable components. Necessary and sufficient conditions for the existence of a distributed observer that guarantees asymptotic estimation accuracy are derived. Compared with the previous work [2], the proposed design offers greater flexibility in the selection of the coupling gains and leads to a less restrictive set of conditions for solvability.
comment: Extended version of the conference paper accepted for presentation at the 24th European Control Conference (ECC) in Reykjavík, Iceland
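As a reminder of the local building block mentioned in the abstract, the following sketch simulates a scalar discrete-time Luenberger observer. It covers only the single-node, fully detectable case; the consensus step for undetectable components and the Jordan-form decomposition are not shown. The system parameters and the gain are hypothetical.

```python
def run_observer(a, c, gain, x0, xhat0, steps):
    """Simulate x[k+1] = a*x[k], y[k] = c*x[k] together with the observer
    xhat[k+1] = a*xhat[k] + gain*(y[k] - c*xhat[k]).

    Returns the list of estimation errors x[k] - xhat[k].
    """
    x, xhat, errors = x0, xhat0, []
    for _ in range(steps):
        errors.append(x - xhat)
        y = c * x
        x, xhat = a * x, a * xhat + gain * (y - c * xhat)
    return errors


# Unstable plant (a = 1.2); the error obeys e[k+1] = (a - gain*c) * e[k] = 0.2 * e[k].
errors = run_observer(a=1.2, c=1.0, gain=1.0, x0=1.0, xhat0=0.0, steps=20)
print(errors[:3])
```

The estimate converges even though the state itself diverges, because only the error dynamics `a - gain*c` need to be stable.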
Propagation and Rate-Aware Cell Switching Optimization in HAPS-Assisted Wireless Networks
Cell switching is a promising approach for improving energy efficiency in wireless networks; however, existing studies largely rely on simplified models and energy-centric formulations that overlook key performance-limiting factors. This paper revisits the cell switching concept by redefining its modeling assumptions and mathematical formulation, explicitly incorporating realistic propagation effects such as building entry loss (BEL) and atmospheric losses relevant to non-terrestrial networks (NTN), particularly high-altitude platform station (HAPS). Beyond proposing a new cell switching strategy, the conventional energy-focused problem is reformulated as a multi-objective optimization framework that jointly minimizes power consumption, unconnected users, and data rate degradation. Through this reformulation, the proposed methods ensure that energy-efficient operation is achieved without compromising user connectivity and data rate performance, thereby inherently supporting sustainability objectives for sixth-generation (6G) networks. To solve this reformulated problem, two complementary approaches are employed: the weighted sum method (WSM), which enables a flexible and adaptive weighting mechanism, and the ε-constraint-inspired method (εCM), which converts connectivity and rate-related objectives into constraints within the conventional energy-focused problem. Moreover, unlike prior work relying only on simulations, this study combines system-level simulations with Sionna-OpenAirInterface (OAI) based emulation on a smaller network to validate the proposed cell switching concept under realistic conditions. The results show that, compared to the conventional approach, WSM reduces rate degradation by up to 70% for high-loss indoor users and eliminates the 44% drop for low-loss indoor users.
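The difference between the two scalarizations can be sketched on a toy set of switching configurations. This is a generic illustration of the weighted sum and ε-constraint ideas, not the paper's formulation: the candidate set, the objective values, the normalization, and the function names are all hypothetical.

```python
# Each candidate configuration maps to (power_W, unconnected_users, rate_loss).
# All numbers are hypothetical and serve only to illustrate the two methods.
CANDIDATES = {
    "all_on": (1000.0, 0, 0.00),
    "half_off": (600.0, 2, 0.10),
    "most_off": (300.0, 10, 0.40),
}


def weighted_sum_choice(candidates, weights):
    """Pick the configuration minimizing a weighted sum of normalized objectives."""
    maxima = [max(v[i] for v in candidates.values()) or 1.0 for i in range(3)]

    def score(values):
        return sum(w * v / m for w, v, m in zip(weights, values, maxima))

    return min(candidates, key=lambda name: score(candidates[name]))


def epsilon_constraint_choice(candidates, max_unconnected, max_rate_loss):
    """Minimize power subject to bounds on the other two objectives."""
    feasible = {
        name: v
        for name, v in candidates.items()
        if v[1] <= max_unconnected and v[2] <= max_rate_loss
    }
    if not feasible:
        return None  # no configuration satisfies the bounds
    return min(feasible, key=lambda name: feasible[name][0])


print(weighted_sum_choice(CANDIDATES, (0.5, 0.25, 0.25)))
print(epsilon_constraint_choice(CANDIDATES, max_unconnected=0, max_rate_loss=0.0))
```

With weights placed entirely on power, the weighted sum falls back to the energy-only choice, while tightening the ε bounds forces the constrained variant toward keeping more cells active.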
Quantization Robustness of Monotone Operator Equilibrium Networks
Monotone operator equilibrium networks are implicit-layer models whose output is the unique equilibrium of a monotone operator, guaranteeing existence, uniqueness, and convergence. When deployed on low-precision hardware, weights are quantized, potentially destroying these guarantees. We analyze weight quantization as a spectral perturbation of the underlying monotone inclusion. Convergence of the quantized solver is guaranteed whenever the spectral-norm weight perturbation is smaller than the monotonicity margin; the displacement between quantized and full-precision equilibria is bounded in terms of the perturbation size and margin; and a condition number characterizing the ratio of the operator norm to the margin links quantization precision to forward error. MNIST experiments confirm a phase transition at the predicted threshold: three- and four-bit post-training quantization diverge, while five-bit and above converge. The backward-pass guarantee enables quantization-aware training, which recovers provable convergence at four bits.
comment: 6 pages, 4 figures. Submitted to IEEE Control Systems Letters (L-CSS)
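The perturbation-versus-margin condition can be checked conservatively without computing a spectral norm, since the Frobenius norm upper-bounds it. The sketch below assumes a simple uniform min-max quantizer applied directly to a weight matrix and a given monotonicity margin; the quantizer, the matrix, and the margin value are hypothetical and do not reproduce the paper's parameterization or experiments.

```python
import math


def quantize_uniform(weights, bits):
    """Round each entry to the nearest of 2**bits evenly spaced levels
    spanning the range [min entry, max entry]."""
    flat = [w for row in weights for w in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in weights]
    step = (hi - lo) / (2 ** bits - 1)
    return [[lo + round((w - lo) / step) * step for w in row] for row in weights]


def frobenius_norm_of_difference(a, b):
    """Frobenius norm of a - b for two equally sized nested lists."""
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


def passes_margin_check(weights, bits, margin):
    """Sufficient check: ||W - Wq||_F < margin implies ||W - Wq||_2 < margin."""
    quantized = quantize_uniform(weights, bits)
    return frobenius_norm_of_difference(weights, quantized) < margin


W = [[0.3, -0.7], [0.45, 0.1]]  # hypothetical weights
print(passes_margin_check(W, bits=8, margin=0.1))
print(passes_margin_check(W, bits=1, margin=0.1))
```

Because the Frobenius bound is conservative, a failed check here does not by itself imply divergence; it only means this sufficient test is inconclusive.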
Suppressing Acoustomigration and Temperature Rise for High-power Robust Acoustics
High-frequency acoustic wave transducers, vibrating at gigahertz (GHz), favored for their compact size, are not only dominating the front-end of mobile handsets but are also expanding into various interdisciplinary fields, including quantum acoustics, acoustic-optics, acoustic-fluids, acoustoelectric, and sustainable power conversion systems. However, just as strong vibration can "shake off" substances and produce heat, a long-standing bottleneck has been the ability to harness acoustics under high-power vibration loads, while simultaneously suppressing temperature rise, especially for IDT-based surface acoustic wave (SAW) systems. Here, we propose a layered acoustic wave (LAW) platform, utilizing a quasi-infinite multifunctional top layer, that redefines mechanical and thermal boundary conditions to overcome three fundamental challenges in high-power acoustic wave vibration: self-heating, thermal instability, and acoustomigration. By leveraging a simplified, thick single-material overlayer to achieve electro-thermo-mechanical co-design, this acoustic platform moves beyond prior substrate-focused thermal management in SAW technology. It demonstrates, for the first time from the top boundary, simultaneous redistribution of the von Mises stress field and the creation of an efficient vertical thermal dissipation path. The LAW transducer, vibrating at over 2 GHz, achieves a 70% reduction in temperature rise under identical power loads, a first-order temperature coefficient of frequency (TCF) of -13 ppm/°C with minimal dispersion, and an unprecedented threshold power density of 45.61 dBm/mm^2, over one order of magnitude higher than that of state-of-the-art thin-film surface acoustic wave (TF-SAW) counterparts at the same wavelength.
comment: Main text with supplementary information
World Model for Battery Degradation Prediction Under Non-Stationary Aging
Degradation prognosis for lithium-ion cells requires forecasting the state-of-health (SOH) trajectory over future cycles. Existing data-driven approaches can produce trajectory outputs through direct regression, but lack a mechanism to propagate degradation dynamics forward in time. This paper formulates battery degradation prognosis as a world model problem, encoding raw voltage, current, and temperature time-series from each cycle into a latent state and propagating it forward via a learned dynamics transition to produce a future trajectory spanning 80 cycles. To investigate whether electrochemical knowledge improves the learned dynamics, a Single Particle Model (SPM) constraint is incorporated into the training loss. Three configurations are evaluated on the Severson LiFePO4 (LFP) dataset of 138 cells. Iterative rollout halves the trajectory forecast error compared to direct regression from the same encoder. The SPM constraint improves prediction at the degradation knee where the resistance to SOH relationship is most applicable, without changing aggregate accuracy.
comment: 18 pages, 3 figures
Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation
Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a "Precision-Reasoning Gap" in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process--combining cross-validation and spatial disambiguation--to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline's 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.
comment: 7 pages, 4 figures, 3 tables
Simulation-in-the-Reasoning (SiR): A Conceptual Framework for Empirically Grounded AI in Autonomous Transportation
Large Language Models (LLMs) have advanced reasoning through techniques like Chain-of-Thought (CoT). However, their reasoning largely remains textual and hypothetical, lacking empirical grounding in complex, dynamic domains like transportation. This paper introduces Simulation-in-the-Reasoning (SiR), a novel conceptual framework that embeds domain-specific simulators directly into the LLM reasoning loop. By treating intermediate reasoning steps as executable simulation experiments, SiR transforms LLM reasoning from narrative plausibility into a falsifiable, hypothesis-simulate-analyze workflow. We discuss applications in which an LLM can formulate Intelligent Transport System (ITS) strategy hypotheses, invoke a traffic simulator via the Model Context Protocol (MCP), evaluate results under different demand patterns, and refine strategies through verification and aggregation. While implementing the framework is part of our ongoing work, this paper primarily establishes the conceptual foundation, discusses design considerations like API granularity, and outlines the vision of SiR as a cornerstone for interactive transportation digital twins. We argue that SiR represents a critical step towards trustworthy, empirically-validated AI for autonomous transportation systems.
Inverse Learning-Based Output Feedback Control of Nonlinear Systems with Verifiable Guarantees
In this paper, we present a data-driven output feedback controller for nonlinear systems that achieves practical output regulation, using noise-free input/output measurement data. The proposed controller is based on (i) an inverse model of the system identified via kernel interpolation, which maps a desired output and the current state to the corresponding desired control input; and (ii) a data-driven reference selection framework that actively chooses a suitable desired output from the dataset which has been used for the identification. We establish a verifiable sufficient condition on the dataset under which the proposed controller guarantees practical output regulation. Numerical simulations demonstrate the effectiveness of the proposed controller, with additional evaluations in the presence of output measurement noise to assess its robustness empirically.
comment: 17 pages, 5 figures
Reference Architecture of a Quantum-Centric Supercomputer
Quantum computers have demonstrated utility in simulating quantum systems beyond brute-force classical approaches. As the community builds on these demonstrations to explore using quantum computing for applied research, algorithms and workflows have emerged that require leveraging both quantum computers and classical high-performance computing (HPC) systems to scale applications, especially in chemistry and materials, beyond what either system can simulate alone. Today, these disparate systems operate in isolation, forcing users to manually orchestrate workloads, coordinate job scheduling, and transfer data between systems -- a cumbersome process that hinders productivity and severely limits rapid algorithmic exploration. These challenges motivate the need for flexible and high-performance Quantum-Centric Supercomputing (QCSC) systems that integrate Quantum Processing Units (QPUs), Graphics Processing Units (GPUs), and Central Processing Units (CPUs) to accelerate discovery of such algorithms across applications. These systems will be co-designed across quantum and classical HPC infrastructure, middleware, and application layers to accelerate the adoption of quantum computing for solving critical computational problems. We envision QCSC evolution through three distinct phases: (1) quantum systems as specialized compute offload engines within existing HPC complexes; (2) heterogeneous quantum and classical HPC systems coupled through advanced middleware, enabling seamless execution of hybrid quantum-classical algorithms; and (3) fully co-designed heterogeneous quantum-HPC systems for hybrid computational workflows. This article presents a reference architecture and roadmap for these QCSC systems.
comment: 19 pages, 5 figures
Contractivity of Multi-Stage Runge-Kutta Dynamics
Many control, optimization, and learning algorithms rely on discretizations of continuous-time contracting systems, where preservation of contractivity under numerical integration is key for stability, robustness, and reliable fixed-point computation. In this paper, we establish conditions under which multi-stage Runge-Kutta methods preserve strong contractivity when discretizing infinitesimally contractive continuous-time systems. For explicit Runge-Kutta methods, preservation conditions are derived by bounding Lipschitz constants of the associated composite stage mappings, leading to coefficient-dependent criteria. For implicit methods, the algebraic structure of the stage equations enables explicit conditions on the Runge-Kutta coefficients that guarantee preservation of strong contractivity. In the implicit case, these results extend classical guarantees, typically limited to weak contractivity in the Euclidean metric, to strong contractivity with respect to the $\ell_1$-, $\ell_2$-, and $\ell_\infty$-norms. In addition, we study well-definedness of implicit methods through an auxiliary continuous-time system associated with the stage equations. We show that strong infinitesimal contractivity of this auxiliary system is sufficient to guarantee unique solvability of the stage equations. This analysis generalizes standard well-definedness conditions and provides a dynamic implementation approach that avoids direct solution of the implicit algebraic equations.
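The simplest instance of the question studied here is forward Euler (a one-stage explicit Runge-Kutta method) applied to a linear system x' = Ax. The sketch below compares the induced l_inf norm of the one-step map I + hA with the l_inf matrix measure of A; for step sizes with 1 + h*a_ii >= 0 on every row, the two are related by ||I + hA|| = 1 + h*mu(A). The matrix and step sizes are hypothetical, and the paper's multi-stage, nonlinear conditions are not reproduced.

```python
def mu_inf(a):
    """Matrix measure (logarithmic norm) induced by the l_inf norm."""
    n = len(a)
    return max(
        a[i][i] + sum(abs(a[i][j]) for j in range(n) if j != i) for i in range(n)
    )


def euler_map_norm_inf(a, h):
    """Induced l_inf norm of I + h*A, the forward Euler one-step map for x' = A x."""
    n = len(a)
    return max(
        sum(abs((1.0 if i == j else 0.0) + h * a[i][j]) for j in range(n))
        for i in range(n)
    )


A = [[-3.0, 1.0], [0.5, -2.0]]  # hypothetical system with mu_inf(A) < 0

for h in (0.1, 1.0):
    norm = euler_map_norm_inf(A, h)
    print(h, norm, "contractive" if norm < 1.0 else "not contractive")
```

The second step size shows that a contracting continuous-time system can lose contractivity under discretization when the step is too large, which is the kind of failure the coefficient-dependent criteria are meant to rule out.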
Distributed Kalman--Consensus Filtering with Adaptive Uncertainty Weighting for Multi-Object Tracking in Mobile Robot Networks
This paper presents an implementation and evaluation of a Distributed Kalman--Consensus Filter (DKCF) for Multi-Object Tracking (MOT) in mobile robot networks operating under partial observability and heterogeneous localization uncertainty. A key challenge in such systems is the fusion of information from agents with differing localization quality, where frame misalignment can lead to inconsistent estimates, track duplication, and ghost tracks. To address this issue, we build upon the MOTLEE framework and retain its frame-alignment methodology, which uses consistently tracked dynamic objects as transient landmarks to improve relative pose estimates between robots. On top of this framework, we propose an uncertainty-aware adaptive consensus weighting mechanism that dynamically adjusts the influence of neighbor information based on the covariance of the transmitted estimates, thereby reducing the impact of unreliable data during distributed fusion. Local tracking is performed using a Kalman Filter (KF) with a Constant Velocity Model (CVM) and Global Nearest Neighbor (GNN) data association. Simulation results demonstrate that adaptive weighting effectively protects local estimates from inconsistent data, yielding a MOTA improvement of 0.09 for agents suffering from localization drift, although system performance remains constrained by communication latency.
comment: Presented at ICARA 2026. To appear in the IEEE conference proceedings
Conduction-Diffusion in N-Dimensional settings as irreversible port-Hamiltonian systems
This work extends previous 1D irreversible port-Hamiltonian system (IPHS) formulations to boundary-controlled ND distributed parameter systems describing conduction-diffusion fluid phenomena. Within a unified and thermodynamically consistent framework, we show that conduction and diffusion can be represented through a single coherent structure that preserves global energy balance and ensures a correct characterization of entropy production. The resulting formulation provides a foundation for the systematic modeling and control of complex multi-physical processes governed by coupled transport mechanisms in N dimensions. In the longer term, this framework opens the door to structure-preserving numerical schemes capable of enforcing thermodynamic principles directly at the discretized level.
Multi-Robot Multitask Gaussian Process Estimation and Coverage
Coverage control is essential for the optimal deployment of agents to monitor or cover areas with sensory demands. While traditional coverage involves single-task robots, increasing autonomy now enables multitask operations. This paper introduces a novel multitask coverage problem and addresses it for both the cases of known and unknown sensory demands. For known demands, we design a federated multitask coverage algorithm and establish its convergence properties. For unknown demands, we employ a multitask Gaussian Process (GP) framework to learn sensory demand functions and integrate it with the multitask coverage algorithm to develop an adaptive algorithm. We introduce a novel notion of multitask coverage regret that compares the performance of the adaptive algorithm against an oracle with prior knowledge of the demand functions. We establish that our algorithm achieves sublinear cumulative regret, and numerically illustrate its performance.
Irreversible Port-Hamiltonian Formulations for 1-Dimensional fluid systems
The Irreversible Port-Hamiltonian Systems (IPHS) framework is extended to the modelling of non-isentropic fluids with viscous dissipation in the Eulerian description. Building on earlier IPHS formulations for diffusion-driven and non-convective distributed systems, it is shown that convective transport can be consistently encompassed by the framework by modifying the underlying differential operators. After revisiting the constitutive relations of non-isentropic fluids in both Eulerian and Lagrangian coordinates, it is demonstrated how these systems fit within an extended IPHS formulation. Furthermore, an extended parametrisation of the boundary port variables, which ensures that the first and second laws of Thermodynamics are fulfilled, allows the definition of a general class of boundary controlled IPHS.
Metaheuristic algorithm parameters selection for building an optimal hierarchical structure of a control system: a case study
Metaheuristic algorithms are currently widely used to solve a variety of optimization problems across various industries. This article discusses the application of a metaheuristic algorithm to optimize the hierarchical architecture of an industrial distributed control system. The success of the algorithm depends largely on the choice of starting conditions and algorithm parameters. We examine the impact of parameter selection on the convergence of a modified ant colony algorithm and provide recommendations for tuning the algorithm to achieve optimal results for a specific industrial problem. The findings presented in this article can also be applied to other combinatorial optimization problems.
System-Theoretic Analysis of Dynamic Generalized Nash Equilibria -- Turnpikes and Dissipativity
Generalized Nash equilibria are used in multi-agent control applications to model strategic interactions between agents that are coupled in the cost, dynamics, and constraints, and provide the foundations for game-theoretic MPC (Receding Horizon Games). We study properties of finite-horizon dynamic GNE trajectories from a system-theoretic perspective. We show how strict dissipativity generates the turnpike phenomenon in GNE solutions. Moreover, we establish a converse turnpike result, i.e., the implication from turnpike to strict dissipativity. We derive conditions under which the steady-state GNE is the optimal operating point and, using a game value function, we give a local characterization of the geometry of storage functions. Finally, we design linear terminal penalties that ensure dynamic GNE trajectories applied in open-loop converge to and remain at the steady-state GNE. These connections provide the foundation for future system-theoretic analysis of GNEs similar to those existing in optimal control as well as for recursive feasibility and closed-loop stability results of game-theoretic MPC.
LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy
Offline safe reinforcement learning (RL) is increasingly important for cyber-physical systems (CPS), where safety violations during training are unacceptable and only pre-collected data are available. Existing offline safe RL methods typically balance reward-safety tradeoffs through constraint relaxation or joint optimization, but they often lack structural mechanisms to prevent safety drift. We propose LexiSafe, a lexicographic offline RL framework designed to preserve safety-aligned behavior. We first develop LexiSafe-SC, a single-cost formulation for standard offline safe RL, and derive safety-violation and performance-suboptimality bounds that together yield sample-complexity guarantees. We then extend the framework to hierarchical safety requirements with LexiSafe-MC, which supports multiple safety costs and admits its own sample-complexity analysis. Empirically, LexiSafe demonstrates reduced safety violations and improved task performance compared to constrained offline baselines. By unifying lexicographic prioritization with structural bias, LexiSafe offers a practical and theoretically grounded approach for safety-critical CPS decision-making.
comment: 17th ACM/IEEE International Conference on Cyber-Physical Systems
Universal Dynamics with Globally Controlled Analog Quantum Simulators
Analog quantum simulators with global control fields have emerged as powerful platforms for exploring complex quantum phenomena. Despite these advances, a fundamental theoretical question remains unresolved: to what extent can such systems realize universal quantum dynamics under global control? Here we establish a necessary and sufficient condition for universal quantum computation using only global pulse control, proving that a broad class of analog quantum simulators is, in fact, universal. We further extend this framework to fermionic and bosonic systems, including modern platforms such as ultracold atoms in optical superlattices. Moreover, we observe that analog simulators driven by random global pulses exhibit information scrambling comparable to random unitary circuits. In a dual-species neutral-atom array setup, the measurement outcomes anti-concentrate on a $\log N$ timescale despite the presence of only temporal randomness, opening opportunities for efficient randomness generation. To bridge theoretical possibility with experimental reality, we introduce \emph{direct quantum optimal control}, a control framework that enables the synthesis of complex effective Hamiltonians while incorporating realistic hardware constraints. Using this approach, we experimentally engineer three-body interactions outside the blockade regime and demonstrate topological dynamics on a Rydberg-atom array. Experimental measurements reveal dynamical signatures of symmetry-protected-topological edge modes, confirming both the expressivity and feasibility of our method. Our work opens a new avenue for quantum simulation beyond native hardware Hamiltonians, enabling the engineering of effective multi-body interactions and advancing the frontier of quantum information processing with globally-controlled analog platforms.
comment: The updated version adds new applications and discussions on information scrambling with globally controlled analog quantum systems. 11 pages, 6 figures with Methods. HYH, AMG, and LC contributed equally to this work. Updated acknowledgement and references
Simplifying Preference Elicitation in Local Energy Markets: Combinatorial Clock Exchange
As distributed energy resources (DERs) proliferate, future power systems will need new market platforms enabling prosumers to trade various electricity and grid-support products. However, prosumers often exhibit complex, product-interdependent preferences and face limited cognitive and computational resources, hindering engagement with complex market structures and bid formats. We address this challenge by introducing a multi-product market that allows prosumers to express complex preferences through an intuitive format, by fusing combinatorial clock exchange and machine learning (ML) techniques. The iterative mechanism only requires prosumers to report their preferred package of products at posted prices, eliminating the need for forecasting product prices or adhering to complex bid formats, while the ML-aided price discovery speeds up convergence. The linear pricing rule further enhances transparency and interpretability. Finally, numerical simulations demonstrate convergence to clearing prices in approximately 15 clock iterations.
comment: Accepted for presentation at Power Systems Computation Conference 2026
Design and Quantitative Evaluation of an Embedded EEG Instrumentation Platform for Real-Time SSVEP Decoding
This paper presents an embedded EEG instrumentation platform for real-time steady-state visually evoked potential (SSVEP) decoding based on an ESP32-S3 microcontroller and an ADS1299 analog front end. The system performs $8$-channel EEG acquisition, zero-phase bandpass filtering, and canonical correlation analysis entirely on-device, while supporting wireless communication and closed-loop operation without external computation. A central contribution is the quantitative characterization of the platform's measurement integrity. Reported results demonstrate a stable shorted-input noise floor ($\approx 0.08~\mu\text{V}_{\text{RMS}}$), tightly bounded sampling jitter ($0.56~\mu\text{s}$ standard deviation), and negligible long-term drift ($< 1~\text{ppm}$). Numerical fidelity analysis shows $100\%$ decision agreement between the mixed-precision embedded pipeline and a $64$-bit double-precision reference. Effective common-mode attenuation exceeded $112~\text{dB}$ under balanced conditions, with a localized $26.9~\text{dB}$ degradation observed under source-impedance mismatch. Closed-loop validation achieved $99.17\%$ online accuracy and an information transfer rate of $27.66~\text{bits/min}$. These results position the proposed system as a quantitatively characterized embedded EEG measurement and processing platform for real-time SSVEP decoding.
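Information transfer rate in BCI work is commonly computed with the Wolpaw formula, which combines the number of targets, the selection accuracy, and the time per selection. The sketch below implements that standard formula; whether the paper uses exactly this definition is an assumption, and the number of targets and trial duration in the example are hypothetical because the abstract does not state them.

```python
import math


def itr_bits_per_min(n_targets: int, accuracy: float, trial_seconds: float) -> float:
    """Wolpaw information transfer rate in bits per minute.

    bits/selection = log2(N) + P*log2(P) + (1 - P)*log2((1 - P) / (N - 1))
    """
    if n_targets < 2 or not (0.0 < accuracy <= 1.0) or trial_seconds <= 0.0:
        raise ValueError("need n_targets >= 2, 0 < accuracy <= 1, trial_seconds > 0")
    bits = math.log2(n_targets)
    if accuracy < 1.0:  # at P = 1 both correction terms vanish in the limit
        bits += accuracy * math.log2(accuracy)
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_targets - 1))
    return bits * 60.0 / trial_seconds


# Hypothetical setup: 4 targets, perfect accuracy, 4 s per selection.
print(itr_bits_per_min(4, 1.0, 4.0))
```

At chance-level accuracy (P = 1/N) the formula yields zero bits, which is a quick sanity check on any implementation.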
Response time central-limit and failure rate estimation for stationary periodic rate monotonic real-time systems
Real-time systems consist of a set of tasks, a scheduling policy, and a system architecture, all constrained by timing requirements. Many everyday embedded systems, within devices such as airplanes, cars, trains, and space probes, operate as real-time systems. To ensure safe failure rates, response times (the time required for the execution of a task) must be bounded. Rate Monotonic real-time systems prioritize tasks according to their arrival rate. This paper focuses on the use of the central limit of response times built in \cite{zagalo2022} and an approximation of their distribution with an inverse Gaussian mixture distribution. The distribution parameters and their associated failure rates are estimated through a suitable re-parameterization of the inverse Gaussian distribution and an adapted Expectation-Maximization algorithm. Extensive simulations demonstrate that the method is well-suited for the approximation of failure rates. We discuss the extension of this method to a chi-squared independence test adapted to real-time systems.
comment: submitted to IEEE Journal
Security-Constrained Substation Reconfiguration Considering Busbar and Coupler Contingencies
Substation reconfiguration via busbar splitting can mitigate transmission grid congestion and reduce operational costs. However, existing approaches neglect the security of substation topology, particularly for substations without busbar splitting (i.e., closed couplers), which can lead to severe consequences. Additionally, the computational complexity of optimizing substation topology remains a challenge. This paper introduces a MILP formulation for security-constrained substation reconfiguration (SC-SR), considering N-1 line, coupler and busbar contingencies to ensure secure substation topology. To efficiently solve this problem, we propose a heuristic approach with multiple master problems (HMMP). A central master problem optimizes dispatch, while independent substation master problems determine individual substation topologies in parallel. Linear AC power flow equations ensure PF accuracy, while feasibility and optimality sub-problems evaluate contingency cases. The proposed HMMP significantly reduces computational complexity and enables scalability to large-scale power systems. Case studies on the IEEE 14-bus, 118-bus, and PEGASE 1354-bus systems show the effectiveness of the approach in mitigating the impact of coupler and busbar tripping and in balancing system security and cost, as well as its computational efficiency.
Modular Control of Discrete Event System for Modeling and Mitigating Power System Cascading Failures
Cascading failures in power systems caused by sequential tripping of components are a serious concern as they can lead to complete or partial shutdowns, disrupting vital services and causing damage and inconvenience. In prior work, we developed a new approach for identifying and preventing cascading failures in power systems. The approach uses the supervisory control technique of discrete event systems (DES) by incorporating both on-line lookahead control and forcible events. In this paper, we use modular supervisory control of DES to reduce computational complexity and increase the robustness and reliability of control. Modular supervisory control allows us to predict and mitigate cascading failures in power systems more effectively. We implemented the proposed control technique on a simulation platform developed in MATLAB and applied the proposed DES controller. The calculations of modular supervisory control of DES are performed using an external tool and imported into the MATLAB platform. We conduct simulation studies for the IEEE 30-bus, 118-bus and 300-bus systems, and the results demonstrate the effectiveness of our proposed approach.
Robust targeted exploration for systems with non-stochastic disturbances
We propose a novel targeted exploration strategy designed specifically for uncertain linear time-invariant systems with energy-bounded disturbances, i.e., without any assumptions on the distribution of the disturbances. We use classical results characterising the set of non-falsified parameters consistent with energy-bounded disturbances. We derive a semidefinite program which computes an exploration strategy that guarantees a desired accuracy of the parameter estimate. This design is based on sufficient conditions on the spectral content of the exploration data that robustly account for initial parametric uncertainty. Finally, we highlight the applicability of the exploration strategy through a numerical example involving a nonlinear system.
comment: Submitted to Automatica
Customized Interior-Point Methods Solver for Embedded Real-Time Convex Optimization
This paper presents a customized second-order cone programming (SOCP) solver tailored for embedded real-time optimization, which frequently arises in modern guidance and control (G&C) applications. The solver employs a practically efficient predictor-corrector type primal-dual interior-point method (PDIPM) combined with a homogeneous embedding framework for infeasibility detection. Unlike conventional homogeneous self-dual embedding formulations, the adopted approach can directly handle quadratic cost functions without requiring problem reformulation. This capability allows the solver to directly address quadratic objective SOCP problems, while avoiding unnecessary performance degradation caused by the loss of sparsity due to problem reformulation. To support a systematic workflow, we also develop a code generation tool that analyzes the sparsity pattern of the problem to be solved and generates customized solver code using a predefined code template. The generated solver code is written in C with no external dependencies other than the standard library math.h, and it supports complete static allocation of all data. Additionally, it provides parsing information to facilitate the use of the solver by end users. Finally, benchmark and numerical experiments on an embedded platform demonstrate that the developed solver outperforms the existing solvers on problem scales typical of G&C applications.
comment: Accepted for publication in IEEE Transactions on Aerospace and Electronic Systems
Event-Based Control via Sparsity-Promoting Regularization: A Rollout Approach with Performance Guarantees
This paper presents a controller design framework aiming to balance control performance and actuation rate. Control performance is evaluated by an infinite-horizon average cost, and the number of control actions is penalized via sparsity-promoting regularization. Since the formulated optimal control problem has a combinatorial nature, we employ a rollout algorithm to obtain a tractable suboptimal solution. In the proposed scheme, actuation timings are determined through a multistage minimization procedure based on a receding-horizon approach, and the corresponding control inputs are computed online. We establish theoretical performance guarantees with respect to periodic control and prove the stability of the closed-loop system. The effectiveness of the proposed method is demonstrated through a numerical example.
comment: 15 pages
Contractor-Expander and Universal Inverse Optimal Positive Nonlinear Control
For general control-affine nonlinear systems in the positive orthant, and with positive controls, we show how strict CLFs can be utilized for inverse optimal stabilization. Conventional ``LgV'' inverse optimal feedback laws, for systems with unconstrained states and controls, assume sign-unconstrained inputs and input penalties that are class-K in the input magnitude, hence symmetric about zero. Such techniques do not extend to positive-state-and-control systems. Major customizations are needed, and introduced in this paper, for positive systems where highly asymmetric (or unconventionally symmetric) costs not only on the state but also on control are necessary. With the predator-prey positive-state positive-input benchmark system as inspiration, using a strict CLF built in our previous paper, we prototype two general inverse optimal methodological frameworks that employ particular ``contractor and expander functions.'' One framework (A) employs a triple consisting of a CLF, a stabilizing feedback, and an expander, whereas the other framework (B) employs a pair of a CLF and a contractor function. Both frameworks yield inverse optimal stabilizer constructions, on positive orthants of arbitrary dimensions. A stronger construction results from a stronger CLF condition. Biological interpretation for the predator-prey model illuminates that such inverse optimal control constructions are bio-ecologically meaningful. In addition to general frameworks, we present two fully explicit designs: two Sontag-like universal formulae for stabilization of positive-orthant systems by positive feedback, one of them with inverse optimality.
Robust control synthesis for uncertain linear systems with input saturation using mixed IQCs
This paper develops a robust control synthesis method for uncertain linear systems with input saturation in the framework of integral quadratic constraints (IQCs). The system is reformulated as a linear fractional representation (LFR) that captures both dead-zone nonlinearity and time-varying uncertainties. By combining mixed IQC-based dissipation inequalities with quadratic Lyapunov functions, sufficient conditions for robust stabilization are established. Compared with conventional approaches based on a single static sector condition for the dead-zone nonlinearity, the proposed method yields improved $\mathcal{L}_2$-gain performance through the use of scaled mixed IQCs. For systems subject to time-varying structured uncertainties, a new scaled bounded real lemma is further developed based on the IQC characterization. The resulting $\mathcal{H}_\infty$ synthesis conditions are expressed as linear matrix inequalities (LMIs), which are numerically tractable in all decision variables, including the scaling factors in the IQC multipliers. The proposed method is validated using a second-order uncertain system in linear fractional form, and its superiority over an anti-windup design is further illustrated by a cart-pendulum example.
Formation Control via Rotation Symmetry Constraints
This work introduces a distributed formation control strategy for multi-agent systems based solely on rotation symmetry constraints. We propose a potential function that enforces inter-agent \textbf{rotational} symmetries, whose gradient defines a control law that drives the agents toward a desired planar symmetric configuration. We show that only $n-1$ edges (the minimal connectivity requirement) are sufficient to implement the strategy, where $n$ is the number of agents. We further augment the design to address the \textbf{maneuvering problem}, enabling the formation to undergo coordinated translations, rotations, and scaling along a predefined virtual trajectory. Simulation examples are provided to validate the effectiveness of the proposed method.
Enhancing Sample Efficiency in Multi-Agent RL with Uncertainty Quantification and Selective Exploration
Multi-agent reinforcement learning (MARL) methods have achieved state-of-the-art results on a range of multi-agent tasks. Yet, MARL algorithms typically require significantly more environment interactions than their single-agent counterparts to converge, a problem exacerbated by the difficulty in exploring over a large joint action space and the high variance intrinsic to MARL environments. To tackle these issues, we propose a novel algorithm that combines a decomposed centralized critic with decentralized ensemble learning, incorporating several key contributions. The main component in our scheme is a selective exploration method that leverages ensemble kurtosis. We extend the global decomposed critic with a diversity-regularized ensemble of individual critics and utilize its excess kurtosis to guide exploration toward high-uncertainty states and actions. To improve sample efficiency, we train the centralized critic with a novel truncated variation of the TD($λ$) algorithm, enabling efficient off-policy learning with reduced variance. On the actor side, our suggested algorithm adapts the mixed samples approach to MARL, mixing on-policy and off-policy loss functions for training the actors. This approach balances between stability and efficiency and outperforms purely off-policy learning. The evaluation shows our method outperforms state-of-the-art baselines on standard MARL benchmarks, including a variety of SMAC II maps.
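The role of ensemble kurtosis in exploration can be illustrated with a minimal sketch. It assumes a hypothetical setup in which each action has a list of value estimates from the ensemble members, and the excess kurtosis of that list is added as a bonus to the mean estimate. The names `excess_kurtosis`, `pick_action`, and the weight `beta` are illustrative assumptions, not the paper's algorithm.

```python
from statistics import fmean


def excess_kurtosis(values):
    """Population excess kurtosis m4 / m2**2 - 3 (zero for a Gaussian)."""
    if len(values) < 2:
        raise ValueError("need at least two ensemble members")
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    m4 = fmean([(v - mu) ** 4 for v in values])
    if m2 == 0.0:
        # Assumption: members in full agreement give no uncertainty signal.
        return 0.0
    return m4 / (m2 * m2) - 3.0


def pick_action(q_ensemble, beta=0.1):
    """q_ensemble[a] holds the critic estimates for action a.

    Hypothetical rule: mean estimate plus a kurtosis bonus, so actions on
    which a few critics disagree strongly are explored more often.
    """
    def score(a):
        qs = q_ensemble[a]
        return fmean(qs) + beta * excess_kurtosis(qs)

    return max(range(len(q_ensemble)), key=score)
```

A heavy-tailed set of estimates (one critic far from the rest) yields positive excess kurtosis, while a symmetric two-point disagreement yields a negative value, so the bonus distinguishes outlier-driven disagreement from evenly spread disagreement.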
Distributed Koopman Learning using Partial Trajectories for Control
This paper proposes a distributed data-driven framework for dynamics learning, termed distributed deep Koopman learning using partial trajectories (DDKL-PT). In this framework, each agent in a multi-agent system is assigned a partial trajectory offline and locally approximates the unknown dynamics using a deep neural network within the Koopman operator framework. By exchanging local estimated dynamics rather than training data, agents achieve consensus on a global dynamics model without sharing their private training trajectories. Simulation studies on a surface vehicle demonstrate that DDKL-PT achieves consensus on the learned dynamics, and each agent attains reasonably small approximation errors on the testing dataset. Furthermore, a model predictive control scheme is developed by integrating the learned Koopman dynamics with known kinematic relations. Results on a reference-tracking task indicate that the distributedly learned dynamics are sufficiently accurate for model-based optimal control.
Towards xApp Conflict Evaluation with Explainable Machine Learning and Causal Inference in O-RAN
The Open Radio Access Network (O-RAN) architecture enables a flexible, vendor-neutral deployment of 5G networks by disaggregating base station components and supporting third-party xApps for near real-time RAN control. However, the concurrent operation of multiple xApps can lead to conflicting control actions, which may cause network performance degradation. In this work, we propose a framework for xApp conflict management that combines explainable machine learning and causal inference to evaluate the causal relationships between RAN Control Parameters (RCPs) and Key Performance Indicators (KPIs). We use model explainability tools such as SHAP to identify RCPs that jointly affect the same KPI, signaling potential conflicts, and represent these interactions as a causal Directed Acyclic Graph (DAG). We then estimate the causal impact of each of these RCPs on their associated KPIs using metrics such as Average Treatment Effect (ATE) and Conditional Average Treatment Effect (CATE). This approach offers network operators guided insights into identifying conflicts and quantifying their impacts, enabling more informed and effective conflict resolution strategies across diverse xApp deployments.
Robotics
OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty
Robust stereo visual-inertial odometry (VIO) remains challenging in low-texture scenes and under abrupt illumination changes, where point features become sparse and unstable, leading to ambiguous association and under-constrained estimation. Line structures offer complementary geometric cues, yet many efficient point-line systems still rely on point-guided line association, which can break down when point support is weak and may lead to biased constraints. We present a stereo point-line VIO system in which line segments are equipped with dedicated deep descriptors and matched using an entropy-regularized optimal transport formulation, enabling globally consistent correspondences under ambiguity, outliers, and partial observations. The proposed descriptor is training-free and is computed by sampling and pooling network feature maps. To improve estimation stability, we analyze the impact of line measurement noise and introduce reliability-adaptive weighting to regulate the influence of line constraints during optimization. Experiments on EuRoC and UMA-VI, together with real-world deployments in low-texture and illumination-challenging environments, demonstrate improved accuracy and robustness over representative baselines while maintaining real-time performance.
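Entropy-regularized optimal transport for matching can be sketched with plain Sinkhorn iterations. The example below assumes uniform marginals and a cost matrix of descriptor distances scaled to roughly [0, 1]. The function name `sinkhorn` and the parameters `eps` and `iters` are illustrative, and the handling of outliers and partial observations described in the abstract is not modeled here.

```python
import math


def sinkhorn(cost, eps=0.1, iters=200):
    """Entropy-regularized OT between uniform marginals.

    cost[i][j] is the matching cost between line i in one image and line j
    in the other (assumed scaled so exp(-cost / eps) does not underflow).
    Returns the transport plan as a nested list.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal
    b = [1.0 / m] * m  # target marginal
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternate row and column scaling to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Because the plan is computed jointly over all candidates, each assignment is influenced by every other row and column, which is what makes the correspondences globally consistent rather than greedy nearest-neighbor matches.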
A Generalized Voronoi Graph based Coverage Control Approach for Non-Convex Environment
To address the challenge of efficient coverage by multi-robot systems in non-convex regions with multiple obstacles, this paper proposes a coverage control method based on the Generalized Voronoi Graph (GVG), which has two phases: a Load-Balancing Algorithm phase and a Collaborative Coverage phase. In the Load-Balancing Algorithm phase, the non-convex region is partitioned into multiple sub-regions based on the GVG. In addition, a weighted load-balancing algorithm is developed, which considers the quality differences among sub-regions. By iteratively optimizing the robot allocation ratio, the number of robots in each sub-region is matched with the sub-region quality to achieve load balance. In the Collaborative Coverage phase, each robot is controlled by a new controller to effectively cover the region. The convergence of the method is proved and its performance is evaluated through simulations.
comment: 8 pages, 7 figures, published to ACC 2026
Towards Terrain-Aware Safe Locomotion for Quadrupedal Robots Using Proprioceptive Sensing
Achieving safe quadrupedal locomotion in real-world environments has attracted much attention in recent years. When walking over uneven terrain, achieving reliable estimation and realising safety-critical control based on the obtained information is still an open question. To address this challenge, especially for low-cost robots equipped solely with proprioceptive sensors (e.g., IMUs, joint encoders, and contact force sensors), this work first presents an estimation framework that generates a 2.5-D terrain map and extracts support plane parameters, which are then integrated into contact and state estimation. Then, we integrate this estimation framework into a safety-critical control pipeline by formulating control barrier functions that provide rigorous safety guarantees. Experiments demonstrate that the proposed terrain estimation method provides smooth terrain representations. Moreover, the coupled estimation framework of terrain, state, and contact reduces the mean absolute error of base position estimation by 64.8%, decreases the estimation variance by 47.2%, and improves the robustness of contact estimation compared to a decoupled framework. The terrain-informed CBFs integrate historical terrain information and current proprioceptive measurements to ensure global safety by keeping the robot out of hazardous areas and local safety by preventing body-terrain collision, relying solely on proprioceptive sensing.
comment: 8 pages, 10 figures
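How a control barrier function keeps a robot out of a hazardous region can be shown on a toy problem. The sketch assumes a one-dimensional single integrator (position x, commanded velocity u) and a barrier h(x) = x_max - x. The condition dh/dt >= -alpha * h then reduces to u <= alpha * (x_max - x), so the safety filter has a closed form. All names and values are illustrative assumptions and are unrelated to the paper's terrain-informed CBFs.

```python
def cbf_filter(u_nom, x, x_max, alpha=1.0):
    """Minimally modify u_nom so that h(x) = x_max - x stays non-negative.

    For x' = u the CBF condition -u >= -alpha * h gives u <= alpha * h,
    so the closest safe input to u_nom is a simple clamp.
    """
    h = x_max - x
    return min(u_nom, alpha * h)


def simulate(x0=0.0, x_max=1.0, u_nom=2.0, dt=0.01, steps=1000):
    """Forward-Euler rollout with the filter applied; returns visited positions."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * cbf_filter(u_nom, x, x_max)
        trajectory.append(x)
    return trajectory
```

With these values the nominal command always pushes toward the boundary, and the filtered trajectory approaches x_max without crossing it.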
SCDP: Learning Humanoid Locomotion from Partial Observations via Mixed-Observation Distillation
Distilling humanoid locomotion control from offline datasets into deployable policies remains a challenge, as existing methods rely on privileged full-body states that require complex and often unreliable state estimation. We present Sensor-Conditioned Diffusion Policies (SCDP), which enable humanoid locomotion using only onboard sensors, eliminating the need for explicit state estimation. SCDP decouples sensing from supervision through mixed-observation training: the diffusion model conditions on sensor histories while being supervised to predict privileged future state-action trajectories, forcing the model to infer the motion dynamics under partial observability. We further develop restricted denoising, context distribution alignment, and context-aware attention masking to encourage implicit state estimation within the model and to prevent train-deploy mismatch. We validate SCDP on velocity-commanded locomotion and motion reference tracking tasks. In simulation, SCDP achieves near-perfect success on velocity control (99-100%) and 93% tracking success on the AMASS test set, performing comparably to privileged baselines while using only onboard sensors. Finally, we deploy the trained policy on a real G1 humanoid at 50 Hz, demonstrating robust real-robot locomotion without external sensing or state estimation.
comment: 6 pages, 8 figures, 5 tables, IROS
ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly
Precision assembly requires sub-millimeter corrections in contact-rich "last-millimeter" regions where visual feedback fails due to occlusion from the end-effector and workpiece. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three synergistic mechanisms: (i) bidirectional cross-attention enabling reciprocal visuo-tactile feature enhancement before fusion, (ii) a proprioception-conditioned gating network that dynamically elevates tactile reliance when visual occlusion occurs, and (iii) a tactile reconstruction objective enforcing learning of manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baseline methods, and maintains 80% success at industrial-grade 0.1mm clearance. Ablation studies validate that each architectural component is indispensable. The ReTac-ACT codebase and a vision-tactile demonstration dataset covering various clearance levels with both visual and tactile features will be released to support reproducible research.
Trajectory Optimization for Self-Wrap-Aware Cable-Towed Planar Object Manipulation under Implicit Tension Constraints
Cable/rope elements are pervasive in deformable-object manipulation, often serving as a deformable force-transmission medium whose routing and contact determine how wrenches are delivered. In cable-towed manipulation, transmission is unilateral and hybrid: the tether can pull only when taut and becomes force-free when slack; in practice, the tether may also contact the object boundary and self-wrap around edges, which is not merely collision avoidance but a change of the wrench transmission channel by shifting the effective application point and moment arm, thereby coupling routing geometry with rigid-body motion and tensioning. We formulate self-wrap towing as a routing-aware, tensioning-implicit trajectory optimization (TITO) problem that couples (i) a tensioning-implicit taut/slack constraint and (ii) routing-conditioned transmission maps for effective length and wrench, and we build a relaxation hierarchy from a strict mode-conditioned reference to three tractable relaxations: Full-Mode Relaxation (FMR), Binary-Mode Relaxation (BMR), and Implicit-Mode Relaxation (IMR). Across planar towing tasks, we find that making routing an explicit decision often yields conservative solutions that stay near switching boundaries, whereas IMR induces self-wrap through state evolution and exploits the redirected torque channel whenever turning requires it.
On the Cost of Evolving Task Specialization in Multi-Robot Systems
Task specialization can lead to simpler robot behaviors and higher efficiency in multi-robot systems. Previous works have shown the emergence of task specialization during evolutionary optimization, focusing on feasibility rather than costs. In this study, we take first steps toward a cost-benefit analysis of task specialization in robot swarms using a foraging scenario. We evolve artificial neural networks as generalist behaviors for the entire task and as task-specialist behaviors for subtasks within a limited evaluation budget. We show that generalist behaviors can be successfully optimized while the evolved task-specialist controllers fail to cooperate efficiently, resulting in worse performance than the generalists. Consequently, task specialization does not necessarily improve efficiency when optimization budget is limited.
comment: Accepted for publication in the proceedings of ANTS 2026 - 15th International Conference on Swarm Intelligence
NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models
Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework via online reinforcement learning (RL). It introduces a symbolic encoder to embed vision and language features and extract structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation via expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency and expanded exploration space. Our code is available.
Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks
The high cost of collecting real-robot data has made robotic simulation a scalable platform for both evaluation and data generation. Yet most existing benchmarks concentrate on simple manipulation tasks such as pick-and-place, failing to capture the non-Markovian characteristics of real-world tasks and the complexity of articulated object interactions. To address this limitation, we present RuleSafe, a new articulated manipulation benchmark built upon a scalable LLM-aided simulation framework. RuleSafe features safes with diverse unlocking mechanisms, such as key locks, password locks, and logic locks, which require different multi-stage reasoning and manipulation strategies. These LLM-generated rules produce non-Markovian and long-horizon tasks that require temporal modeling and memory-based reasoning. We further propose VQ-Memory, a compact and structured temporal representation that uses vector-quantized variational autoencoders (VQ-VAEs) to encode past proprioceptive states into discrete latent tokens. This representation filters low-level noise while preserving high-level task-phase context, providing lightweight yet robust temporal cues that are compatible with existing Vision-Language-Action models (VLA). Extensive experiments on state-of-the-art VLA models and diffusion policies show that VQ-Memory consistently improves long-horizon planning, enhances generalization to unseen configurations, and enables more efficient manipulation with reduced computational cost. Project page: vqmemory.github.io
comment: 9 pages
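The vector-quantization step at the heart of a VQ-VAE can be sketched as a nearest-codebook lookup. The example assumes an already-learned codebook and quantizes raw state vectors directly, whereas a VQ-VAE would quantize encoder outputs. Collapsing consecutive repeated tokens is an added assumption, included only to illustrate how discrete tokens can suppress low-level jitter while keeping phase changes. All function names are hypothetical.

```python
def quantize(state, codebook):
    """Return the index of the codebook vector nearest to state (squared L2)."""
    def dist2(code):
        return sum((s - c) ** 2 for s, c in zip(state, code))

    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))


def encode_history(states, codebook):
    """Map a sequence of proprioceptive states to discrete tokens."""
    return [quantize(s, codebook) for s in states]


def collapse_repeats(tokens):
    """Assumed post-processing: keep a token only when it differs from the last."""
    collapsed = []
    for t in tokens:
        if not collapsed or collapsed[-1] != t:
            collapsed.append(t)
    return collapsed
```

Small perturbations of a state map to the same token, so the token sequence changes only when the state moves to a different region of the codebook.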
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation CVPR 2026
Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present \textit{Context-Nav} that elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers -- guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
comment: Camera-ready version. Accepted to CVPR 2026
StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving
Vision Language Models (VLMs) bridge visual perception and linguistic reasoning. In Autonomous Driving (AD), this synergy has enabled Vision Language Action (VLA) models, which translate high-level multimodal understanding into driving behaviors, typically represented as future trajectories. However, existing VLA models mainly generate generic collision-free trajectories. Beyond collision avoidance, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized driving. Moreover, many methods treat trajectory generation as naive token prediction, which can produce kinematically infeasible actions. To address these limitations, we present StyleVLA, a physics-informed VLA framework for generating diverse and physically plausible driving behaviors. We introduce a hybrid loss that combines a kinematic consistency constraint with a continuous regression head to improve trajectory feasibility. To train StyleVLA, built on Qwen3-VL-4B, we construct a large-scale instruction dataset with over 1.2k scenarios, 76k Bird's Eye View (BEV) samples, and 42k First Person View (FPV) samples, with ground-truth trajectories for five driving styles and natural-language instructions. Experiments show that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA models. Using a composite driving score measuring success rate, physical feasibility, and style adherence, StyleVLA achieves 0.55 on BEV and 0.51 on FPV, versus 0.32 and 0.35 for Gemini-3-Pro. These results show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.
comment: 8 pages
Receptogenesis in a Vascularized Robotic Embodiment
Equipping robotic systems with the capacity to generate $\textit{ex novo}$ hardware during operation extends control of physical adaptability. Unlike modular systems that rely on discrete component integration pre- or post-deployment, we envision the possibility that physical adaptation and development emerge from dynamic material restructuring to shape the body's intrinsic functions. Drawing inspiration from circulatory systems that redistribute mass and function in biological organisms, we utilize fluidics to restructure the material interface, a capability currently unpaired in robotics. Here, we realize this synthetic growth capability through a vascularized robotic composite designed for programmable material synthesis, demonstrated via receptogenesis - the on-demand construction of sensors from internal fluid reserves based on environmental cues. By coordinating the fluidic transport of precursors with external localized UV irradiation, we drive an $\textit{in situ}$ photopolymerization that chemically reconstructs the vasculature from the inside out. This reaction converts precursors with photolatent initiator into a solid dispersion of UV-sensitive polypyrrole, establishing a sensing modality validated by a characteristic decrease in electrical impedance. The newly synthesized sensor closed a control loop to regulate wing flapping in a moth-inspired robotic demonstrator. This physical update increased the robot's capability in real time. This work establishes a materials-based framework for constitutive evolution, enabling robots to physically grow the hardware needed to support emerging behaviors in a complex environment; for example, suggesting a pathway toward autonomous systems capable of generating specialized features, such as neurovascular systems in situated robotics.
comment: Supplementary Files currently unavailable online. Please contact the First Author to request any Supplementary Files
SEA-Nav: Efficient Policy Learning for Safe and Agile Quadruped Navigation in Cluttered Environments
Efficiently training quadruped robot navigation in densely cluttered environments remains a significant challenge. Existing methods are either limited by a lack of safety and agility in simple obstacle distributions or suffer from slow locomotion in complex environments, often requiring excessively long training phases. To this end, we propose SEA-Nav (Safe, Efficient, and Agile Navigation), a reinforcement learning framework for quadruped navigation. Within diverse and dense obstacle environments, a differentiable control barrier function (CBF)-based shield constrains the navigation policy to output safe velocity commands. An adaptive collision replay mechanism and hazardous exploration rewards are introduced to increase the probability of learning from critical experiences, guiding efficient exploration and exploitation. Finally, kinematic action constraints are incorporated to ensure safe velocity commands, facilitating successful physical deployment. To the best of our knowledge, this is the first approach that achieves highly challenging quadruped navigation in the real world with minute-level training time.
comment: Project website: https://11chens.github.io/sea-nav/
Stein Variational Ergodic Surface Coverage with SE(3) Constraints
Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods demonstrate success in coverage tasks, while struggling with point-cloud targets due to the nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our proposed approach comprises multiple innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates, and, third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that our method achieves superior coverage quality with tractable computation in our setting relative to existing TO and SAO approaches, and is validated in real-world robot experiments.
Open-World Motion Forecasting
Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume a fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes, which are then processed by a vision-language model to filter inconsistent and over-confident predictions. In parallel, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at https://omen.cs.uni-freiburg.de .
From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation
Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
comment: https://sites.google.com/view/flow2one, 8 pages
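The bi-directional Chamfer distance mentioned in the abstract above can be illustrated with a small sketch. The squared Euclidean metric, the per-direction averaging, and the toy "teacher" and "student" action sets are assumptions for illustration; the paper's exact formulation may differ.

```python
def nearest_sq_dist(point, others):
    """Squared Euclidean distance from point to its nearest neighbor in others."""
    return min(
        sum((p - q) ** 2 for p, q in zip(point, other)) for other in others
    )

def chamfer_distance(set_a, set_b):
    """Bi-directional Chamfer distance between two sets of same-dimension vectors."""
    # Direction A -> B: every element of A should have a close match in B (coverage).
    a_to_b = sum(nearest_sq_dist(a, set_b) for a in set_a) / len(set_a)
    # Direction B -> A: every element of B should lie near some element of A (fidelity).
    b_to_a = sum(nearest_sq_dist(b, set_a) for b in set_b) / len(set_b)
    return a_to_b + b_to_a

# Hypothetical teacher samples with two distinct action modes.
teacher = [[0.0, 0.0], [1.0, 1.0]]
covering_student = [[0.0, 0.0], [1.0, 1.0]]    # reproduces both modes
averaged_student = [[0.5, 0.5], [0.5, 0.5]]    # collapses to the mean
mode_dropping_student = [[0.0, 0.0], [0.0, 0.0]]  # keeps only one mode

d_cover = chamfer_distance(teacher, covering_student)
d_avg = chamfer_distance(teacher, averaged_student)
d_drop = chamfer_distance(teacher, mode_dropping_student)
```

In this toy case the averaged student is penalized in both directions, while the mode-dropping student is penalized only in the teacher-to-student direction, which is why a set-level objective needs both terms to encourage coverage and fidelity together.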
Vision-Augmented On-Track System Identification for Autonomous Racing via Attention-Based Priors and Iterative Neural Correction
Operating autonomous vehicles at the absolute limits of handling requires precise, real-time identification of highly non-linear tire dynamics. However, traditional online optimization methods suffer from "cold-start" initialization failures and struggle to model high-frequency transient dynamics. To address these bottlenecks, this paper proposes a novel vision-augmented, iterative system identification framework. First, a lightweight CNN (MobileNetV3) translates visual road textures into a continuous heuristic friction prior, providing a robust "warm-start" for parameter optimization. Next, an S4 model captures complex temporal dynamic residuals, circumventing the memory and latency limitations of traditional MLPs and RNNs. Finally, a derivative-free Nelder-Mead algorithm iteratively extracts physically interpretable Pacejka tire parameters via a hybrid virtual simulation. Co-simulation in CarSim demonstrates that the lightweight vision backbone reduces friction estimation error by 76.1% using 85% fewer FLOPs, accelerating cold-start convergence by 71.4%. Furthermore, the S4-augmented framework improves parameter extraction accuracy and decreases lateral force RMSE by over 60% by effectively capturing complex vehicle dynamics, demonstrating superior performance compared to conventional neural architectures.
SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space
Offline-to-online reinforcement learning (RL) offers a promising paradigm for robotics by pre-training policies on safe, offline demonstrations and fine-tuning them via online interaction. However, a fundamental challenge remains: how to safely explore online without deviating from the behavioral support of the offline data? While recent methods leverage conditional variational autoencoders (CVAEs) to bound exploration within a latent space, they inherently suffer from an exploitation gap -- a performance ceiling imposed by the decoder's reconstruction loss. We introduce SPAARS, a curriculum learning framework that initially constrains exploration to the low-dimensional latent manifold for sample-efficient, safe behavioral improvement, then seamlessly transfers control to the raw action space, bypassing the decoder bottleneck. SPAARS has two instantiations: the CVAE-based variant requires only unordered (s,a) pairs and no trajectory segmentation; SPAARS-SUPE pairs SPAARS with OPAL temporal skill pretraining for stronger exploration structure at the cost of requiring trajectory chunks. We prove an upper bound on the exploitation gap using the Performance Difference Lemma, establish that latent-space policy gradients achieve provable variance reduction over raw-space exploration, and show that concurrent behavioral cloning during the latent phase directly controls curriculum transition stability. Empirically, SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 versus 0.75 for SUPE, with 5x better sample efficiency; standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2 respectively, surpassing IQL baselines of 66.3 and 78.3 respectively, confirming the utility of the unordered-pair CVAE instantiation.
comment: 9 pages
NLiPsCalib: An Efficient Calibration Framework for High-Fidelity 3D Reconstruction of Curved Visuotactile Sensors ICRA 2026
Recent advances in visuotactile sensors increasingly employ biomimetic curved surfaces to enhance sensorimotor capabilities. Although such curved visuotactile sensors enable more conformal object contact, their perceptual quality is often degraded by non-uniform illumination, which reduces reconstruction accuracy and typically necessitates calibration. Existing calibration methods commonly rely on customized indenters and specialized devices to collect large-scale photometric data, but these processes are expensive and labor-intensive. To overcome these calibration challenges, we present NLiPsCalib, a physics-consistent and efficient calibration framework for curved visuotactile sensors. NLiPsCalib integrates controllable near-field light sources and leverages Near-Light Photometric Stereo (NLiPs) to estimate contact geometry, simplifying calibration to just a few simple contacts with everyday objects. We further introduce NLiPsTac, a controllable-light-source tactile sensor developed to validate our framework. Experimental results demonstrate that our approach enables high-fidelity 3D reconstruction across diverse curved form factors with a simple calibration procedure. We emphasize that our approach lowers the barrier to developing customized visuotactile sensors of diverse geometries, thereby making visuotactile sensing more accessible to the broader community.
comment: 8 pages, 8 figures, accepted to 2026 IEEE International Conference on Robotics & Automation (ICRA 2026)
CORAL: Scalable Multi-Task Robot Learning via LoRA Experts
Deploying Vision-Language-Action (VLA) models in real-world robotics exposes a core multi-task learning challenge: task interference. When multiple tasks are jointly fine-tuned in a single stage, gradients from different tasks can conflict, causing negative transfer and reducing per-task performance. Yet maintaining a separate full checkpoint per task is often storage- and deployment-prohibitive. To address this dilemma, we present CORAL, a backbone- and embodiment-agnostic framework designed primarily to mitigate multi-task interference while remaining naturally extensible to a continuous stream of new tasks. CORAL freezes a single pre-trained VLA backbone and attaches one lightweight Low-Rank Adaptation (LoRA) expert per task; at runtime, a dynamic inference engine (the CORAL Manager) routes language instructions to the appropriate expert and swaps experts on the fly with zero inference overhead. This strict parameter isolation avoids complex gating networks and prevents parameter-level cross-task interference by construction; as an added capability, it also enables sequentially introducing new tasks without the parameter overwriting that causes catastrophic forgetting. We validate CORAL on a real-world Galaxea R1 dual-arm mobile manipulator and three simulation benchmarks (LIBERO, WidowX, Google Robot), where CORAL overcomes fine-grained instructional ambiguity and substantially outperforms joint training, yielding a practical and scalable system for lifelong multi-task robot learning. Website: https://frontierrobo.github.io/CORAL
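The one-expert-per-task idea can be sketched as follows. This is an illustration under stated assumptions, not the CORAL Manager itself: the class names, the keyword-based routing rule, and the tiny 2x2 weight matrix are all hypothetical, and the abstract does not specify how instructions are matched to experts.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LoRAExpert:
    """Low-rank update: delta(x) = scale * up @ (down @ x)."""
    def __init__(self, down, up, scale=1.0):
        self.down = down    # shape (rank, d_in)
        self.up = up        # shape (d_out, rank)
        self.scale = scale

    def delta(self, x):
        return [self.scale * v for v in matvec(self.up, matvec(self.down, x))]

class ExpertManager:
    """Hypothetical manager: one frozen weight shared by all per-task experts."""
    def __init__(self, frozen_weight):
        self.frozen_weight = frozen_weight
        self.experts = {}

    def register(self, task_name, expert):
        # Adding a task never modifies the frozen weight or other experts.
        self.experts[task_name] = expert

    def route(self, instruction):
        # Assumed routing rule: first registered task name found in the instruction.
        text = instruction.lower()
        for task_name in self.experts:
            if task_name in text:
                return task_name
        raise KeyError("no expert registered for: " + instruction)

    def forward(self, instruction, x):
        expert = self.experts[self.route(instruction)]
        base = matvec(self.frozen_weight, x)
        return [b + d for b, d in zip(base, expert.delta(x))]

manager = ExpertManager(frozen_weight=[[1.0, 0.0], [0.0, 1.0]])
manager.register("pick", LoRAExpert(down=[[1.0, 0.0]], up=[[1.0], [0.0]]))
manager.register("push", LoRAExpert(down=[[0.0, 1.0]], up=[[0.0], [2.0]]))

pick_out = manager.forward("Pick up the cup", [1.0, 1.0])
push_out = manager.forward("Push the box left", [1.0, 1.0])
```

Because each task only ever touches its own low-rank matrices, registering a new expert leaves previously learned behavior unchanged, which is the isolation property the abstract relies on.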
See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation CVPR
Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.
comment: Suggested to CVPR Findings. https://tingjundai.github.io/SPRVLA/
Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos CVPR 2025
Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates reconstruction failures, and unlocks large portions of previously unusable video data. Comprehensive experiments across multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) demonstrate that our method not only sets new state-of-the-art performance but also enables the development of robust zero-shot navigation agents. By bridging large-scale web videos with implicit spatial reasoning, this work advances embodied navigation towards more scalable, generalizable, and real-world applicable solutions.
comment: Extension of CVPR 2025 RoomTour3D with implicit geometric representations
RAE-NWM: Navigation World Model in Dense Visual Representation Space
Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.
comment: Code is available at: https://github.com/20robo/raenwm
MO-Playground: Massively Parallelized Multi-Objective Reinforcement Learning for Robotics
Multi-objective reinforcement learning (MORL) is a powerful tool to learn Pareto-optimal policy families across conflicting objectives. However, unlike traditional RL algorithms, existing MORL algorithms do not effectively leverage large-scale parallelization to concurrently simulate thousands of environments, resulting in vastly increased computation time. Ultimately, this has limited MORL's application towards complex multi-objective robotics problems. To address these challenges, we present 1) MORLAX, a new GPU-native, fast MORL algorithm, and 2) MO-Playground, a pip-installable playground of GPU-accelerated multi-objective environments. Together, MORLAX and MO-Playground approximate Pareto sets within minutes, offering 25-270x speed-ups compared to legacy CPU-based approaches whilst achieving superior Pareto front hypervolumes. We demonstrate the versatility of our approach by implementing a custom BRUCE humanoid robot environment using MO-Playground and learning Pareto-optimal locomotion policies across 6 realistic objectives for BRUCE, such as smoothness, efficiency and arm swinging.
comment: 8 pages, 4 figures, 3 tables
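The Pareto front hypervolume used as a quality metric above can be illustrated for the two-objective case. This sketch assumes both objectives are maximized and that a reference point is given; the function names and the sample points are hypothetical and unrelated to the MORLAX implementation.

```python
def pareto_front(points):
    """Return the non-dominated points, sorted by the first objective (both maximized)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))

def hypervolume_2d(points, reference):
    """Area dominated by the Pareto front and bounded below by the reference point."""
    volume = 0.0
    prev_x = reference[0]
    # On a 2D front sorted by ascending first objective, the second objective descends,
    # so each point contributes one vertical strip of the dominated region.
    for x, y in pareto_front(points):
        if x > prev_x and y > reference[1]:
            volume += (x - prev_x) * (y - reference[1])
            prev_x = x
    return volume

# Hypothetical policy returns on two objectives, e.g. (speed, efficiency).
returns = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
front = pareto_front(returns)
hv = hypervolume_2d(returns, reference=(0.0, 0.0))
```

A larger hypervolume means the policy family covers more of the trade-off space, so the metric rewards both better individual policies and a wider spread along the front.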
TRIP-Bag: A Portable Teleoperation System for Plug-and-Play Robotic Arms and Leaders
Collecting large-scale, diverse demonstration data for manipulation tasks remains a major challenge in learning-based robot policies. Existing in-the-wild data collection approaches often rely on vision-based pose estimation of hand-held grippers or gloves, which introduces an embodiment gap between the collection platform and the target robot. Teleoperation systems eliminate the embodiment gap, but are typically impractical to deploy outside the laboratory environment. We propose TRIP-Bag (Teleoperation, Recording, Intelligence in a Portable Bag), a portable, puppeteer-style teleoperation system fully contained within a commercial suitcase, as a practical solution for collecting high-fidelity manipulation data across varied settings. With a setup time of under five minutes and direct joint-to-joint teleoperation, TRIP-Bag enables rapid and reliable data collection in any environment. We validated TRIP-Bag's usability through experiments with non-expert users, showing that the system is intuitive and easy to operate. Furthermore, we confirmed the quality of the collected data by training benchmark manipulation policies, demonstrating its value as a practical resource for robot learning.
Embodied Human Simulation for Quantitative Design and Analysis of Interactive Robotics
Physical interactive robotics, ranging from wearable devices to collaborative humanoid robots, require close coordination between mechanical design and control. However, evaluating interactive dynamics is challenging due to complex human biomechanics and motor responses. Traditional experiments rely on indirect metrics without measuring human internal states, such as muscle forces or joint loads. To address this issue, we develop a scalable simulation-based framework for the quantitative analysis of physical human-robot interaction. At its core is a full-body musculoskeletal model serving as a predictive surrogate for the human dynamical system. Driven by a reinforcement learning controller, it generates adaptive, physiologically grounded motor behaviors. We employ a sequential training pipeline where the pre-trained human motion control policy acts as a consistent evaluator, making large-scale design space exploration computationally tractable. By simulating the coupled human-robot system, the framework provides access to internal biomechanical metrics, offering a systematic way to concurrently co-optimize a robot's structural parameters and control policy. We demonstrate its capability in optimizing human-exoskeleton interactions, showing improved joint alignment and reduced contact forces. This work establishes embodied human simulation as a scalable paradigm for interactive robotics design.
WESPR: Wind-adaptive Energy-Efficient Safe Perception & Planning for Robust Flight with Quadrotors
Local wind conditions strongly influence drone performance: headwinds increase flight time, crosswinds and wind shear hinder agility in cluttered spaces, while tailwinds reduce travel time. Although adaptive controllers can mitigate turbulence, they remain unaware of the surrounding geometry that generates it, preventing proactive avoidance. Existing methods that model how wind interacts with the environment typically rely on computationally expensive fluid dynamics simulations, limiting real-time adaptation to new environments and conditions. To bridge this gap, we present WESPR, a fast framework that predicts how environmental geometry affects local wind conditions, enabling proactive path planning and control adaptation. Our lightweight pipeline integrates geometric perception and local weather data to estimate wind fields, compute cost-efficient paths, and adjust control strategies, all within 10 seconds. We validate WESPR on a Crazyflie drone navigating turbulent obstacle courses. Our results show a 12.5-58.7% reduction in maximum trajectory deviation and a 24.6% improvement in stability compared to a wind-agnostic adaptive controller.
comment: 8 pages, 9 Figures
Robust Spatiotemporal Motion Planning for Multi-Agent Autonomous Racing via Topological Gap Identification and Accelerated MPC
High-speed multi-agent autonomous racing demands robust spatiotemporal planning and precise control under strict computational limits. Current methods often oversimplify interactions or abandon strict kinematic constraints. We resolve this by proposing a Topological Gap Identification and Accelerated MPC framework. By predicting opponent behaviors via SGPs, our method constructs dynamic occupancy corridors to robustly select optimal overtaking gaps. We ensure strict kinematic feasibility using a Linear Time-Varying MPC powered by a customized Pseudo-Transient Continuation (PTC) solver for high-frequency execution. Experimental results on the F1TENTH platform show that our method significantly outperforms state-of-the-art baselines: it reduces total maneuver time by 51.6% in sequential scenarios, consistently maintains an overtaking success rate exceeding 81% in dense bottlenecks, and lowers average computational latency by 20.3%, pushing the boundaries of safe and high-speed autonomous racing.
STONE Dataset: A Scalable Multi-Modal Surround-View 3D Traversability Dataset for Off-Road Robot Navigation
Reliable off-road navigation requires accurate estimation of traversable regions and robust perception under diverse terrain and sensing conditions. However, existing datasets lack both scalability and multi-modality, which limits progress in 3D traversability prediction. In this work, we introduce STONE, a large-scale multi-modal dataset for off-road navigation. STONE provides (1) trajectory-guided 3D traversability maps generated by a fully automated, annotation-free pipeline, and (2) comprehensive surround-view sensing with synchronized 128-channel LiDAR, six RGB cameras, and three 4D imaging radars. The dataset covers a wide range of environments and conditions, including day and night, grasslands, farmlands, construction sites, and lakes. Our auto-labeling pipeline reconstructs dense terrain surfaces from LiDAR scans, extracts geometric attributes such as slope, elevation, and roughness, and assigns traversability labels beyond the robot's trajectory using a Mahalanobis-distance-based criterion. This design enables scalable, geometry-aware ground-truth construction without manual annotation. Finally, we establish a benchmark for voxel-level 3D traversability prediction and provide strong baselines under both single-modal and multi-modal settings. STONE is available at: https://konyul.github.io/STONE-dataset/
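A Mahalanobis-distance-based labeling criterion of the kind described above can be sketched as follows. This is an assumption-laden illustration: it uses a diagonal covariance for simplicity (a full covariance matrix is the more general form), a hypothetical threshold of 3.0, and made-up (slope, roughness) values for cells along the driven trajectory.

```python
import math

def fit_diagonal_gaussian(samples):
    """Per-feature mean and variance of the terrain cells the robot actually drove over."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [
        sum((s[d] - means[d]) ** 2 for s in samples) / n for d in range(dims)
    ]
    return means, variances

def mahalanobis_diagonal(x, means, variances, eps=1e-9):
    """Mahalanobis distance under a diagonal covariance (eps guards zero variance)."""
    return math.sqrt(
        sum((xi - m) ** 2 / (v + eps) for xi, m, v in zip(x, means, variances))
    )

def label_traversable(cell, means, variances, threshold=3.0):
    """A cell is labeled traversable if it resembles the driven terrain closely enough."""
    return mahalanobis_diagonal(cell, means, variances) <= threshold

# Hypothetical (slope in degrees, roughness) features of cells on the trajectory.
driven_cells = [(2.0, 0.01), (4.0, 0.02), (3.0, 0.03), (5.0, 0.02), (1.0, 0.02)]
means, variances = fit_diagonal_gaussian(driven_cells)

gentle_cell = (3.5, 0.02)   # off-trajectory cell similar to driven terrain
steep_cell = (25.0, 0.2)    # off-trajectory cell far from driven terrain
gentle_ok = label_traversable(gentle_cell, means, variances)
steep_ok = label_traversable(steep_cell, means, variances)
```

The appeal of this kind of criterion is that labels extend beyond the robot's own path without manual annotation: any cell whose geometry is statistically close to terrain the robot has already crossed inherits the traversable label.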
ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video
Achieving versatile and naturalistic whole-body control for humanoid robot scene-interaction remains a significant challenge. While some recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and expensive teleoperation data collection, lacking the versatility to execute more human-like natural behaviors such as sitting or kicking. Furthermore, acquiring the necessary real robot teleoperation data is prohibitively expensive and time-consuming. To address these limitations, we introduce ZeroWBC, a novel framework that learns a natural humanoid visuomotor control policy directly from human egocentric videos, eliminating the need for large-scale robot teleoperation data and enabling natural humanoid robot scene-interaction control. Specifically, our approach first fine-tunes a Vision-Language Model (VLM) to predict future whole-body human motions based on text instructions and egocentric visual context, then these generated motions are retargeted to real robot joints and executed via our robust general motion tracking policy for humanoid whole-body control. Extensive experiments on the Unitree G1 humanoid robot demonstrate that our method outperforms baseline approaches in motion naturalness and versatility, successfully establishing a pipeline that eliminates teleoperation data collection overhead for whole-body humanoid control, offering a scalable and efficient paradigm for general humanoid whole-body control.
SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation
Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav utilizes this single spatial token to explicitly inject spatial cues into action reasoning through an end-to-end framework. Leveraging multi-task co-training, SPAN-Nav captures task-adaptive cues from generalized spatial priors, enabling robust spatial awareness to generalize even to tasks lacking explicit spatial supervision. To support comprehensive spatial learning, we present a massive dataset of 4.2 million occupancy annotations that covers both indoor and outdoor scenes across multi-type navigation tasks. SPAN-Nav achieves state-of-the-art performance across three benchmarks spanning diverse scenarios and varied navigation tasks. Finally, real-world experiments validate the robust generalization and practical reliability of our approach across complex physical scenarios.
Walking on Rough Terrain with Any Number of Legs
Robotics would gain by replicating the remarkable agility of arthropods in navigating complex environments. Here we consider the control of multi-legged systems which have 6 or more legs. Current multi-legged control strategies in robots include large black-box machine learning models, Central Pattern Generator (CPG) networks, and open-loop feed-forward control with stability arising from mechanics. Here we present a multi-legged control architecture for rough terrain using a segmental robot with 3 actuators for every 2 legs, which we validated in simulation for robots with 6 to 16 legs. Segments have identical state machines, and each segment also receives input from the segment in front of it. Our design bridges the gap between WalkNet-like event cascade controllers and CPG-based controllers: it tightly couples to the ground when contact is present, but produces fictive locomotion when ground contact is missing. The approach may be useful as an adaptive and computationally lightweight controller for multi-legged robots, and as a baseline capability for scaffolding the learning of machine learning controllers.
comment: 10 pages, 6 figures
DexHiL: A Human-in-the-Loop Framework for Vision-Language-Action Model Post-Training in Dexterous Manipulation
While Vision-Language-Action (VLA) models have demonstrated promising generalization capabilities in robotic manipulation, deploying them on specific and complex downstream tasks still demands effective post-training. In parallel, Human-in-the-Loop (HiL) learning has proven to be a powerful mechanism for refining robot policies. However, extending this paradigm to dexterous manipulation remains challenging: multi-finger control is high-dimensional, contact-intensive, and exhibits execution distributions that differ markedly from standard arm motions, leaving existing dexterous VLA systems limited in reliability and adaptability. We present DexHiL, the first integrated arm-hand human-in-the-loop framework for dexterous VLA models, enabling coordinated interventions over the arm and the dexterous hand within a single system. DexHiL introduces an intervention-aware data sampling strategy that prioritizes corrective segments for post-training, alongside a lightweight teleoperation interface that supports instantaneous human corrections during execution. Real-robot experiments demonstrate that DexHiL serves as an effective post-training framework, yielding a substantial performance leap, outperforming standard offline-only fine-tuning baselines by an average of 25% in success rates across distinct tasks. Project page: https://chenzhongxi-sjtu.github.io/dexhil/
comment: 9 pages, 5 figures
PM-Nav: Priori-Map Guided Embodied Navigation in Functional Buildings
Existing language-driven embodied navigation paradigms face challenges in functional buildings (FBs) with highly similar features, as they lack the ability to effectively utilize priori spatial knowledge. To tackle this issue, we propose a Priori-Map Guided Embodied Navigation (PM-Nav), wherein environmental maps are transformed into navigation-friendly semantic priori-maps, a hierarchical chain-of-thought prompt template with an annotation priori-map is designed to enable precise path planning, and a multi-model collaborative action output mechanism is built to accomplish positioning decisions and execution control for navigation planning. Comprehensive tests using a home-made FB dataset show that the PM-Nav obtains average improvements of 511% and 1175%, and 650% and 400% over the SG-Nav and the InstructNav in simulation and real-world, respectively. These tremendous boosts elucidate the great potential of using the PM-Nav as a backbone navigation framework for FBs.
comment: 6 pages, 4 figures
Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges
Emerging generative world models and vision-language-action (VLA) systems are rapidly reshaping automated driving by enabling scalable simulation, long-horizon forecasting, and capability-rich decision making. Across these directions, latent representations serve as the central computational substrate: they compress high-dimensional multi-sensor observations, enable temporally coherent rollouts, and provide interfaces for planning, reasoning, and controllable generation. This paper proposes a unifying latent-space framework that synthesizes recent progress in world models for automated driving. The framework organizes the design space by the target and form of latent representations (latent worlds, latent actions, latent generators; continuous states, discrete tokens, and hybrids) and by structural priors for geometry, topology, and semantics. Building on this taxonomy, the paper articulates five cross-cutting internal mechanics (i.e., structural isomorphism, long-horizon temporal stability, semantic and reasoning alignment, value-aligned objectives and post-training, as well as adaptive computation and deliberation) and connects these design choices to robustness, generalization, and deployability. The work also proposes concrete evaluation prescriptions, including a closed-loop metric suite and a resource-aware deliberation cost, designed to reduce the open-loop / closed-loop mismatch. Finally, the paper identifies actionable research directions toward advancing latent world models for decision-ready, verifiable, and resource-efficient automated driving.
comment: 17 pages, 6 figures, under review by IEEE Transactions on Intelligent Transportation Systems (IEEE-T-ITS)
Provably Safe Trajectory Generation for Manipulators Under Motion and Environmental Uncertainties
Robot manipulators operating in uncertain and non-convex environments present significant challenges for safe and optimal motion planning. Existing methods often struggle to provide efficient and formally certified collision risk guarantees, particularly when dealing with complex geometries and non-Gaussian uncertainties. This article proposes a novel risk-bounded motion planning framework to address this unmet need. Our approach integrates a rigid manipulator deep stochastic Koopman operator (RM-DeSKO) model to robustly predict the robot's state distribution under motion uncertainty. We then introduce an efficient, hierarchical verification method that combines parallelizable physics simulations with sum-of-squares (SOS) programming as a filter for fine-grained, formal certification of collision risk. This method is embedded within a Model Predictive Path Integral (MPPI) controller that uniquely utilizes binary collision information from SOS decomposition to improve its policy. The effectiveness of the proposed framework is validated on two typical robot manipulators through extensive simulations and real-world experiments, including a challenging human-robot collaboration scenario, demonstrating sim-to-real transfer of the learned model and its ability to generate safe and efficient trajectories in complex, uncertain settings.
GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models
VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $μ\in \mathbb{R}^3$, log-scale covariance $\log σ\in \mathbb{R}^3$, and learned opacity $α\in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%), and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision demanding tasks.
comment: The results presented in this paper are preliminary. Please note that the experiments are currently ongoing, and the final data is subject to change upon the completion of the study. All ideas, results, methods, and any content herein are the sole property of the authors
High-Slip-Ratio Control for Peak Tire-Road Friction Estimation Using Automated Vehicles
Accurate estimation of the tire-road friction coefficient (TRFC) is critical for ensuring safe vehicle control, especially under adverse road conditions. However, most existing methods rely on naturalistic driving data from regular vehicles, which typically operate under mild acceleration and braking. As a result, the data provide insufficient slip excitation and offer limited observability of the peak TRFC. This paper presents a high-slip-ratio control framework that enables automated vehicles (AVs) to actively excite the peak friction region during empty-haul operations while maintaining operational safety. A simplified Magic Formula tire model is adopted to represent nonlinear slip-force dynamics and is locally fitted using repeated high-slip measurements. To support safe execution in car-following scenarios, we formulate a constrained optimal control strategy that balances slip excitation, trajectory tracking, and collision avoidance. In parallel, a binning-based statistical projection method is introduced to robustly estimate peak TRFC under noise and local sparsity. The framework is validated through both closed-loop simulations and real-vehicle experiments, demonstrating its accuracy, safety, and feasibility for scalable, cost-effective roadway friction screening.
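The slip-force fit and peak estimate described in this abstract can be illustrated with a minimal sketch. It assumes the common simplified Magic Formula shape mu(s) = D sin(C arctan(B s)); the coefficients `B`, `C`, `D`, the function names, and the "median per slip bin, then maximum" rule are illustrative assumptions, not the authors' parameterization or their projection method.

```python
import math
from statistics import median

def magic_formula_mu(slip, B=10.0, C=1.9, D=1.0):
    """Normalized friction vs. slip ratio (assumed simplified Magic Formula)."""
    return D * math.sin(C * math.atan(B * slip))

def binned_peak_mu(slips, mus, bin_width=0.02):
    """Group samples by slip bin, take each bin's median, return the largest median."""
    bins = {}
    for s, mu in zip(slips, mus):
        bins.setdefault(math.floor(s / bin_width), []).append(mu)
    return max(median(values) for values in bins.values())

# Noise-free samples over slip ratios 0.0 to 0.5 (hypothetical data).
slips = [i * 0.005 for i in range(101)]
mus = [magic_formula_mu(s) for s in slips]
peak = binned_peak_mu(slips, mus)
```

Taking a per-bin median rather than the raw maximum is one simple way to keep a single noisy sample from setting the peak estimate.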
3D UAV Trajectory Estimation and Classification from Internet Videos via Language Model
Reliable 3D trajectory estimation of unmanned aerial vehicles (UAVs) is a fundamental requirement for anti-UAV systems, yet the acquisition of large-scale and accurately annotated trajectory data remains prohibitively expensive. In this work, we present a novel framework that derives UAV 3D trajectories and category information directly from Internet-scale UAV videos, without relying on manual annotations. First, language-driven data acquisition is employed to autonomously discover and collect UAV-related videos, while vision-language reasoning progressively filters task-relevant segments. Second, a training-free cross-modal label generation module is introduced to infer 3D trajectory hypotheses and UAV type cues. Third, a physics-informed refinement process is designed to impose temporal smoothness and kinematic consistency on the estimated trajectories. The resulting video clips and trajectory annotations can be readily utilized for downstream anti-UAV tasks. To assess effectiveness and generalization, we conduct zero-shot transfer experiments on a public, well-annotated 3D UAV benchmark. Results reveal a clear data scaling behavior: as the amount of online video data increases, zero-shot transfer performance on the target dataset improves consistently, without any target-domain training. The proposed method closely approaches the current state-of-the-art, highlighting its robustness and applicability to real-world anti-UAV scenarios. Code and datasets will be released upon acceptance.
Quality over Quantity: Demonstration Curation via Influence Functions for Data-Centric Robot Learning ICRA 2026
Learning from demonstrations has emerged as a promising paradigm for end-to-end robot control, particularly when scaled to diverse and large datasets. However, the quality of demonstration data, often collected through human teleoperation, remains a critical bottleneck for effective data-driven robot learning. Human errors, operational constraints, and teleoperator variability introduce noise and suboptimal behaviors, making data curation essential yet largely manual and heuristic-driven. In this work, we propose Quality over Quantity (QoQ), a grounded and systematic approach to identifying high-quality data by defining data quality as the contribution of each training sample to reducing loss on validation demonstrations. To efficiently estimate this contribution, we leverage influence functions, which quantify the impact of individual training samples on model performance. We further introduce two key techniques to adapt influence functions for robot demonstrations: (i) using maximum influence across validation samples to capture the most relevant state-action pairs, and (ii) aggregating influence scores of state-action pairs within the same trajectory to reduce noise and improve data coverage. Experiments in both simulated and real-world settings show that QoQ consistently improves policy performance over prior data selection methods.
comment: Accepted to ICRA 2026, 8 pages
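The two adaptations named in the QoQ abstract, (i) maximum influence across validation samples and (ii) aggregation within a trajectory, can be sketched as follows. This assumes the influence values have already been computed; the function name, the input layout, and the use of a mean as the aggregation are assumptions made for illustration, since the abstract does not specify them.

```python
def trajectory_scores(influence, traj_ids):
    """Score trajectories from a precomputed influence table.

    influence[i][j]: estimated influence of training pair i on validation sample j.
    traj_ids[i]:     identifier of the trajectory that pair i belongs to.
    """
    # (i) keep the maximum influence each training pair has on any validation sample
    per_pair = [max(row) for row in influence]

    # (ii) aggregate per-pair scores within each trajectory (mean is an assumption)
    sums, counts = {}, {}
    for score, tid in zip(per_pair, traj_ids):
        sums[tid] = sums.get(tid, 0.0) + score
        counts[tid] = counts.get(tid, 0) + 1
    return {tid: sums[tid] / counts[tid] for tid in sums}

# Hypothetical values: two pairs from trajectory "a", one from trajectory "b".
scores = trajectory_scores([[0.1, 0.5], [0.3, 0.2], [-0.4, -0.1]], ["a", "a", "b"])
```

Trajectories can then be ranked by score and the lowest-scoring ones dropped before training.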
Cutting the Cord: System Architecture for Low-Cost, GPU-Accelerated Bimanual Mobile Manipulation
We present a bimanual mobile manipulator built on the open-source XLeRobot with integrated onboard compute for less than \$1300. Key contributions include: (1) optimized mechanical design maximizing stiffness-to-weight ratio, (2) a Tri-Bus power topology isolating compute from motor-induced voltage transients, and (3) embedded autonomy using NVIDIA Jetson Orin Nano for untethered operation. The platform enables teleoperation, autonomous SLAM navigation, and vision-based manipulation without external dependencies, providing a low-cost alternative for research and education in robotics and robot learning.
Beyond Amplitude: Channel State Information Phase-Aware Deep Fusion for Robotic Activity Recognition ICASSP
Wi-Fi Channel State Information (CSI) has emerged as a promising non-line-of-sight sensing modality for human and robotic activity recognition. However, prior work has predominantly relied on CSI amplitude while underutilizing phase information, particularly in robotic arm activity recognition. In this paper, we present GateFusion-Bidirectional Long Short-Term Memory network (GF-BiLSTM) for Wi-Fi sensing in robotic activity recognition. GF-BiLSTM is a two-stream gated fusion network that encodes amplitude and phase separately and adaptively integrates per-time features through a learned gating mechanism. We systematically evaluate state-of-the-art deep learning models under a Leave-One-Velocity-Out (LOVO) protocol across four input configurations: amplitude only, phase only, amplitude + unwrapped phase, and amplitude + sanitized phase. Experimental results demonstrate that incorporating phase alongside amplitude consistently improves recognition accuracy and cross-speed robustness, with GF-BiLSTM achieving the best performance. To the best of our knowledge, this work provides the first systematic exploration of CSI phase for robotic activity recognition, establishing its critical role in Wi-Fi-based sensing.
comment: Accepted at 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 4--8, 2026, Barcelona, Spain
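The learned gating described in the GF-BiLSTM abstract can be illustrated with a minimal sketch of per-time-step gated fusion. It assumes a single scalar sigmoid gate per time step and a convex combination of the two streams; the function name, the weight vectors `w_amp` and `w_phase`, and `bias` are hypothetical, and the paper's actual gate may be vector-valued or structured differently.

```python
import math

def gated_fusion(amp_feat, phase_feat, w_amp, w_phase, bias):
    """Fuse amplitude and phase features with one sigmoid gate per time step.

    amp_feat, phase_feat: lists of equal-length feature vectors, one per time step.
    """
    fused = []
    for a, p in zip(amp_feat, phase_feat):
        # Gate logit from both streams (assumed linear form).
        z = sum(w * x for w, x in zip(w_amp, a)) \
            + sum(w * x for w, x in zip(w_phase, p)) + bias
        g = 1.0 / (1.0 + math.exp(-z))
        # g near 1 favors amplitude, g near 0 favors phase.
        fused.append([g * x + (1.0 - g) * y for x, y in zip(a, p)])
    return fused

# With zero weights and bias the gate is 0.5, so the streams are averaged.
out = gated_fusion([[1.0, 2.0]], [[3.0, 4.0]], [0.0, 0.0], [0.0, 0.0], 0.0)
```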
ImpedanceDiffusion: Diffusion-Based Global Path Planning for UAV Swarm Navigation with Generative Impedance Control
Safe swarm navigation in cluttered indoor environments requires long-horizon planning, reactive obstacle avoidance, and adaptive compliance. We propose ImpedanceDiffusion, a hierarchical framework that leverages image-conditioned diffusion-based global path planning with Artificial Potential Field (APF) tracking and semantic-aware variable impedance control for aerial drone swarms. The diffusion model generates geometric global trajectories directly from RGB images without explicit map construction. These trajectories are tracked by an APF-based reactive layer, while a VLM-RAG module performs semantic obstacle classification with 90% retrieval accuracy to adapt impedance parameters for mixed obstacle environments during execution. Two diffusion planners are evaluated: (i) a top-view long-horizon planner using single-pass inference and (ii) a first-person-view (FPV) short-horizon planner deployed via a two-stage inference pipeline. Both planners achieve a 100% trajectory generation rate across twenty static and dynamic experimental configurations and are validated via zero-shot sim-to-real deployment on Crazyflie 2.1 drones through the hierarchical APF-impedance control stack. The top-view planner produces smoother trajectories that yield conservative tracking speeds of 1.0-1.2 m/s near hard obstacles and 0.6-1.0 m/s near soft obstacles. In contrast, the FPV planner generates trajectories with greater local clearance and typically higher speeds, reaching 1.4-2.0 m/s near hard obstacles and up to 1.6 m/s near soft obstacles. Across 20 experimental configurations (100 total runs), the framework achieved a 92% success rate while maintaining stable impedance-based formation control with bounded oscillations and no in-flight collisions, demonstrating reliable and adaptive swarm navigation in cluttered indoor environments.
comment: This paper is under review
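The APF-based reactive layer mentioned in the ImpedanceDiffusion abstract can be illustrated with a textbook 2D potential-field force. This is a minimal sketch assuming the classical attractive-plus-repulsive form; the gains `k_att` and `k_rep`, the influence radius, and the function name are illustrative assumptions and do not describe the paper's controller.

```python
import math

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=0.5, influence=1.0):
    """Return the 2D force pulling toward `goal` and pushing away from nearby obstacles."""
    # Attractive term: proportional to the offset from the goal (or next waypoint).
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < influence:
            # Classical repulsive magnitude, active only inside the influence radius.
            mag = k_rep * (1.0 / d - 1.0 / influence) / d ** 2
            fx += mag * dx / d
            fy += mag * dy / d
    return fx, fy

free = apf_force((0.0, 0.0), (2.0, 0.0), [])
blocked = apf_force((0.0, 0.0), (2.0, 0.0), [(0.5, 0.0)])
```

An obstacle between the robot and the goal reduces the forward force, which is the behavior a reactive tracking layer relies on.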
Update-Free On-Policy Steering via Verifiers
In recent years, Behavior Cloning (BC) has become one of the most prevalent methods for enabling robots to mimic human demonstrations. However, despite their successes, BC policies are often brittle and struggle with precise manipulation. To overcome these issues, we propose UF-OPS, an Update-Free On-Policy Steering method that enables the robot to predict the success likelihood of its actions and adapt its strategy at execution time. We accomplish this by training verifier functions using policy rollout data obtained during an initial evaluation of the policy. These verifiers are subsequently used to steer the base policy toward actions with a higher likelihood of success. Our method improves the performance of a black-box diffusion policy without changing the base parameters, making it lightweight and flexible. We present results from both simulation and real-world data and achieve an average 49% improvement in success rate over the base policy across 5 real tasks.
comment: 9 pages, 6 figures
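Steering a frozen policy with a verifier, as the UF-OPS abstract describes, can be illustrated with a best-of-N selection sketch. The selection rule is an assumption made here for illustration; the abstract does not state how the verifier scores are used. `sample_action` and `verifier` are hypothetical callables standing in for the black-box base policy and a trained success predictor.

```python
def steer(sample_action, verifier, observation, num_candidates=8):
    """Draw candidates from the unchanged base policy and keep the best-scored one."""
    candidates = [sample_action(observation) for _ in range(num_candidates)]
    return max(candidates, key=lambda action: verifier(observation, action))

# Toy usage: the "policy" replays fixed scalar actions and the
# "verifier" prefers actions close to 0.5.
_actions = iter([0.1, 0.9, 0.4])
chosen = steer(lambda obs: next(_actions),
               lambda obs, a: -abs(a - 0.5),
               observation=None,
               num_candidates=3)
```

Because only sampling and scoring are involved, the base policy's parameters are never touched.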
Design of a Robot-Assisted Chemical Dialysis System
Scientists perform diverse manual procedures that are tedious and laborious. Such procedures are considered a bottleneck for modern experimental science, as they consume time and increase burdens in fields including material science and medicine. We employ a user-centered approach to designing a robot-assisted system for dialysis, a common multi-day purification method used in polymer and protein synthesis. Through two usability studies, we obtain participant feedback and revise design requirements to develop the final system that satisfies scientists' needs and has the potential for applications in other experimental workflows. We anticipate that integration of this system into real synthesis procedures in a chemical wet lab will decrease workload on scientists during long experimental procedures and provide an effective approach to designing more systems that have the potential to accelerate scientific discovery and liberate scientists from tedious labor.
comment: Accepted at ACM/IEEE International Conference on Human-Robot Interaction (HRI'26), Late Breaking Reports. 5 pages, 2 figures
From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning
We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: https://zhanyisun.github.io/dice.rl.2026/.
Degeneracy-Resilient Teach and Repeat for Geometrically Challenging Environments Using FMCW Lidar
Teach and Repeat (T&R) topometric navigation enables robots to autonomously repeat previously traversed paths without relying on GPS, making it well suited for operations in GPS-denied environments such as underground mines and lunar navigation. State-of-the-art T&R systems typically rely on iterative closest point (ICP)-based estimation; however, in geometrically degenerate environments with sparsely structured terrain, ICP often becomes ill-conditioned, resulting in degraded localization and unreliable navigation performance. To address this challenge, we present a degeneracy-resilient Frequency-Modulated Continuous-Wave (FMCW) lidar T&R navigation system consisting of Doppler velocity-based odometry and degeneracy-aware scan-to-map localization. Leveraging FMCW lidar, which provides per-point radial velocity measurements via the Doppler effect, we extend a geometry-independent, correspondence-free motion estimation to include principled pose uncertainty estimation that remains stable in degenerate environments. We further propose a degeneracy-aware localization method that incorporates per-point curvature for improved data association, and unifies translational and rotational scales to enable consistent degeneracy detection. Closed-loop field experiments across three environments with varying structural richness demonstrate that the proposed system reliably completes autonomous navigation, including in a challenging flat airport test field where a conventional ICP-based system fails.
Hierarchical Task Model Predictive Control for Sequential Mobile Manipulation Tasks
Mobile manipulators are envisioned to serve more complex roles in people's everyday lives. With recent breakthroughs in large language models, task planners have become better at translating human verbal instructions into a sequence of tasks. However, there is still a need for a decision-making algorithm that can seamlessly interface with the high-level task planner to carry out the sequence of tasks efficiently. In this work, building on the idea of nonlinear lexicographic optimization, we propose a novel Hierarchical-Task Model Predictive Control framework that is able to complete sequential tasks with improved performance and reactivity by effectively leveraging the robot's redundancy. Compared to the state-of-the-art task-prioritized inverse kinematic control method, our approach has improved hierarchical trajectory tracking performance by 42% on average when facing task changes, robot singularity and reference variations. Compared to a typical single-task architecture, our proposed hierarchical task control architecture enables the robot to traverse a shorter path in task space and achieves an execution time 2.3 times faster when executing a sequence of delivery tasks. We demonstrated the results with real-world experiments on a 9 degrees of freedom mobile manipulator.
comment: 8 pages. Published in IEEE Robotics and Automation Letters (Volume: 9, Issue: 2, February 2024)
Perceptive Hierarchical-Task MPC for Sequential Mobile Manipulation in Unstructured Semi-Static Environments
As compared to typical mobile manipulation tasks, sequential mobile manipulation poses a unique challenge -- as the robot operates over extended periods, successful task completion is not solely dependent on consistent motion generation but also on the robot's awareness and adaptivity to changes in the operating environment. While existing motion planners can generate whole-body trajectories to complete sequential tasks, they typically assume that the environment remains static and rely on precomputed maps. This assumption often breaks down during long-term operations, where semi-static changes such as object removal, introduction, or shifts are common. In this work, we propose a novel perceptive hierarchical-task model predictive control (HTMPC) framework for efficient sequential mobile manipulation in unstructured, changing environments. To tackle the challenge, we leverage a Bayesian inference framework to explicitly model object-level changes and thereby maintain a temporally accurate representation of the 3D environment; this up-to-date representation is embedded in a lexicographic optimization framework to enable efficient execution of sequential tasks. We validate our perceptive HTMPC approach through both simulated and real-robot experiments. In contrast to baseline methods, our approach systematically accounts for moved and phantom obstacles, successfully completing sequential tasks with higher efficiency and reactivity, without relying on prior maps or external infrastructure.
Robotic Ultrasound Makes CBCT Alive
Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft-tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation-aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration-initialized alignment with linear correlation of linear combination (LC2)-based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow-guided supervision to learn deformation-aware correlation representations, enabling accurate, real-time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation-consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound-guided CBCT updating experiments. Results demonstrate real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions. The source code is publicly available at https://github.com/anonymous-codebase/us-cbct-demo.
comment: 10 pages, 4 figures
Octopus-inspired Distributed Control for Soft Robotic Arms: A Graph Neural Network-Based Attention Policy with Environmental Interaction IROS 2026
This paper proposes SoftGM, an octopus-inspired distributed control architecture for segmented soft robotic arms that learn to reach targets in contact-rich environments using online obstacle discovery without relying on global obstacle geometry. SoftGM formulates each arm section as a cooperative agent and represents the arm-environment interaction as a graph. SoftGM uses a two-stage graph attention message passing scheme following a Centralised Training Decentralised Execution (CTDE) paradigm with a centralised critic and decentralised actor. We evaluate SoftGM in a Cosserat-rod simulator (PyElastica) across three tasks that increase the complexity of the environment: obstacle-free, structured obstacles, and a wall-with-hole scenario. Compared with six widely used MARL baselines (IDDPG, IPPO, ISAC, MADDPG, MAPPO, MASAC) under identical information content and training conditions, SoftGM matches strong CTDE methods in simpler settings and achieves the best performance in the wall-with-hole task. Robustness tests with observation noise, single-section actuation failure, and transient disturbances show that SoftGM preserves success while keeping control effort bounded, indicating resilient coordination driven by selective contact-relevant information routing.
comment: 9 pages, 6 figures, 2 tables, submitted for IROS 2026
Autonomous Search for Sparsely Distributed Visual Phenomena through Environmental Context Modeling ICRA 2026
Autonomous underwater vehicles (AUVs) are increasingly used to survey coral reefs, yet efficiently locating specific coral species of interest remains difficult: target species are often sparsely distributed across the reef, and an AUV with limited battery life cannot afford to search everywhere. When detections of the target itself are too sparse to provide directional guidance, the robot benefits from an additional signal to decide where to look next. We propose using the visual environmental context -- the habitat features that tend to co-occur with a target species -- as that signal. Because context features are spatially denser and often vary more smoothly than target detections, we hypothesize that a reward function targeted at broader environmental context will enable adaptive planners to make better decisions on where to go next, even in regions where no target has yet been observed. Starting from a single labeled image, our method uses patch-level DINOv2 embeddings to perform one-shot detections of both the target species and its surrounding context online. We validate our approach using real imagery collected by an AUV at two reef sites in St. John, U.S. Virgin Islands, simulating the robot's motion offline. Our results demonstrate that one-shot detection combined with adaptive context modeling enables efficient autonomous surveying, sampling up to 75% of the target in roughly half the time required by exhaustive coverage when the target is sparsely distributed, and outperforming search strategies that only use target detections.
comment: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
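One-shot detection from patch-level embeddings, as described in this abstract, can be illustrated by comparing each patch embedding with a single labeled reference. This is a minimal sketch assuming cosine similarity and a fixed threshold; both choices, along with the function names, are assumptions for illustration. The embeddings are plain lists standing in for DINOv2 patch features.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def one_shot_detect(patch_embeddings, reference, threshold=0.8):
    """Return indices of patches whose embedding resembles the labeled reference."""
    return [i for i, emb in enumerate(patch_embeddings)
            if cosine(emb, reference) >= threshold]

# Toy 2D embeddings: patches 0 and 2 point roughly along the reference.
hits = one_shot_detect([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], [1.0, 0.0])
```

Running the same routine with a second reference taken from the habitat around the target would yield the context detections the abstract refers to.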
Characterizing Healthy & Post-Stroke Neuromotor Behavior During 6D Upper-Limb Isometric Gaming: Implications for Design of End-Effector Rehabilitation Robot Interfaces
Successful robot-mediated rehabilitation requires designing games and robot interventions that promote healthy motor practice. However, the interplay between a given user's neuromotor behavior, the gaming interface, and the physical robot makes designing system elements -- and even characterizing what behaviors are "healthy" or pathological -- challenging. We leverage our OpenRobotRehab 1.0 open access data set to assess the characteristics of 13 healthy and 2 post-stroke users' force output, muscle activations, and game performance while executing isometric trajectory tracking tasks using an end-effector rehabilitation robot. We present an assessment of how subtle aspects of interface design impact user behavior; an analysis of how pathological neuromotor behaviors are reflected in end-effector force dynamics; and a novel hidden Markov model (HMM)-based neuromotor behavior classification method based on surface electromyography (sEMG) signals during cyclic motions. We demonstrate that task specification (including which axes are constrained and how users interpret tracking instructions) shapes user behavior; that pathology-related features are detectable in 6D end-effector force data during isometric task execution (with significant differences between healthy and post-stroke profiles in force error and average force production at $p=0.05$); and that healthy neuromotor strategies are heterogeneous and inherently difficult to characterize. We also show that our HMM-based models discriminate healthy and post-stroke neuromotor dynamics where synergy-based decompositions reflect no such differentiation. Lastly, we discuss these results' implications for the design of adaptive end-effector rehabilitation robots capable of promoting healthier movement strategies across diverse user populations.
comment: This work has been submitted to the IEEE for possible publication
Dance2Hesitate: A Multi-Modal Dataset of Dancer-Taught Hesitancy for Understandable Robot Motion
In human-robot collaboration, a robot's expression of hesitancy is a critical factor that shapes human coordination strategies, attention allocation, and safety-related judgments. However, designing hesitant robot motion that generalizes is challenging because the observer's inference is highly dependent on embodiment and context. To address these challenges, we introduce and open-source a multi-modal, dancer-generated dataset of hesitant motion where we focus on specific context-embodiment pairs (i.e., manipulator/human upper-limb approaching a Jenga Tower, and anthropomorphic whole body motion in free space). The dataset includes (i) kinesthetic teaching demonstrations on a Franka Emika Panda reaching from a fixed start configuration to a fixed target (a Jenga tower) with three graded hesitancy levels (slight, significant, extreme) and (ii) synchronized RGB-D motion capture of dancers performing the same reaching behavior using their upper limb across three hesitancy levels, plus full human body sequences for extreme hesitancy. We further provide documentation to enable reproducible benchmarking across robot and human modalities. Across all dancers, we obtained 70 unique whole-body trajectories, 84 upper limb trajectories spanning over the three hesitancy levels, and 66 kinesthetic teaching trajectories spanning over the three hesitancy levels. The dataset can be accessed here: https://brsrikrishna.github.io/Dance2Hesitate/.
comment: Accepted to the Designing Transparent and Understandable Robots (D-TUR) Workshop at the ACM/IEEE International Conference on Human-Robot Interaction (HRI) 2026, Edinburgh, UK
Cross-Hand Latent Representation for Vision-Language-Action Models
Dexterous manipulation is essential for real-world robot autonomy, mirroring the central role of human hand coordination in daily activity. Humans rely on rich multimodal perception--vision, sound, and language-guided intent--to perform dexterous actions, motivating vision-based, language-conditioned manipulation systems for robots. However, training reliable vision-language-action (VLA) models for dexterous manipulation requires large-scale demonstrations across many robotic hands. In addition, as new dexterous embodiments appear rapidly, collecting data for each becomes costly and impractical, creating a need for scalable cross-embodiment learning. We introduce XL-VLA, a vision-language-action framework integrated with a unified latent action space shared across diverse dexterous hands. This embodiment-invariant latent space is directly pluggable into standard VLA architectures, enabling seamless cross-embodiment training and efficient reuse of both existing and newly collected data. Experimental results demonstrate that XL-VLA consistently outperforms baseline VLA models operating in raw joint spaces, establishing it as an effective solution for scalable cross-embodiment dexterous manipulation.
comment: Website: https://xl-vla.github.io
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.
TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy to use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- over 28 tabletop manipulation tasks in simulation and the real world and find it matches or outperforms $\pi_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io
comment: Project website: https://tiptop-robot.github.io
BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
comment: 8 pages. Project page: https://xin-yu-gao.github.io/beacon
Kinodynamic Motion Retargeting for Humanoid Locomotion via Multi-Contact Whole-Body Trajectory Optimization
We present the KinoDynamic Motion Retargeting (KDMR) framework, a novel approach for humanoid locomotion that models the retargeting process as a multi-contact, whole-body trajectory optimization problem. Conventional kinematics-based retargeting methods rely solely on spatial motion capture (MoCap) data, inevitably introducing physically inconsistent artifacts, such as foot sliding and ground penetration, that severely degrade the performance of downstream imitation learning policies. To bridge this gap, KDMR extends beyond pure kinematics by explicitly enforcing rigid-body dynamics and contact complementarity constraints. Further, by integrating ground reaction force (GRF) measurements alongside MoCap data, our method automatically detects heel-toe contact events to accurately replicate complex human-like contact patterns. We evaluate KDMR against the state-of-the-art baseline, GMR, across three key dimensions: 1) the dynamic feasibility and smoothness of the retargeted motions, 2) the accuracy of GRF tracking compared to raw source data, and 3) the training efficiency and final performance of downstream control policies trained via the BeyondMimic framework. Experimental results demonstrate that KDMR significantly outperforms purely kinematic methods, yielding dynamically viable reference trajectories that accelerate policy convergence and enhance overall locomotion stability. Our end-to-end pipeline will be open-sourced upon publication.
NanoBench: A Multi-Task Benchmark Dataset for Nano-Quadrotor System Identification, Control, and State Estimation
Existing aerial-robotics benchmarks target vehicles from hundreds of grams to several kilograms and typically expose only high-level state data. They omit the actuator-level signals required to study nano-scale quadrotors, where low-Reynolds number aerodynamics, coreless DC motor nonlinearities, and severe computational constraints invalidate models and controllers developed for larger vehicles. We introduce NanoBench, an open-source multi-task benchmark collected on the commercially available Crazyflie 2.1 nano-quadrotor (takeoff weight 27 g) in a Vicon motion capture arena. The dataset contains over 170 flight trajectories spanning hover, multi-frequency excitation, standard tracking, and aggressive maneuvers across multiple speed regimes. Each trajectory provides synchronized Vicon ground truth, raw IMU data, onboard extended Kalman filter estimates, PID controller internals, and motor PWM commands at 100 Hz, alongside battery telemetry at 10 Hz, aligned with sub-0.5 ms consistency. NanoBench defines standardized evaluation protocols, train/test splits, and open-source baselines for three tasks: nonlinear system identification, closed-loop controller benchmarking, and onboard state estimation assessment. To our knowledge, it is the first public dataset to jointly provide actuator commands, controller internals, and estimator outputs with millimeter-accurate ground truth on a commercially available nano-scale aerial platform.
comment: 9 pages, 6 figures
Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning
Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.
comment: Project Page: https://pku-epic.github.io/DAPL/
Lightweight 3D LiDAR-Based UAV Tracking: An Adaptive Extended Kalman Filtering Approach
Accurate relative positioning is crucial for swarm aerial robotics, enabling coordinated flight and collision avoidance. Although vision-based tracking has been extensively studied, 3D LiDAR-based methods remain underutilized despite their robustness under varying lighting conditions. Existing systems often rely on bulky, power-intensive sensors, making them impractical for small UAVs with strict payload and energy constraints. This paper presents a lightweight LiDAR-based UAV tracking system incorporating an Adaptive Extended Kalman Filter (AEKF) framework. Our approach effectively addresses the challenges posed by sparse, noisy, and nonuniform point cloud data generated by non-repetitive scanning 3D LiDARs, ensuring reliable tracking while remaining suitable for small drones with strict payload constraints. Unlike conventional filtering techniques, the proposed method dynamically adjusts the noise covariance matrices using innovation and residual statistics, thereby enhancing tracking accuracy under real-world conditions. Additionally, a recovery mechanism ensures continuity of tracking during temporary detection failures caused by scattered LiDAR returns or occlusions. Experimental validation was performed using a Livox Mid-360 LiDAR mounted on a DJI F550 UAV in real-world flight scenarios. The proposed method demonstrated robust UAV tracking performance under sparse LiDAR returns and intermittent detections, consistently outperforming both standard Kalman filtering and particle filtering approaches during aggressive maneuvers. These results confirm that the framework enables reliable relative positioning in GPS-denied environments without the need for multi-sensor arrays or external infrastructure.
comment: Presented at the 19th International Conference on Intelligent Autonomous Systems, IAS-19, Genoa, Italy, June 30 to July 4, 2025. To appear in the Springer post-proceedings of the conference
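The covariance adaptation mentioned in the AEKF abstract above can be illustrated with a minimal scalar sketch. This is not the authors' filter: it assumes a linear random-walk state model, a direct measurement of the state, and an exponential forgetting factor `alpha`, all chosen here for illustration. The measurement noise `r` is re-estimated from the post-update residual and the process noise `q` from the gain-scaled innovation.

```python
def adaptive_kf_step(x, p, q, r, z, alpha=0.95):
    """One step of a scalar Kalman filter with adaptive q and r.

    Hypothetical model for illustration: x_k = x_{k-1} + w, z_k = x_k + v.
    """
    # Predict.
    x_pred = x
    p_pred = p + q
    # Update with measurement z.
    innovation = z - x_pred
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * innovation
    p_new = (1.0 - gain) * p_pred
    # Adapt noise variances with exponential forgetting.
    residual = z - x_new
    r_new = alpha * r + (1.0 - alpha) * (residual ** 2 + p_new)
    q_new = alpha * q + (1.0 - alpha) * (gain * innovation) ** 2
    return x_new, p_new, q_new, r_new


def run_filter(measurements, x0=0.0, p0=1.0, q0=0.01, r0=1.0):
    """Run the adaptive filter over a sequence and return the final state."""
    x, p, q, r = x0, p0, q0, r0
    for z in measurements:
        x, p, q, r = adaptive_kf_step(x, p, q, r, z)
    return x, p, q, r
```

A nonlinear tracker would replace the scalar model with Jacobians of the motion and measurement functions; the adaptation step keeps the same structure with matrices in place of scalars.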
TIMID: Time-Dependent Mistake Detection in Videos of Robot Executions IROS
As robotic systems execute increasingly difficult task sequences, the number of ways in which they can fail grows as well. Video Anomaly Detection (VAD) frameworks typically focus on singular, low-level kinematic or action failures and struggle to identify more complex temporal or spatial task violations, because these do not necessarily manifest as low-level execution errors. To address this problem, the main contribution of this paper is a new VAD-inspired architecture, TIMID, which is able to detect time-dependent robot mistakes when executing high-level tasks. Our architecture receives as inputs a video and prompts describing the task and the potential mistake, and returns a frame-level prediction of whether the mistake is present in the video. By adopting a VAD formulation, the model can be trained with weak supervision, requiring only a single label per video. Additionally, to alleviate the scarcity of data on incorrect executions, we introduce a multi-robot simulation dataset with controlled temporal errors and real executions for zero-shot sim-to-real evaluation. Our experiments demonstrate that out-of-the-box VLMs lack the explicit temporal reasoning required for this task, whereas our framework successfully detects different types of temporal errors. Project: https://ropertunizar.github.io/TIMID/
comment: 8 pages, 5 figures, IROS submission
MuxGel: Simultaneous Dual-Modal Visuo-Tactile Sensing via Spatially Multiplexing and Deep Reconstruction IROS 2026
High-fidelity visuo-tactile sensing is important for precise robotic manipulation. However, most vision-based tactile sensors face a fundamental trade-off: opaque coatings enable tactile sensing but block pre-contact vision. To address this, we propose MuxGel, a spatially multiplexed sensor that captures both external visual information and contact-induced tactile signals through a single camera. By using a checkerboard coating pattern, MuxGel interleaves tactile-sensitive regions with transparent windows for external vision. This design maintains standard form factors, allowing for plug-and-play integration into GelSight-style sensors by simply replacing the gel pad. To recover full-resolution vision and tactile signals from the multiplexed inputs, we develop a U-Net-based reconstruction framework. Leveraging a sim-to-real pipeline, our model effectively decouples and restores high-fidelity tactile and visual fields simultaneously. Experiments on unseen objects demonstrate the framework's generalization and accuracy. Furthermore, we demonstrate MuxGel's utility in grasping tasks, where dual-modality feedback facilitates both pre-contact alignment and post-contact interaction. Results show that MuxGel enhances the perceptual capabilities of existing vision-based tactile sensors while maintaining compatibility with their hardware stacks. Project webpage: https://zhixianhu.github.io/muxgel/.
comment: Submitted to IROS 2026
PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments
Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet.
comment: The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet
Caterpillar-Inspired Spring-Based Compressive Continuum Robot for Bristle-based Exploration
Exploration of confined spaces, such as pipelines and ducts, remains challenging for conventional rigid robots due to limited space, irregular geometry, and restricted access. Inspired by caterpillar locomotion and sensing, this paper presents a compact spring-based tendon-driven continuum robot that integrates with commercial robotic arms for confined-space inspection. The system combines a mechanically compliant continuum body with a tendon actuation module, enabling coupled bending and axial length change, and uses a constant-curvature kinematic model for positional control. Experiments show a mean position error of 4.32 mm under the proposed model and control pipeline. To extend the system from motion to inspection, we integrate an artificial bristle contact sensor and demonstrate surface perception and confined-space exploration through contact interactions. This compact and compliant design offers a cost-effective upgrade for commercial robots and promises effective exploration in challenging environments.
comment: Accepted by RoboSoft 2026
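The constant-curvature kinematic model named in the abstract above has a standard closed form for a single segment. The sketch below assumes one segment with its base at the origin, pointing along z when straight. The parameter names (`kappa`, `phi`, `length`) are illustrative and not taken from the paper. The arc length is left as a free input because the robot described can also change its axial length.

```python
import math


def constant_curvature_tip(kappa, phi, length):
    """Tip position (x, y, z) of one constant-curvature segment.

    kappa:  curvature in 1/m (0 means a straight segment)
    phi:    bending-plane angle about the z axis, in radians
    length: arc length of the segment in m (may vary for a compressive body)
    """
    if abs(kappa) < 1e-9:
        # Straight segment: avoid dividing by a near-zero curvature.
        return (0.0, 0.0, length)
    theta = kappa * length  # total bending angle
    radial = (1.0 - math.cos(theta)) / kappa
    axial = math.sin(theta) / kappa
    return (radial * math.cos(phi), radial * math.sin(phi), axial)
```

For example, a 0.1 m segment bent through a quarter circle in the x-z plane (`kappa = math.pi / 0.2`, `phi = 0`) has its tip at equal radial and axial offsets of `1 / kappa`.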
Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery, and training stability. Specifically, (i) policies derived from supervised fine-tuning (SFT) suffer from compounding errors and struggle to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods, e.g., GRPO, are bottlenecked by sparse outcome rewards. Their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure-dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, the Perception-Grounded Step-Aware auditor evaluates progress step-by-step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, the Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.
comment: 28 pages, 10 figures
$M^2$-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs
Semantic occupancy prediction enables dense 3D geometric and semantic understanding for autonomous driving. However, existing camera-based approaches implicitly assume complete surround-view observations, an assumption that rarely holds in real-world deployment due to occlusion, hardware malfunction, or communication failures. We study semantic occupancy prediction under incomplete multi-camera inputs and introduce $M^2$-Occ, a framework designed to preserve geometric structure and semantic coherence when views are missing. $M^2$-Occ addresses two complementary challenges. First, a Multi-view Masked Reconstruction (MMR) module leverages the spatial overlap among neighboring cameras to recover missing-view representations directly in the feature space. Second, a Feature Memory Module (FMM) introduces a learnable memory bank that stores class-level semantic prototypes. By retrieving and integrating these global priors, the FMM refines ambiguous voxel features, ensuring semantic consistency even when observational evidence is incomplete. We introduce a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, encompassing both deterministic single-view failures and stochastic multi-view dropout scenarios. Under the safety-critical missing back-view setting, $M^2$-Occ improves the IoU by 4.93%. As the number of missing cameras increases, the robustness gap further widens; for instance, under the setting with five missing views, our method boosts the IoU by 5.01%. These gains are achieved without compromising full-view performance. The source code will be publicly released at https://github.com/qixi7up/M2-Occ.
comment: The source code will be publicly released at https://github.com/qixi7up/M2-Occ
Efficient and robust control with spikes that constrain free energy
Animal brains exhibit remarkable efficiency in perception and action, while being robust to both external and internal perturbations. The means by which brains accomplish this remains, for now, poorly understood, hindering our understanding of animal and human cognition, as well as our own implementation of efficient algorithms for control of dynamical systems. A potential candidate for a robust mechanism of state estimation and action computation is the free energy principle, but existing implementations of this principle have largely relied on conventional, biologically implausible approaches without spikes. We propose a novel, efficient, and robust spiking control framework with realistic biological characteristics. The resulting networks function as free energy constrainers, in which neurons only fire if they reduce the free energy of their internal representation. The networks offer efficient operation through highly sparse activity while matching performance with other similar spiking frameworks, and have high resilience against both external (e.g. sensory noise or collisions) and internal perturbations (e.g. synaptic noise and delays or neuron silencing) that such a network would be faced with when deployed by either an organism or an engineer. Overall, our work provides a novel mathematical account for spiking control through constraining free energy, providing both better insight into how brain networks might leverage their spiking substrate and a new route for implementing efficient control algorithms in neuromorphic hardware.
Robotic Scene Cloning: Advancing Zero-Shot Robotic Scene Adaptation in Manipulation via Visual Prompt Editing
Modern robots can perform a wide range of simple tasks and adapt to diverse scenarios in the well-trained environment. However, deploying pre-trained robot models in real-world user scenarios remains challenging due to their limited zero-shot capabilities, often necessitating extensive on-site data collection. To address this issue, we propose Robotic Scene Cloning (RSC), a novel method designed for scene-specific adaptation by editing existing robot operation trajectories. RSC achieves accurate and scene-consistent sample generation by leveraging a visual prompting mechanism and a carefully tuned condition injection module. Not only transferring textures but also performing moderate shape adaptations in response to the visual prompts, RSC demonstrates reliable task performance across a variety of object types. Experiments across various simulated and real-world environments demonstrate that RSC significantly enhances policy generalization in target environments.
DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds
4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% (compared to, say, 45.4% of CenterPoint) on the VoD dataset.
SELF-VLA: A Skill Enhanced Agentic Vision-Language-Action Framework for Contact-Rich Disassembly
Disassembly automation has long been pursued to address the growing demand for efficient and proper recovery of valuable components from the end-of-life (EoL) electronic products. Existing approaches have demonstrated promising and regimented performance by decomposing the disassembly process into different subtasks. However, each subtask typically requires extensive data preparation, model training, and system management. Moreover, these approaches are often task- and component-specific, making them poorly suited to handle the variability and uncertainty of EoL products and limiting their generalization capabilities. All these factors restrict the practical deployment of current robotic disassembly systems and leave them highly reliant on human labor. With the recent development of foundation models in robotics, vision-language-action (VLA) models have shown impressive performance on standard robotic manipulation tasks, but their applicability to complex, contact-rich, and long-horizon industrial practices like disassembly, which requires sequential and precise manipulation, remains limited. To address this challenge, we propose SELF-VLA, an agentic VLA framework that integrates explicit disassembly skills. Experimental studies demonstrate that our framework significantly outperforms current state-of-the-art end-to-end VLA models on two contact-rich disassembly tasks. The video illustration can be found via https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/IROS-VLA-Video.mp4.
TATIC: Task-Aware Temporal Learning for Human Intent Inference from Physical Corrections in Human-Robot Collaboration
In human-robot collaboration (HRC), robots must adapt online to dynamic task constraints and evolving human intent. While physical corrections provide a natural, low-latency channel for operators to convey motion-level adjustments, extracting task-level semantic intent from such brief interactions remains challenging. Existing foundation-model-based approaches primarily rely on vision and language inputs and lack mechanisms to interpret physical feedback. Meanwhile, traditional physical human-robot interaction (pHRI) methods leverage physical corrections for trajectory guidance but struggle to infer task-level semantics. To bridge this gap, we propose TATIC, a unified framework that utilizes torque-based contact force estimation and a task-aware Temporal Convolutional Network (TCN) to jointly infer discrete task-level intent and estimate continuous motion-level parameters from brief physical corrections. Task-aligned feature canonicalization ensures robust generalization across diverse layouts, while an intent-driven adaptation scheme translates inferred human intent into robot motion adaptations. Experiments achieve a 0.904 Macro-F1 score in intent recognition and demonstrate successful hardware validation in collaborative disassembly (see experimental video at https://youtu.be/xF8A52qwEc8).
From Demonstrations to Safe Deployment: Path-Consistent Safety Filtering for Diffusion Policies ICRA 2026
Diffusion policies (DPs) achieve state-of-the-art performance on complex manipulation tasks by learning from large-scale demonstration datasets, often spanning multiple embodiments and environments. However, they cannot guarantee safe behavior, requiring external safety mechanisms. These, however, alter actions in ways unseen during training, causing unpredictable behavior and performance degradation. To address these problems, we propose path-consistent safety filtering (PACS) for DPs. Our approach performs path-consistent braking on a trajectory computed from the sequence of generated actions. In this way, we keep the execution consistent with the training distribution of the policy, maintaining the learned, task-completing behavior. To enable real-time deployment and handle uncertainties, we verify safety using set-based reachability analysis. Our experimental evaluation in simulation and on three challenging real-world human-robot interaction tasks shows that PACS (a) provides formal safety guarantees in dynamic environments, (b) preserves task success rates, and (c) outperforms reactive safety approaches, such as control barrier functions, by up to 68 % in terms of task success. Videos are available at our project website: https://tum-lsy.github.io/pacs.
comment: Accepted to IEEE ICRA 2026. Project page: https://tum-lsy.github.io/pacs/. 8 pages, 4 figures
LLM-Advisor: An LLM Benchmark for Cost-efficient Path Planning across Multiple Terrains
Cost-efficient path planning across multiple terrains is a crucial task in robot navigation, requiring the identification of a path from the start to the goal that not only avoids obstacles but also minimizes the overall travel cost. This is especially crucial for real-world applications where robots need to navigate diverse terrains in outdoor environments with limited opportunities for recharging or refueling. Despite its practical importance, cost-efficient path planning across heterogeneous terrains has received relatively limited attention in prior work. In this paper, we propose LLM-Advisor, a prompt-based, planner-agnostic framework that leverages large language models (LLMs) as non-decisive post-processing advisors for cost refinement, without modifying the underlying planner. While we observe that LLMs may occasionally produce implausible suggestions, we introduce two effective hallucination-mitigation strategies. We further introduce two datasets, MultiTerraPath and RUGD_v2, for systematic evaluation of cost-efficient path planning. Extensive experiments reveal that state-of-the-art LLMs, including GPT-4o, GPT-4-turbo, Gemini-2.5-Flash, and Claude-Opus-4, perform poorly in zero-shot terrain-aware path planning, highlighting their limited spatial reasoning capability. In contrast, the proposed LLM-Advisor (with GPT-4o) improves cost efficiency for 72.37% of A*-planned paths, 69.47% of RRT*-planned paths, and 78.70% of LLM-A*-planned paths. On the MultiTerraPath dataset, LLM-Advisor demonstrates stronger performance on the hard subset, further validating its applicability to real-world scenarios.
Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels
Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning CVPR 2026
Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state-action-state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a self-supervised framework that learns dynamics-aware 3D representations without action or reconstruction supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy, AFRO substantially increases manipulation success rates across 16 simulated and 4 real-world tasks, outperforming existing pre-training approaches. The framework also scales favorably with data volume and task complexity. Qualitative visualizations indicate that AFRO learns semantically rich, discriminative features, offering an effective pre-training solution for 3D representation learning in robotics. Project page: https://kolakivy.github.io/AFRO/
comment: Project Page: https://kolakivy.github.io/AFRO/, accepted by CVPR 2026
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition ICLR 2026
Diffusion-based models for robotic control, including vision-language-action (VLA) and vision-action (VA) policies, have demonstrated significant capabilities. Yet their advancement is constrained by the high cost of acquiring large-scale interaction datasets. This work introduces an alternative paradigm for enhancing policy performance without additional model training. Perhaps surprisingly, we demonstrate that the composed policies can exceed the performance of either parent policy. Our contribution is threefold. First, we establish a theoretical foundation showing that the convex composition of distributional scores from multiple diffusion models can yield a superior one-step functional objective compared to any individual score. A Grönwall-type bound is then used to show that this single-step improvement propagates through entire generation trajectories, leading to systemic performance gains. Second, motivated by these results, we propose General Policy Composition (GPC), a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies via a convex combination and test-time search. GPC is versatile, allowing for the plug-and-play composition of heterogeneous policies, including VA and VLA models, as well as those based on diffusion or flow-matching, irrespective of their input visual modalities. Third, we provide extensive empirical validation. Experiments on Robomimic, PushT, and RoboTwin benchmarks, alongside real-world robotic evaluations, confirm that GPC consistently improves performance and adaptability across a diverse set of tasks. Further analysis of alternative composition operators and weighting strategies offers insights into the mechanisms underlying the success of GPC. These results establish GPC as a simple yet effective method for improving control performance by leveraging existing policies.
comment: Accepted to ICLR 2026. Project Page: https://sagecao1125.github.io/GPC-Site/
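The two ingredients named in the GPC abstract above, a convex combination of scores and a test-time search over the weight, can be sketched in a few lines. This is an illustrative toy, not the released implementation: scores are plain lists of floats, and `evaluate` is a hypothetical user-supplied callable that rates a candidate weight (for instance by rolling out the composed policy).

```python
def compose_scores(score_a, score_b, weight):
    """Convex combination of two score vectors: w * a + (1 - w) * b."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1] for a convex combination")
    if len(score_a) != len(score_b):
        raise ValueError("score vectors must have the same length")
    return [weight * a + (1.0 - weight) * b for a, b in zip(score_a, score_b)]


def search_weight(candidates, evaluate):
    """Return the candidate weight with the highest evaluation value.

    `evaluate` is an assumed callback mapping a weight to a scalar rating.
    """
    return max(candidates, key=evaluate)
```

In a denoising loop, `compose_scores` would be applied at every step to the outputs of the two pre-trained policies, with the weight chosen once by `search_weight` over a small grid such as `[0.0, 0.25, 0.5, 0.75, 1.0]`.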
SPARC: Spatial-Aware Path Planning via Attentive Robot Communication
Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU-gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves a success rate of approximately 75 percent at 30 percent obstacle density, outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms: Multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making
comment: The manuscript is being withdrawn at the request of the first author for the purpose of revising content and re-uploading a revised version with updated data/figures/text . The revised manuscript will be resubmitted to arXiv promptly with the same author list and research theme
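As a rough illustration of distance-aware attention with a distance-constrained mask, the sketch below computes single-head attention weights over neighbors on a grid. The abstract does not state how Manhattan distance enters the attention weights, so the additive penalty `-beta * distance`, the cutoff `max_dist`, and all function names are assumptions made only for this example.

```python
import math


def manhattan(a, b):
    """Manhattan distance between two grid cells given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_aware_attention(query, keys, self_pos, neighbor_pos,
                             beta=0.5, max_dist=5):
    """Single-head attention weights with a distance penalty and a hard mask.

    Assumed form: logit = (q . k) / sqrt(d) - beta * manhattan_distance.
    Neighbors farther than max_dist are masked out (weight 0).
    """
    d = len(query)
    logits = []
    for key, pos in zip(keys, neighbor_pos):
        dist = manhattan(self_pos, pos)
        if dist > max_dist:
            logits.append(None)  # masked neighbor
            continue
        dot = sum(q * k for q, k in zip(query, key))
        logits.append(dot / math.sqrt(d) - beta * dist)

    valid = [value for value in logits if value is not None]
    if not valid:
        return [0.0] * len(keys)

    # Numerically stable softmax over the unmasked logits.
    peak = max(valid)
    exps = [0.0 if value is None else math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three neighbors with identical keys: only distance distinguishes them.
weights = distance_aware_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    self_pos=(0, 0),
    neighbor_pos=[(1, 0), (2, 2), (10, 10)],
)
```

With identical keys, the nearest neighbor receives the largest weight and the neighbor beyond the cutoff receives none, which is the qualitative behavior the abstract attributes to prioritizing spatially relevant messages.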
SynHLMA:Synthesizing Hand Language Manipulation for Articulated Object with Discrete Human Object Interaction Representation
Generating hand grasps with language instructions is a widely studied topic that benefits from embodied AI and VR/AR applications. When extended to hand articulated object interaction (HAOI), hand grasp synthesis requires not only object functionality but also a long-term manipulation sequence along the object deformation. This paper proposes a novel HAOI sequence generation framework, SynHLMA, to synthesize hand language manipulation for articulated objects. Given a complete point cloud of an articulated object, we utilize a discrete HAOI representation to model each hand object interaction frame. Along with the natural language embeddings, the representations are trained by an HAOI manipulation language model to align the grasping process with its language description in a shared representation space. A joint-aware loss is employed to ensure hand grasps follow the dynamic variations of articulated object joints. In this way, our SynHLMA achieves three typical hand manipulation tasks for articulated objects: HAOI generation, HAOI prediction, and HAOI interpolation. We evaluate SynHLMA on our HAOI-lang dataset, and experimental results demonstrate superior hand grasp sequence generation performance compared with the state of the art. We also show a robotic grasping application that enables dexterous grasp execution from imitation learning using the manipulation sequence provided by our SynHLMA. Our code and datasets will be made publicly available.
StructBiHOI: Structured Articulation Modeling for Long-Horizon Bimanual Hand-Object Interaction Generation
Recent progress in 3D hand-object interaction (HOI) generation has primarily focused on single-hand grasp synthesis, while bimanual manipulation remains significantly more challenging. Long-horizon planning instability, fine-grained joint articulation, and complex cross-hand coordination make coherent bimanual generation difficult, especially under multimodal conditions. Existing approaches often struggle to simultaneously ensure temporal consistency, physical plausibility, and semantic alignment over extended sequences. We propose StructBiHOI, a Structured articulation modeling framework for long-horizon Bimanual HOI generation. Our key insight is to structurally disentangle temporal joint planning from frame-level manipulation refinement. Specifically, a jointVAE models long-term joint evolution conditioned on object geometry and task semantics, while a maniVAE refines fine-grained hand poses at the single-frame level. To enable stable and efficient long-sequence generation, we incorporate a state-space-inspired diffusion denoiser based on Mamba, which models long-range dependencies with linear complexity. This hierarchical design facilitates coherent dual-hand coordination and articulated object interaction. Extensive experiments on bimanual manipulation and single-hand grasping benchmarks demonstrate that our method achieves superior long-horizon stability, motion realism, and computational efficiency compared to strong baselines.
Morphological-Symmetry-Equivariant Heterogeneous Graph Neural Network for Robotic Dynamics Learning
We present a morphological-symmetry-equivariant heterogeneous graph neural network, namely MS-HGNN, for robotic dynamics learning, that integrates robotic kinematic structures and morphological symmetries into a single graph network. These structural priors are embedded into the learning architecture as constraints, ensuring high generalizability, sample and model efficiency. The proposed MS-HGNN is a versatile and general architecture that is applicable to various multi-body dynamic systems and a wide range of dynamics learning problems. We formally prove the morphological-symmetry-equivariant property of our MS-HGNN and validate its effectiveness across multiple quadruped robot learning problems using both real-world and simulated data. Our code is made publicly available at https://github.com/lunarlab-gatech/MorphSym-HGNN/.
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors ICLR 2026
Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.
comment: Accepted at ICLR 2026. Project page: https://falcon-vla.github.io/
Automated Coral Spawn Monitoring for Reef Restoration: The Coral Spawn and Larvae Imaging Camera System (CSLICS)
Coral aquaculture for reef restoration requires accurate and continuous spawn counting for resource distribution and larval health monitoring, but current methods are labor-intensive and represent a critical bottleneck in the coral production pipeline. We propose the Coral Spawn and Larvae Imaging Camera System (CSLICS), which uses low-cost modular cameras and object detectors trained using human-in-the-loop labeling approaches for automated spawn counting in larval rearing tanks. This paper details the system engineering, dataset collection, and computer vision techniques to detect, classify, and count coral spawn. Experimental results from mass spawning events demonstrate an F1 score of 82.4% for surface spawn detection at different embryogenesis stages, 65.3% F1 score for sub-surface spawn detection, and a saving of 5,720 hours of labor per spawning event compared to manual sampling methods at the same frequency. Comparison of manual counts with CSLICS monitoring during a mass coral spawning event on the Great Barrier Reef demonstrates CSLICS' accurate measurement of fertilization success and sub-surface spawn counts. These findings enhance the coral aquaculture process and enable upscaling of coral reef restoration efforts to address climate change threats facing ecosystems like the Great Barrier Reef.
comment: 8 pages, 7 figures, accepted for presentation at the IEEE International Conference on Robotics and Automation, 2026
A 26-Gram Butterfly-Inspired Robot Achieving Autonomous Tailless Flight
The flight of biological butterflies represents a unique aerodynamic regime where high-amplitude, low-frequency wingstrokes induce significant body undulations and inertial fluctuations. While existing tailless flapping-wing micro air vehicles typically employ high-frequency kinematics to minimize such perturbations, the lepidopteran flight envelope remains a challenging and underexplored frontier for autonomous robotics. Here, we present AirPulse, a 26-gram butterfly-inspired robot that achieves the first onboard, closed-loop controlled flight for a tailless two-winged platform at this scale. It replicates key biomechanical traits of butterfly flight, utilizing low-aspect-ratio, compliant carbon-fiber-reinforced wings and low-frequency flapping that reproduces characteristic biological body undulations. Leveraging a quantitative mapping of control effectiveness, we introduce a hierarchical control architecture featuring a state estimator, an attitude controller, and a central pattern generator with Stroke Timing Asymmetry Rhythm (STAR), which translates attitude control demands into smooth and stable wingstroke timing and angle-offset modulations. Free-flight experiments demonstrate stable climbing and directed turning maneuvers, proving that autonomous locomotion is achievable even within oscillatory dynamical regimes. By bridging biological morphology with a minimalist control architecture, AirPulse serves as both a hardware-validated model for decoding butterfly flight dynamics and a prototype for a new class of collision-resilient aerial robots. Its lightweight and compliant structure offers a non-invasive solution for a wide range of applications, such as ecological monitoring and confined-space inspection, where traditional drones may fall short.
Enhancing Heterogeneous Multi-Agent Cooperation in Decentralized MARL via GNN-driven Intrinsic Rewards AAMAS 2025
Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for various sequential decision-making and control tasks. Unlike their single-agent counterparts, multi-agent systems necessitate successful cooperation among the agents. The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals. These challenges become more pronounced under partial observability and the lack of prior knowledge about agent heterogeneity. While notable studies use intrinsic motivation (IM) to address reward sparsity or cooperation in decentralized settings, those dealing with heterogeneity typically assume centralized training, parameter sharing, and agent indexing. To overcome these limitations, we propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies in decentralized settings, under the challenges of partial observability and reward sparsity. Evaluation of CoHet in the Multi-agent Particle Environment (MPE) and Vectorized Multi-Agent Simulator (VMAS) benchmarks demonstrates superior performance compared to the state-of-the-art in a range of cooperative multi-agent scenarios. Our research is supplemented by an analysis of the impact of the agent dynamics model on the intrinsic motivation module, insights into the performance of different CoHet variants, and its robustness to an increasing number of heterogeneous agents.
comment: Full paper version for AAMAS 2025, 9 pages, 5 figures
You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation ICRA 2026
Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found at https://mikigom.github.io/YOPO-project-page.
comment: This paper has been accepted by IEEE ICRA 2026
Image Compression Using Novel View Synthesis Priors
Real-time visual feedback is essential for tetherless control of remotely operated vehicles, particularly during inspection and manipulation tasks. Though acoustic communication is the preferred choice for medium-range communication underwater, its limited bandwidth renders it impractical to transmit images or videos in real time. To address this, we propose a model-based image compression technique that leverages prior mission information. Our approach employs trained machine-learning-based novel view synthesis models, and uses gradient descent optimization to refine latent representations to help generate compressible differences between camera images and rendered images. We evaluate the proposed compression technique using a dataset from an artificial ocean basin, demonstrating superior compression ratios and image quality over existing techniques. Moreover, our method exhibits robustness to the introduction of new objects within the scene, highlighting its potential for advancing tetherless remotely operated vehicle operations.
comment: Preprint submitted to IEEE Journal of Oceanic Engineering (v2.0)
RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass those of skilled human operators. We present RL-100, a real-world reinforcement learning framework built on diffusion visuomotor policies. RL-100 unifies imitation and reinforcement learning under a single clipped PPO surrogate objective applied within the denoising process, yielding conservative and stable improvements across offline and online stages. To meet deployment latency requirements, a lightweight consistency distillation method compresses multi-step diffusion into a one-step controller for high-frequency control. The framework is task-, embodiment-, and representation-agnostic, and supports both single-action and action-chunking control. We evaluate RL-100 on eight diverse real-robot tasks, from dynamic pushing and agile bowling to pouring, cloth folding, unscrewing, multi-stage juicing, and long-horizon box folding. RL-100 attains 100 percent success across evaluated trials, for a total of 1000 out of 1000 episodes, including up to 250 out of 250 consecutive trials on one task. It matches or surpasses expert teleoperators in time to completion. Without retraining, a single policy attains approximately 90 percent zero-shot success under environmental and dynamics shifts, adapts in a few-shot regime to significant task variations (86.7 percent), and remains robust to aggressive human perturbations (about 96 percent). Notably, our juicing robot served random customers continuously for about seven hours without failure when deployed zero-shot in a shopping mall. These results suggest a practical path to deployment-ready robot learning by starting from human priors, aligning training objectives with human-grounded metrics, and reliably extending performance beyond human demonstrations.
comment: https://lei-kun.github.io/RL-100/
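For readers unfamiliar with the clipped PPO surrogate named in the RL-100 abstract, the standard per-sample form is min(r·A, clip(r, 1-ε, 1+ε)·A), where r is the likelihood ratio between the new and old policies and A is the advantage. The sketch below shows only that generic objective; how RL-100 defines the ratio within the denoising process is not described in the abstract, and the function names and default `eps=0.2` are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    ratio:     new-policy likelihood divided by old-policy likelihood.
    advantage: advantage estimate for the sampled action.
    eps:       clipping range (0.2 is a common default, assumed here).
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum makes the objective a pessimistic (conservative) bound.
    return min(ratio * advantage, clipped_ratio * advantage)


def mean_clipped_surrogate(ratios, advantages, eps=0.2):
    """Average the clipped surrogate over a batch (to be maximized)."""
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# A large ratio with positive advantage is capped at (1 + eps) * A;
# a small ratio with negative advantage is held at (1 - eps) * A.
capped_gain = clipped_surrogate(1.5, 1.0)
capped_loss = clipped_surrogate(0.5, -1.0)
```

The clipping removes the incentive to move the policy far from the one that collected the data, which matches the abstract's description of conservative and stable improvements.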
Revisiting Replanning from Scratch: Real-Time Incremental Planning with Fast Almost-Surely Asymptotically Optimal Planners ICRA
Robots operating in changing environments must predict obstacle changes and/or plan quickly enough to react to them. Predictive approaches require a strong prior about the position and motion of obstacles. Reactive approaches require no assumptions about their environment but must replan quickly and find high-quality paths to navigate effectively. Reactive approaches often reuse information between queries to reduce planning cost. These techniques are conceptually sound but updating dense planning graphs when information changes can be computationally prohibitive. It can also require significant effort to detect the changes in some applications. This paper revisits the long-held assumption that reactive replanning requires updating existing plans. It shows that the incremental planning problem can alternatively be solved more efficiently as a series of independent problems using fast almost-surely asymptotically optimal (ASAO) planning algorithms. These ASAO algorithms quickly find an initial solution and converge towards an optimal solution, which allows them to find consistent global plans in the presence of changing obstacles without requiring explicit plan reuse. This is demonstrated with simulated experiments where Effort Informed Trees (EIT*) finds shorter median solution paths than the tested reactive planning algorithms and is further validated using Asymptotically Optimal RRT-Connect (AORRTC) on a real-world planning problem on a robot arm.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2026, 8 pages, 5 figures, 1 table. A video of this work can be found at https://www.youtube.com/watch?v=XaZrFy8wGZs
RoboRouter: Training-Free Policy Routing for Robotic Manipulation
Research on robotic manipulation has developed a diverse set of policy paradigms, including vision-language-action (VLA) models, vision-action (VA) policies, and code-based compositional approaches. Concrete policies typically attain high success rates on specific task distributions but limited generalization beyond them. Rather than proposing another monolithic policy, we propose to leverage the complementary strengths of existing approaches through intelligent policy routing. We introduce RoboRouter, a training-free framework that maintains a pool of heterogeneous policies and learns to select the best-performing policy for each task through accumulated execution experience. Given a new task, RoboRouter constructs a semantic task representation, retrieves historical records of similar tasks, predicts the optimal policy choice without requiring trial-and-error, and incorporates structured feedback to refine subsequent routing decisions. Integrating a new policy into the system requires only lightweight evaluation and incurs no training overhead. Across simulation benchmark and real-world evaluations, RoboRouter consistently outperforms individual policies, improving average success rate by more than 3% in simulation and over 13% in real-world settings, while preserving execution efficiency. Our results demonstrate that intelligent routing across heterogeneous, off-the-shelf policies provides a practical and scalable pathway toward building more capable robotic systems.
comment: We need to withdraw the paper as some of the reference papers are incorrect and need to be removed
Multimodal Adversarial Quality Policy for Safe Grasping
Vision-guided robot grasping based on Deep Neural Networks (DNNs) generalizes well but poses safety risks in Human-Robot Interaction (HRI). Recent works addressed this by designing benign adversarial attacks and patches with RGB modality, yet depth-independent characteristics limit their effectiveness on RGBD modality. In this work, we propose the Multimodal Adversarial Quality Policy (MAQP) to realize multimodal safe grasping. Our framework introduces two key components. First, the Heterogeneous Dual-Patch Optimization Scheme (HDPOS) mitigates the distribution discrepancy between RGB and depth modalities in patch generation by adopting modality-specific initialization strategies, employing a Gaussian distribution for depth patches and a uniform distribution for RGB patches, while jointly optimizing both modalities under a unified objective function. Second, the Gradient-Level Modality Balancing Strategy (GLMBS) is designed to resolve the optimization imbalance between RGB and depth patches in patch shape adaptation by reweighting gradient contributions based on per-channel sensitivity analysis and applying distance-adaptive perturbation bounds. We conduct extensive experiments on benchmark datasets and a cobot, showing the effectiveness of MAQP.
comment: submitted
Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations. Project page: https://jiiiisoo.github.io/Pri4R/
Vectorized Online POMDP Planning ICRA 2026
Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such planning problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving the numerical computation to consist only of estimating expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least $20\times$ more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is $1000\times$ smaller.
comment: 8 pages, 3 figures. Accepted at ICRA 2026
NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions ICRA 2026
Instruction-following navigation is a key step toward embodied intelligence. Prior benchmarks mainly focus on semantic understanding but overlook systematically evaluating navigation agents' spatial perception and reasoning capabilities. In this work, we introduce the NavSpace benchmark, which contains six task categories and 1,228 trajectory-instruction pairs designed to probe the spatial intelligence of navigation agents. On this benchmark, we comprehensively evaluate 22 navigation agents, including state-of-the-art navigation models and multimodal large language models. The evaluation results lift the veil on spatial intelligence in embodied navigation. Furthermore, we propose SNav, a new spatially intelligent navigation model. SNav outperforms existing navigation agents on NavSpace and real robot tests, establishing a strong baseline for future work.
comment: ICRA 2026
EgoMI: Learning Active Vision and Whole-Body Manipulation from Egocentric Human Demonstrations
Imitation learning from human demonstrations offers a promising approach for robot skill acquisition, but egocentric human data introduces fundamental challenges due to the embodiment gap. During manipulation, humans actively coordinate head and hand movements, continuously reposition their viewpoint and use pre-action visual fixation search strategies to locate relevant objects. These behaviors create dynamic, task-driven head motions that static robot sensing systems cannot replicate, leading to a significant distribution shift that degrades policy performance. We present EgoMI (Egocentric Manipulation Interface), a framework that captures synchronized end-effector and active head trajectories during manipulation tasks, resulting in data that can be retargeted to compatible semi-humanoid robot embodiments. To handle rapid and wide-spanning head viewpoint changes, we introduce a memory-augmented policy that selectively incorporates historical observations. We evaluate our approach on a bimanual robot equipped with an actuated camera head and find that policies with explicit head-motion modeling consistently outperform baseline methods. Results suggest that coordinated hand-eye learning with EgoMI effectively bridges the human-robot embodiment gap for robust imitation learning on semi-humanoid embodiments. Project page: https://egocentric-manipulation-interface.github.io
Score Matching Diffusion Based Feedback Control and Planning of Nonlinear Systems
In this paper, we propose a deterministic diffusion-based framework for controlling the probability density of nonlinear control-affine systems, with theoretical guarantees for drift-free and linear time-invariant (LTI) dynamics. The central idea is to first excite the system with white noise so that a forward diffusion process explores the reachable regions of state space, and then to design a deterministic feedback law that acts as a denoising mechanism driving the system back toward a desired target distribution supported on the target set. This denoising phase provides a feedback controller that steers the control system to the target set. In this framework, control synthesis reduces to constructing a deterministic reverse process that reproduces the desired evolution of state densities. We derive existence conditions ensuring such deterministic realizations of time-reversals for controllable drift-free and LTI systems, and show that the resulting feedback laws provide a tractable alternative to nonlinear control by viewing density control as a relaxation of controlling a system to target sets. Numerical studies on a unicycle model with obstacles, a five-dimensional driftless system, and a four-dimensional LTI system demonstrate reliable diffusion-inspired density control.
Automated Layout and Control Co-Design of Robust Multi-UAV Transportation Systems
The joint optimization of physical parameters and controllers in robotic systems is challenging. This is due to the difficulty of predicting the effect that changes in physical parameters have on final performance. At the same time, physical and morphological modifications can improve robot capabilities, perhaps completely unlocking new skills and tasks. We present a novel approach to co-optimize the physical layout and the control of a cooperative aerial transportation system. The goal is to achieve the most precise and robust flight when carrying a payload. We assume the agents are connected to the payload through rigid attachments, essentially transforming the whole system into a larger flying object with "thrust modules" at the attachment locations of the quadcopters. We investigate the optimal arrangement of the thrust modules around the payload, so that the resulting system achieves the best disturbance rejection capabilities. We propose a novel metric of robustness inspired by H2 control, and propose an algorithm to jointly optimize the layout of the vehicles around the object and their controller. We experimentally validate the effectiveness of our approach using fleets of three and four quadcopters and payloads of diverse shapes.
comment: 7 pages, 7 figures, journal paper (IEEE RA-L)
Global End-Effector Pose Control of an Underactuated Aerial Manipulator via Reinforcement Learning ICRA 2026
Aerial manipulators, which combine robotic arms with multi-rotor drones, face strict constraints on arm weight and mechanical complexity. In this work, we study a lightweight 2-degree-of-freedom (DoF) arm mounted on a quadrotor via a differential mechanism, capable of full six-DoF end-effector pose control. While the minimal design enables simplicity and reduced payload, it also introduces challenges such as underactuation and sensitivity to external disturbances. To address these, we employ reinforcement learning, training a Proximal Policy Optimization (PPO) agent in simulation to generate feedforward commands for quadrotor acceleration and body rates, along with joint angle targets. These commands are tracked by an incremental nonlinear dynamic inversion (INDI) attitude controller and a PID joint controller, respectively. Flight experiments demonstrate centimeter-level position accuracy and degree-level orientation precision, with robust performance under external force disturbances, including manipulation of heavy loads and pushing tasks. The results highlight the potential of learning-based control strategies for enabling contact-rich aerial manipulation using simple, lightweight platforms. Videos of the experiment and the method are summarized in https://youtu.be/bWLTPqKcCOA.
comment: 8 pages, 6 figures, accepted by IEEE ICRA 2026
World Models That Know When They Don't Know - Controllable Video Generation with Calibrated Uncertainty
Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model's uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in the RGB space for intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.
Dull, Dirty, Dangerous: Understanding the Past, Present, and Future of a Key Motivation for Robotics
In robotics, the concept of "dull, dirty, and dangerous" (DDD) work has been used to motivate where robots might be useful. In this paper, we conduct an empirical analysis of robotics publications between 1980 and 2024 that mention DDD, and find that only 2.7% of publications define DDD and 8.7% of publications provide concrete examples of tasks or jobs that are DDD. We then review the social science literature on "dull," "dirty," and "dangerous" work to provide definitions and guidance on how to conceptualize DDD for robotics. Finally, we propose a framework that helps the robotics community consider the job context for our technology, encouraging a more informed perspective on how robotics may impact human labor.
REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning? ICLR 2026
Robot task planning decomposes human instructions into executable action sequences that enable robots to complete a series of complex tasks. Although recent large language model (LLM)-based task planners achieve strong performance, they assume that human instructions are clear and straightforward. However, real-world users are not experts, and their instructions to robots often contain significant vagueness. Linguists suggest that such vagueness frequently arises from referring expressions (REs), whose meanings depend heavily on dialogue context and environment. This vagueness is even more prevalent among the elderly and children, the groups that robots should serve most. This paper studies how such vagueness in REs within human instructions affects LLM-based robot task planning and how to overcome this issue. To this end, we propose the first robot task planning benchmark that systematically models vague REs grounded in pragmatic theory (REI-Bench), where we discover that the vagueness of REs can severely degrade robot planning performance, leading to success rate drops of up to 36.9%. We also observe that most failure cases stem from missing objects in planners. To mitigate the REs issue, we propose a simple yet effective approach: task-oriented context cognition, which generates clear instructions for robots, achieving state-of-the-art performance compared to aware prompts, chains of thought, and in-context learning. By tackling the overlooked issue of vagueness, this work contributes to the research community by advancing real-world task planning and making robots more accessible to non-expert users, e.g., the elderly and children.
comment: Accepted at ICLR 2026
Safe and Optimal Learning from Preferences via Weighted Temporal Logic with Applications in Robotics and Formula 1
Autonomous systems increasingly rely on human feedback to align their behavior, expressed as pairwise comparisons, rankings, or demonstrations. While existing methods can adapt behaviors, they often fail to guarantee safety in safety-critical domains. We propose a safety-guaranteed, optimal, and efficient approach for solving the learning problem from preferences, rankings, or demonstrations using Weighted Signal Temporal Logic (WSTL). WSTL learning problems, when implemented naively, lead to multi-linear constraints in the weights to be learned. By introducing structural pruning and log-transform procedures, we reduce the problem size and recast it as a Mixed-Integer Linear Program while preserving safety guarantees. Experiments on robotic navigation and real-world Formula 1 data demonstrate that the method captures nuanced preferences and models complex task objectives.
comment: 8 pages, 2 figures
NaviGait: Navigating Dynamically Feasible Gait Libraries using Deep Reinforcement Learning
Reinforcement learning (RL) has emerged as a powerful method to learn robust control policies for bipedal locomotion. Yet, it can be difficult to tune desired robot behaviors due to unintuitive and complex reward design. In comparison, trajectory optimization-based methods offer more tuneable, interpretable, and mathematically grounded motion plans for high-dimensional legged systems. However, these methods often remain brittle to real-world disturbances like external perturbations. In this work, we present NaviGait, a hierarchical framework that combines the structure of trajectory optimization with the adaptability of RL for robust and intuitive locomotion control. NaviGait leverages RL to synthesize new motions by selecting, minimally morphing, and stabilizing gaits taken from an offline-generated gait library. NaviGait results in walking policies that match the reference motion well while maintaining robustness comparable to other locomotion controllers. Additionally, the structure imposed by NaviGait drastically simplifies the RL reward composition. Our experimental results demonstrate that NaviGait enables faster training compared to conventional and imitation-based RL, and produces motions that remain closest to the original reference. Overall, by decoupling high-level motion generation from low-level correction, NaviGait offers a more scalable and generalizable approach for achieving dynamic and robust locomotion. Videos and the full framework are publicly available at https://dynamicmobility.github.io/navigait/
comment: Accepted to the International Conference on Robotics and Automation (2026). 8 pages, 9 figures
A Distributional Treatment of Real2Sim2Real for Object-Centric Agent Adaptation in Vision-Driven Deformable Linear Object Manipulation
We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute the posterior distributions over the physical parameters with which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies (i.e. assuming only visual and proprioceptive sensory input) for a DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e. without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.
Physics-Conditioned Grasping for Stable Tool Use
Tool use often fails not because robots misidentify tools, but because grasps cannot withstand task-induced wrench. Existing vision-language manipulation systems ground tools and contact regions from language yet select grasps under quasi-static or geometry-only assumptions. During interaction, inertial impulse and lever-arm amplification generate wrist torque and tangential loads that trigger slip and rotation. We introduce inverse Tool-use Planning (iTuP), which selects grasps by minimizing predicted interaction wrench along a task-conditioned trajectory. From rigid-body mechanics, we derive torque, slip, and alignment penalties, and train a Stable Dynamic Grasp Network (SDG-Net) to approximate these trajectory-conditioned costs for real-time scoring. Across hammering, sweeping, knocking, and reaching in simulation and on hardware, SDG-Net suppresses induced torque up to 17.6%, shifts grasps below empirically observed instability thresholds, and improves real-world success by 17.5% over a compositional baseline. Improvements concentrate where wrench amplification dominates, showing that robot tool use requires wrench-aware grasp selection, not perception alone.
comment: In submission and under review
Asset-Centric Metric-Semantic Maps of Indoor Environments
Large Language Models (LLMs) can help robots reason about abstract task specifications. This requires augmenting classical representations of the environment used by robots, such as point-clouds and meshes, with natural language-based priors. There are a number of approaches to do so in the existing literature. While some navigation frameworks leverage scene-level semantics at the expense of object-level detail, others such as language-guided neural radiance fields (NeRFs) or segment-anything 3D (SAM3D) prioritize object accuracy over global scene context. This paper argues that we can get the best of both worlds. We use a Unitree Go2 quadruped with a RealSense stereo camera (RGB-D data) to build an explicit metric-semantic representation of indoor environments. This is a scene-scale representation with each object (e.g., chairs, couches, doors, of various shapes and sizes) represented by a detailed mesh, its category, and a pose. We show that this representation is more accurate than foundation-model-based maps such as those built by SAM3D, as well as state-of-the-art scene-level robotics mapping pipelines such as Clio (Maggio et al., 2024). Our implementation is about 25$\times$ faster than SAM3D and is about 10$\times$ slower than Clio. We can also adapt our approach to enable open-set scene-level mapping, i.e., when object meshes are not known a priori, by building upon SAM3D to further improve precision and recall. We show how this representation can be readily used with LLMs such as Google's Gemini to demonstrate scene understanding, complex inferences, and planning. We also display the utility of having these representations for semantic navigation in simulated warehouse and hospital settings using NVIDIA's Isaac Sim.
comment: 9 pages, 8 figures, 3 tables
Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale ICRA 2026
Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets, weights, and videos are available at: https://robotcontrolstack.github.io/
comment: Accepted at ICRA 2026
Relative Localization System Design for SnailBot: A Modular Self-reconfigurable Robot
This paper presents the design and implementation of a relative localization system for SnailBot, a modular self-reconfigurable robot. The system integrates ArUco marker recognition, optical flow analysis, and IMU data processing into a unified fusion framework, enabling robust and accurate relative positioning for collaborative robotic tasks. Experimental validation demonstrates the effectiveness of the system in real-time operation, with a rule-based fusion strategy ensuring reliability across dynamic scenarios. The results highlight the potential for scalable deployment in modular robotic systems.
comment: The paper contains factual errors and logical flaws, which need to be repaired before submission
Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions
From autonomous driving to package delivery, ensuring safe yet efficient multi-agent interaction is challenging as the interaction dynamics are influenced by hard-to-model factors such as social norms and contextual cues. Understanding these influences can aid in the design and evaluation of socially-aware autonomous agents whose behaviors are aligned with human values. In this work, we seek to codify factors governing safe multi-agent interactions via the lens of responsibility, i.e., an agent's willingness to deviate from their desired control to accommodate safe interaction with others. Specifically, we propose a data-driven modeling approach based on control barrier functions and differentiable optimization that efficiently learns agents' responsibility allocation from data. We demonstrate on synthetic and real-world datasets that we can obtain an interpretable and quantitative understanding of how much agents adjust their behavior to ensure the safety of others given their current environment.
comment: 8 pages, 7 figures
UniBYD: A Unified Framework for Learning Robotic Manipulation Across Embodiments Beyond Imitation of Human Demonstrations
In embodied intelligence, the embodiment gap between robotic and human hands brings significant challenges for learning from human demonstrations. Although some studies have attempted to bridge this gap using reinforcement learning, they remain confined to merely reproducing human manipulation, resulting in limited task performance. Moreover, current methods struggle to support diverse robotic hand configurations. In this paper, we propose UniBYD, a unified framework that uses a dynamic reinforcement learning algorithm to discover manipulation policies aligned with the robot's physical characteristics. To enable consistent modeling across diverse robotic hand morphologies, UniBYD incorporates a unified morphological representation (UMR). Building on UMR, we design a dynamic PPO with an annealed reward schedule, enabling reinforcement learning to transition from offline-informed imitation of human demonstrations to online-adaptive exploration of policies better adapted to diverse robotic morphologies, thereby going beyond mere imitation of human hands. To address the severe state drift caused by the incapacity of early-stage policies, we design a hybrid Markov-based shadow engine that provides fine-grained guidance to anchor the imitation within the expert's manifold. To evaluate UniBYD, we propose UniManip, the first benchmark for cross-embodiment manipulation spanning diverse robotic morphologies. Experiments demonstrate a 44.08% average improvement in success rate over the current state-of-the-art. Upon acceptance, we will release our code and benchmark.
Reactive Slip Control in Multifingered Grasping: Hybrid Tactile Sensing and Internal-Force Optimization ICRA
We present a hybrid learning and model-based approach for reactive internal-force adaptation to halt in-hand slip in a multifingered robotic gripper. A multimodal tactile stack combines piezoelectric (PzE) sensing for fast slip cues with piezoresistive (PzR) arrays for contact localization, enabling online construction of the grasp matrix. Upon slip detection, internal forces are updated in the null space of the grasp through a quadratic program that reinforces normal forces while preserving the object wrench. We demonstrate reactive stabilization of multifingered grasps under external perturbations. Augmenting analytic force control with learned tactile cues enables fast and reliable closed-loop stabilization in the evaluated grasp scenarios. The pipeline yields a theoretical sensing-to-command latency of 35-40 ms, including 5 ms for PzR-based grasp geometry updates and approximately 4 ms for solving the quadratic program. In controlled trials, slip onset is detected after ~20 ms. The analysis supports the feasibility of sub-50 ms integrated closed-loop stabilization.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), 2026
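The null-space idea in the abstract above can be illustrated with a toy case. The sketch below is not the authors' pipeline: it assumes a hypothetical one-dimensional two-finger grasp whose grasp matrix is G = [1, -1], and it replaces the quadratic program with a fixed scalar gain along the null-space direction [1, 1]. It only shows that adding an internal force in the null space of G raises both normal forces while leaving the net object wrench unchanged.

```python
from typing import List

def net_wrench(grasp_matrix: List[List[float]], forces: List[float]) -> List[float]:
    """Net wrench on the object: w = G f."""
    return [sum(g * f for g, f in zip(row, forces)) for row in grasp_matrix]

def add_internal_force(forces: List[float], null_direction: List[float], gain: float) -> List[float]:
    """Add gain * null_direction to the contact forces (a squeeze along the null space of G)."""
    return [f + gain * d for f, d in zip(forces, null_direction)]

# Hypothetical 1-D grasp: finger 1 pushes along +x, finger 2 pushes along -x.
G = [[1.0, -1.0]]
NULL_DIRECTION = [1.0, 1.0]   # G applied to this direction gives zero wrench

forces_before = [2.0, 1.5]    # illustrative contact forces
forces_after = add_internal_force(forces_before, NULL_DIRECTION, gain=0.5)

wrench_before = net_wrench(G, forces_before)
wrench_after = net_wrench(G, forces_after)
```

In a real multifingered grasp the null-space basis would come from the estimated grasp matrix, and the gain would be chosen by an optimizer subject to friction and actuator limits rather than fixed by hand.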
Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics
Fiducial markers are widely used in robotics for navigation, object recognition, and scene understanding. While offering significant advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments, as they are visible to humans, making them unsuitable for many everyday use cases. To address this gap, this paper presents iMarkers, innovative, unobtrusive fiducial markers detectable exclusively by robots and AR devices equipped with adequate sensors and detection algorithms. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and open-sourced software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Numerous evaluations have demonstrated the effectiveness of iMarkers relative to conventional (printed) and blended fiducial markers and have confirmed their applicability across diverse robotics scenarios.
comment: 19 pages, 10 figures, 4 tables
Latent Policy Steering with Embodiment-Agnostic Pretrained World Models
The performance of learned robot visuomotor policies is heavily dependent on the size and quality of the training dataset. Although large-scale robot and human datasets are increasingly available, embodiment gaps and mismatched action spaces make them difficult to leverage. Our main insight is that skills performed across different embodiments produce visual similarities in motions that can be captured using off-the-shelf action representations such as optical flow. Moreover, World Models (WMs) can leverage sub-optimal data since they focus on modeling dynamics. In this work, we aim to improve visuomotor policies in low-data regimes by first pretraining a WM using optical flow as an embodiment-agnostic action representation to leverage accessible or easily collected data from multiple embodiments (robots, humans). Given a small set of demonstrations on a target embodiment, we finetune the WM on this data to better align the WM predictions, train a base policy, and learn a robust value function. Using our finetuned WM and value function, our approach evaluates action candidates from the base policy and selects the best one to improve performance. Our approach, which we term Latent Policy Steering (LPS), improves behavior-cloned policies by 10.6% on average across four Robomimic tasks, even though most of the pretraining data comes from the real world. In the real-world experiments, LPS achieves larger gains: 70% relative improvement with 30-50 target-embodiment demonstrations, and 44% relative improvement with 60-100 demonstrations, compared to a behavior-cloned baseline.
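The candidate-selection step described in the abstract above (evaluate action candidates from the base policy with the finetuned world model and value function, then keep the best) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `world_model` and `value_fn` are hypothetical callables, and the toy scalar dynamics exist only to make the example runnable.

```python
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

def steer(
    state: State,
    candidates: Sequence[Action],
    world_model: Callable[[State, Action], State],
    value_fn: Callable[[State], float],
) -> Action:
    """Return the candidate whose predicted next state has the highest value."""
    if not candidates:
        raise ValueError("need at least one action candidate")
    return max(candidates, key=lambda action: value_fn(world_model(state, action)))

# Toy stand-ins (assumptions for illustration only).
def toy_world_model(state: float, action: float) -> float:
    return state + action          # predicted next latent state

def toy_value(state: float) -> float:
    return -abs(state - 1.0)       # states closer to 1.0 are valued higher

best_action = steer(0.0, [-0.5, 0.2, 0.9, 1.5], toy_world_model, toy_value)
```

Here the candidates would be samples from the behavior-cloned base policy; the selection rule itself is a plain argmax over predicted values.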
Multiagent Systems
Context Engineering: From Prompts to Corporate Multi-Agent Architecture
As artificial intelligence (AI) systems evolve from stateless chatbots to autonomous multi-step agents, prompt engineering (PE), the discipline of crafting individual queries, proves necessary but insufficient. This paper introduces context engineering (CE) as a standalone discipline concerned with designing, structuring, and managing the entire informational environment in which an AI agent makes decisions. Drawing on vendor architectures (Google ADK, Anthropic, LangChain), current academic work (ACE framework, Google DeepMind's intelligent delegation), enterprise research (Deloitte, 2026; KPMG, 2026), and the author's experience building a multi-agent system, the paper proposes five context quality criteria: relevance, sufficiency, isolation, economy, and provenance, and frames context as the agent's operating system. Two higher-order disciplines follow. Intent engineering (IE) encodes organizational goals, values, and trade-off hierarchies into agent infrastructure. Specification engineering (SE) creates a machine-readable corpus of corporate policies and standards enabling autonomous operation of multi-agent systems at scale. Together these four disciplines form a cumulative pyramid maturity model of agent engineering, in which each level subsumes the previous one as a necessary foundation. Enterprise data reveals a gap: while 75% of enterprises plan agentic AI deployment within two years (Deloitte, 2026), deployment has surged and retreated as organizations confront scaling complexity (KPMG, 2026). The Klarna case illustrates a dual deficit, contextual and intentional. Whoever controls the agent's context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.
comment: 15 pages, 1 figure
ToolRosetta: Bridging Open-Source Repositories and Large Language Model Agents through Automated Tool Standardization
Reusing and invoking existing code remains costly and unreliable, as most practical tools are embedded in heterogeneous code repositories and lack standardized, executable interfaces. Although large language models (LLMs) and Model Context Protocol (MCP)-based tool invocation frameworks enable natural language task execution, current approaches rely heavily on manual tool curation and standardization, which fundamentally limits scalability. In this paper, we propose ToolRosetta, a unified framework that automatically translates open-source code repositories and APIs into MCP-compatible tools that can be reliably invoked by LLMs. Given a user task, ToolRosetta autonomously plans toolchains, identifies relevant codebases, and converts them into executable MCP services, enabling end-to-end task completion with minimal human intervention. In addition, ToolRosetta incorporates a security inspection layer to mitigate risks inherent in executing arbitrary code. Extensive experiments across diverse scientific domains demonstrate that ToolRosetta can automatically standardize a large number of open-source tools and reduce the human effort required for code reproduction and deployment. Notably, by seamlessly leveraging specialized open-source tools, ToolRosetta-powered agents consistently improve task completion performance compared to commercial LLMs and existing agent systems.
comment: 20 pages
Strategically Robust Multi-Agent Reinforcement Learning with Linear Function Approximation
Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose \texttt{RQRE-OVI}, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike Nash, and RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that \texttt{RQRE-OVI} achieves competitive performance under self-play while producing substantially more robust behavior under cross-play compared to Nash-based approaches. These results suggest \texttt{RQRE-OVI} offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.
AgenticCyOps: Securing Multi-Agentic AI Integration in Enterprise Cyber Operations
Multi-agent systems (MAS) powered by LLMs promise adaptive, reasoning-driven enterprise workflows, yet granting agents autonomous control over tools, memory, and communication introduces attack surfaces absent from deterministic pipelines. While current research largely addresses prompt-level exploits and narrow individual vectors, it lacks a holistic architectural model for enterprise-grade security. We introduce AgenticCyOps (Securing Multi-Agentic AI Integration in Enterprise Cyber Operations), a framework built on a systematic decomposition of attack surfaces across component, coordination, and protocol layers, revealing that documented vectors consistently trace back to two integration surfaces: tool orchestration and memory management. Building on this observation, we formalize these integration surfaces as primary trust boundaries and define five defensive principles: authorized interfaces, capability scoping, verified execution, memory integrity & synchronization, and access-controlled data isolation; each aligned with established compliance standards (NIST, ISO 27001, GDPR, EU AI Act). We apply the framework to a Security Operations Center (SOC) workflow, adopting the Model Context Protocol (MCP) as the structural basis, with phase-scoped agents, consensus validation loops, and per-organization memory boundaries. Coverage analysis, attack path tracing, and trust boundary assessment confirm that the design addresses the documented attack vectors with defense-in-depth, intercepts three of four representative attack chains within the first two steps, and reduces exploitable trust boundaries by a minimum of 72% compared to a flat MAS, positioning AgenticCyOps as a foundation for securing enterprise-grade integration.
comment: 17 pages, 4 figures, 5 tables
Chaotic Dynamics in Multi-LLM Deliberation
Collective AI systems increasingly rely on multi-LLM deliberation, but their stability under repeated execution remains poorly characterized. We model five-agent LLM committees as random dynamical systems and quantify inter-run sensitivity using an empirical Lyapunov exponent ($\hat{\lambda}$) derived from trajectory divergence in committee mean preferences. Across 12 policy scenarios, a factorial design at $T=0$ identifies two independent routes to instability: role differentiation in homogeneous committees and model heterogeneity in no-role committees. Critically, these effects appear even in the $T=0$ regime where practitioners often expect deterministic behavior. In the HL-01 benchmark, both routes produce elevated divergence ($\hat{\lambda}=0.0541$ and $0.0947$, respectively), while homogeneous no-role committees also remain in a positive-divergence regime ($\hat{\lambda}=0.0221$). The combined mixed+roles condition is less unstable than mixed+no-role ($\hat{\lambda}=0.0519$ vs $0.0947$), showing non-additive interaction. Mechanistically, Chair-role ablation reduces $\hat{\lambda}$ most strongly, and targeted protocol variants that shorten memory windows further attenuate divergence. These results support stability auditing as a core design requirement for multi-LLM governance systems.
comment: Main text: 6 pages, 4 figures; Supplementary Information: 14 pages, 7 supplementary figures
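For readers unfamiliar with empirical Lyapunov exponents, the sketch below shows one common textbook-style estimator: the least-squares slope of the log distance between two trajectories over time. The abstract above does not specify the paper's exact estimator, so this is an assumed stand-in, and the trajectories are synthetic values chosen only for illustration.

```python
import math
from typing import Sequence

def empirical_lyapunov(traj_a: Sequence[float], traj_b: Sequence[float], eps: float = 1e-12) -> float:
    """Least-squares slope of log|a_t - b_t| against the step index t."""
    if len(traj_a) != len(traj_b) or len(traj_a) < 2:
        raise ValueError("trajectories must have equal length of at least 2")
    # Clamp distances at eps so identical points do not produce log(0).
    log_dist = [math.log(max(abs(a - b), eps)) for a, b in zip(traj_a, traj_b)]
    n = len(log_dist)
    t_mean = (n - 1) / 2
    y_mean = sum(log_dist) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(log_dist))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

# Synthetic example: two runs whose mean preferences diverge exponentially at rate 0.05.
run_a = [0.5] * 10
run_b = [0.5 + 1e-3 * math.exp(0.05 * t) for t in range(10)]
estimate = empirical_lyapunov(run_a, run_b)
```

A positive estimate indicates that nearby runs separate over deliberation rounds; a value near zero or below indicates that they stay together or converge.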
Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges
Emerging generative world models and vision-language-action (VLA) systems are rapidly reshaping automated driving by enabling scalable simulation, long-horizon forecasting, and capability-rich decision making. Across these directions, latent representations serve as the central computational substrate: they compress high-dimensional multi-sensor observations, enable temporally coherent rollouts, and provide interfaces for planning, reasoning, and controllable generation. This paper proposes a unifying latent-space framework that synthesizes recent progress in world models for automated driving. The framework organizes the design space by the target and form of latent representations (latent worlds, latent actions, latent generators; continuous states, discrete tokens, and hybrids) and by structural priors for geometry, topology, and semantics. Building on this taxonomy, the paper articulates five cross-cutting internal mechanics (i.e., structural isomorphism, long-horizon temporal stability, semantic and reasoning alignment, value-aligned objectives and post-training, as well as adaptive computation and deliberation) and connects these design choices to robustness, generalization, and deployability. The work also proposes concrete evaluation prescriptions, including a closed-loop metric suite and a resource-aware deliberation cost, designed to reduce the open-loop / closed-loop mismatch. Finally, the paper identifies actionable research directions toward advancing latent world models for decision-ready, verifiable, and resource-efficient automated driving.
comment: 17 pages, 6 figures, under review by IEEE Transactions on Intelligent Transportation Systems (IEEE-T-ITS)
Emotional Modulation in Swarm Decision Dynamics
Collective decision-making in biological and human groups often emerges from simple interaction rules that amplify minor differences into consensus. The bee equation, developed initially to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here, we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) act as modulators of interaction rates, effectively altering the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, allowing the study of emotional contagion in consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the "snowball effect" in which consensus accelerates after surpassing intermediate support thresholds. Results show that emotional modulation can bias decision outcomes and alter convergence times by shifting effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even in fully symmetric emotional conditions. These findings link classical swarm decision theory with affective and social modelling, highlighting how both emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in both natural and artificial systems.
comment: Accepted for presentation at the International Conference on Agents and Artificial Intelligence (ICAART 2026)
Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agent systems relies on ad hoc prompts and lacks a principled policy perspective. Unlike in reinforcement learning, we investigate whether prompt-as-action can be parameterized so as to construct a lightweight policy which consists of a sequence of state-action pairs to influence conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterized prompts offer a simple and effective mechanism to influence the dialogue process, which can support multi-agent systems research on social simulation.
The Bureaucracy of Speed: Structural Equivalence Between Memory Consistency Models and Multi-Agent Authorization Revocation
The temporal assumptions underpinning conventional Identity and Access Management collapse under agentic execution regimes. A sixty-second revocation window permits on the order of $6 \times 10^3$ unauthorized API calls at 100 ops/tick; at AWS Lambda scale, the figure approaches $6 \times 10^5$. This is a coherence problem, not merely a latency problem. We define a Capability Coherence System (CCS) and construct a state-mapping $\varphi : Σ_{\rm MESI} \to Σ_{\rm auth}$ preserving transition structure under bounded-staleness semantics. A safety theorem bounds unauthorized operations for the execution-count Release Consistency-directed Coherence (RCC) strategy at $D_{\rm rcc} \leq n$, independent of agent velocity $v$ -- a qualitative departure from the $O(v \cdot \mathrm{TTL})$ scaling of time-bounded strategies. Tick-based discrete event simulation across three business-contextualised scenarios (four strategies, ten deterministic seeds each) confirms: RCC achieves a $120\times$ reduction versus TTL-based lease in the high-velocity scenario (50 vs. 6,000 unauthorized operations), and $184\times$ under anomaly-triggered revocation. Zero bound violations across all 120 runs confirm the per-capability safety guarantee. Simulation code: https://github.com/hipvlady/prizm
comment: 18 pages, 3 figures. Simulation code at https://github.com/hipvlady/prizm
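The headline figures in the abstract above can be checked with a few lines of arithmetic. The sketch below assumes one tick per second, which the abstract does not state, and uses hypothetical function names; it is not taken from the linked simulation code.

```python
def ttl_exposure(ops_per_tick: int, window_ticks: int) -> int:
    """Worst-case unauthorized operations in a time-bounded window: v * TTL."""
    return ops_per_tick * window_ticks


def reduction_factor(baseline_ops: int, improved_ops: int) -> float:
    """How many times fewer unauthorized operations the improved strategy allows."""
    return baseline_ops / improved_ops


# Assumption: 1 tick = 1 second, so a sixty-second window spans 60 ticks.
window_ops = ttl_exposure(ops_per_tick=100, window_ticks=60)

# Reported high-velocity scenario: 6,000 (TTL-based lease) vs. 50 (RCC).
high_velocity_reduction = reduction_factor(6_000, 50)

print(window_ops, high_velocity_reduction)
```

The exposure of a time-bounded strategy grows linearly with agent velocity, whereas the stated RCC bound $D_{\rm rcc} \leq n$ does not depend on it.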
FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis
Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.
KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial-and-error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories. Specifically, we present KernelSkill, a multi-agent framework with a dual-level memory architecture. KernelSkill operates by coordinating agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.
Noncooperative Human-AI Agent Dynamics
This paper investigates the dynamics of noncooperative interactions between artificial intelligence agents and human decision-makers in strategic environments. In particular, motivated by the extensive literature in behavioral economics, human agents are more faithfully modeled with respect to the state of the art using Prospect Theoretic preferences, while AI agents are modeled with standard expected utility maximization. Prospect Theory incorporates known cognitive heuristics employed by humans, including reference dependence and loss aversion, i.e., greater sensitivity to losses than to comparable gains. This paper runs different combinations of expected utility and prospect theoretic agents in a number of classic matrix games, as well as in examples specialized to tease out distinctions in strategic behavior with respect to preference functions, to explore the emergent behaviors from mixed population (human vs. AI) competition. Extensive numerical simulations are performed across AI, aware humans (those with full knowledge of the game structure and payoffs), and learning Prospect Agents (i.e., for AIs representing humans). A number of interesting observations and patterns emerge, spanning barely distinguishable behavior, behavior corroborating Prospect preference anomalies in the theoretical literature, and surprises. Code can be found at https://github.com/dylanwaldner/noncooperative-human-AI.
comment: 41 pages
Cooperative Game-Theoretic Credit Assignment for Multi-Agent Policy Gradients via the Core
This work focuses on the credit assignment problem in cooperative multi-agent reinforcement learning (MARL). Sharing the global advantage among agents often leads to insufficient policy optimization, as it fails to capture the coalitional contributions of different agents. In this work, we revisit the policy update process from a coalitional perspective and propose CORA, an advantage allocation method guided by a cooperative game-theoretic core allocation. By evaluating the marginal contributions of different coalitions and combining clipped double Q-learning to mitigate overestimation bias, CORA estimates coalition-wise advantages. The core formulation enforces coalition-wise lower bounds on allocated credits, so that coalitions with higher advantages receive stronger total incentives for their participating agents, enabling the global advantage to be attributed to different coalition strategies and promoting coordinated optimal behavior. To reduce computational overhead, we employ random coalition sampling to approximate the core allocation efficiently. Experiments on matrix games, differential games, and multi-agent collaboration benchmarks demonstrate that our method outperforms baselines. These findings highlight the importance of coalition-level credit assignment and cooperative games for advancing multi-agent learning.
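For readers unfamiliar with the core, the sketch below shows the textbook core conditions that the "coalition-wise lower bounds" refer to: the allocation must sum to the grand coalition's value, and every coalition must receive at least its own value. The function name and the toy game are illustrative assumptions, not CORA's implementation, which additionally estimates coalition advantages and samples coalitions at random.

```python
from itertools import combinations


def is_in_core(allocation, coalition_value, tol=1e-9):
    """Check the standard core conditions for a small cooperative game.

    allocation: dict mapping player -> allocated credit.
    coalition_value: dict mapping frozenset of players -> coalition value.
    """
    players = sorted(allocation)
    grand = frozenset(players)
    # Efficiency: the credits must add up to the grand coalition's value.
    if abs(sum(allocation.values()) - coalition_value[grand]) > tol:
        return False
    # Coalition rationality: each proper coalition gets at least its value.
    for size in range(1, len(players)):
        for members in combinations(players, size):
            allocated = sum(allocation[p] for p in members)
            if allocated < coalition_value[frozenset(members)] - tol:
                return False
    return True


# Hypothetical two-player game: each player alone is worth 1, together 4.
toy_game = {
    frozenset({"a"}): 1.0,
    frozenset({"b"}): 1.0,
    frozenset({"a", "b"}): 4.0,
}

print(is_in_core({"a": 2.0, "b": 2.0}, toy_game))
print(is_in_core({"a": 3.5, "b": 0.5}, toy_game))
```

Enumerating all coalitions is exponential in the number of agents, which is consistent with the abstract's motivation for random coalition sampling.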
Polynomial-time Configuration Generator for Connected Unlabeled Multi-Agent Pathfinding ICAPS-26
We consider Connected Unlabeled Multi-Agent Pathfinding (CUMAPF), a variant of MAPF where interchangeable agents must be connected at all times. This problem is fundamental to swarm robotics applications such as self-reconfiguration and marching, where standard MAPF is insufficient as it does not guarantee the connectivity constraint. Despite its simple structure, CUMAPF remains understudied and lacks practical algorithms. We first develop an Integer Linear Programming (ILP) reduction to solve CUMAPF. Although this formulation provides a makespan-optimal plan, it is severely limited in terms of scalability and real-time responsiveness due to the large number of variables. We therefore propose a suboptimal but complete algorithm named PULL. It is based on a rule-based one-step function that computes a subsequent configuration that preserves connectivity and advances towards the target configuration. PULL is lightweight, and runs in $O(n^2)$ time per step in a 2D grid, where $n$ is the number of agents. Empirically, PULL can quickly solve randomly generated instances containing hundreds of agents, which ILP cannot handle. Furthermore, PULL's solution substantially improves upon a naive approach to CUMAPF.
comment: Accepted by ICAPS-26
Multi-Agent Reinforcement Learning with Communication-Constrained Priors
Communication is an effective means of improving the learning of cooperative policies in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing methods for multi-agent reinforcement learning with communication, due to their limited scalability and robustness, struggle to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. We then use this model as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimator, and introduce a communication-constrained multi-agent reinforcement learning framework that quantifies the impact of communication messages in the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.
Enhancing Heterogeneous Multi-Agent Cooperation in Decentralized MARL via GNN-driven Intrinsic Rewards AAMAS 2025
Multi-agent Reinforcement Learning (MARL) is emerging as a key framework for various sequential decision-making and control tasks. Unlike their single-agent counterparts, multi-agent systems necessitate successful cooperation among the agents. The deployment of these systems in real-world scenarios often requires decentralized training, a diverse set of agents, and learning from infrequent environmental reward signals. These challenges become more pronounced under partial observability and the lack of prior knowledge about agent heterogeneity. While notable studies use intrinsic motivation (IM) to address reward sparsity or cooperation in decentralized settings, those dealing with heterogeneity typically assume centralized training, parameter sharing, and agent indexing. To overcome these limitations, we propose the CoHet algorithm, which utilizes a novel Graph Neural Network (GNN) based intrinsic motivation to facilitate the learning of heterogeneous agent policies in decentralized settings, under the challenges of partial observability and reward sparsity. Evaluation of CoHet in the Multi-agent Particle Environment (MPE) and Vectorized Multi-Agent Simulator (VMAS) benchmarks demonstrates superior performance compared to the state-of-the-art in a range of cooperative multi-agent scenarios. Our research is supplemented by an analysis of the impact of the agent dynamics model on the intrinsic motivation module, insights into the performance of different CoHet variants, and its robustness to an increasing number of heterogeneous agents.
comment: Full paper version for AAMAS 2025, 9 pages, 5 figures
Characterizations of voting rules based on majority margins
In the context of voting with ranked ballots, an important class of voting rules is the class of margin-based rules (also called pairwise rules). A voting rule is margin-based if whenever two elections generate the same head-to-head margins of victory or loss between candidates, the voting rule yields the same outcome in both elections. Although this is a mathematically natural invariance property to consider, whether it should be regarded as a normative axiom on voting rules is less clear. In this paper, we address this question for voting rules with any kind of output, whether a set of candidates, a ranking, a probability distribution, etc. We prove that a voting rule is margin-based if and only if it satisfies some axioms with clearer normative content. A key axiom is what we call Preferential Equality, stating that if two voters both rank a candidate $x$ immediately above a candidate $y$, then either voter switching to rank $y$ immediately above $x$ will have the same effect on the election outcome as if the other voter made the switch, so each voter's preference for $y$ over $x$ is treated equally.
comment: Updated Fact 3.10
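The margin-based property defined in the abstract above can be made concrete with a short sketch. It assumes each ballot is a strict ranking of all candidates, listed from most to least preferred; the function name and the two example profiles are illustrative and do not come from the paper.

```python
from itertools import permutations


def majority_margins(ballots):
    """Return margin[(x, y)] = (#voters ranking x above y) - (#voters ranking y above x)."""
    candidates = sorted(ballots[0])
    margins = {pair: 0 for pair in permutations(candidates, 2)}
    for ranking in ballots:
        position = {candidate: index for index, candidate in enumerate(ranking)}
        for x, y in margins:
            margins[(x, y)] += 1 if position[x] < position[y] else -1
    return margins


# Two different elections that generate identical head-to-head margins.
profile_a = [["x", "y", "z"], ["z", "y", "x"]]
profile_b = [["y", "x", "z"], ["z", "x", "y"]]

print(majority_margins(profile_a) == majority_margins(profile_b))
```

A margin-based rule sees only the output of `majority_margins`, so it must return the same outcome for `profile_a` and `profile_b`.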
ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System
Large language models (LLMs) are now used to power complex multi-turn agentic workflows. Existing systems run agentic inference by loosely assembling isolated components: an LLM inference engine (e.g., vLLM) and a tool orchestrator (e.g., Kubernetes). Although agentic workflows involve multiple LLM and tool requests, these systems schedule and allocate resources separately on a per-request basis, without end-to-end knowledge of the workflow. This leads to sub-optimal management of KV cache and tool execution environments. To address these challenges, we propose ThunderAgent, a fast, simple, and program-aware agentic inference system. We first abstract agentic workflows as LLM Programs, enabling a unified view of heterogeneous resources, including KV caches, system states, and external tool assets such as disk memory and network ports. Built upon this abstraction, ThunderAgent introduces a program-aware scheduler and a tool resource manager designed to maximize KV cache hit rates, mitigate memory imbalances, and enable asynchronous environment preparation. Evaluations across coding, routing, and scientific discovery agents demonstrate that ThunderAgent achieves 1.5-3.6x throughput improvements in serving, 1.8-3.9x in RL rollout, and up to 4.2x disk memory savings compared to state-of-the-art inference systems. To facilitate reproducibility and support future development, we open-source the full ThunderAgent system implementation at: https://github.com/Agentic-Kinetics/ThunderAgent.
What Do Agents Think One Another Want? Level-2 Inverse Games for Inferring Agents' Estimates of Others' Objectives
Effectively interpreting strategic interactions among multiple agents requires us to infer each agent's objective from limited information. Existing inverse game-theoretic approaches frame this challenge in terms of a "level-1" inference problem, in which we take the perspective of a third-party observer and assume that individual agents share complete knowledge of one another's objectives. However, this assumption breaks down in decentralized, real-world scenarios like urban driving and bargaining, in which agents may act based on conflicting views of one another's objectives. We demonstrate the necessity of inferring agents' different estimates of each other's objectives through empirical examples, and by theoretically characterizing the prediction error of level-1 inference on fictitious gameplay data from linear-quadratic games. To address this fundamental issue, we propose a framework for level-2 inference to address the question: "What does each agent believe about other agents' objectives?" We prove that the level-2 inference problem is non-convex even in benign settings like linear-quadratic games, and we develop an efficient gradient-based approach for identifying local solutions. Experiments on a synthetic urban driving example show that our approach uncovers nuanced misalignments that level-1 methods miss.
comment: 6 pages + appendix with supplements
The Coordination Gap: Alternation Metrics for Temporal Dynamics in Multi-Agent Battle of the Exes
Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind (they cannot distinguish structured alternation from monopolistic or random access patterns) and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation, a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
comment: 38 pages, 5 figures, 4 tables, 1 supplementary pdf. Submitted to Mathematical Social Sciences
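The claim that outcome-based metrics are temporally blind can be illustrated with a toy example. In the sketch below, each character records which agent obtained the resource in a round. `min_max_fairness` is a plain min/max ratio of win counts, and `switch_rate` is a hypothetical stand-in for a temporally sensitive observable; it is not one of the six ALT metrics proposed in the paper.

```python
from collections import Counter


def min_max_fairness(winners: str) -> float:
    """Ratio of the smallest to the largest win count among agents that appear."""
    counts = Counter(winners)
    return min(counts.values()) / max(counts.values())


def switch_rate(winners: str) -> float:
    """Fraction of consecutive rounds in which the winning agent changes."""
    switches = sum(a != b for a, b in zip(winners, winners[1:]))
    return switches / (len(winners) - 1)


perfect_alternation = "ABABAB"
blocked_access = "AAABBB"  # same totals, but each agent monopolizes a block

for sequence in (perfect_alternation, blocked_access):
    print(sequence, min_max_fairness(sequence), switch_rate(sequence))
```

Both sequences receive the same fairness score, while the order-sensitive measure separates them.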
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Ensuring reliable data-driven decisions is crucial in domains where analytical accuracy directly impacts safety, compliance, or operational outcomes. Decision support in such domains relies on large tabular datasets, where manual analysis is slow, costly, and error-prone. While Large Language Models (LLMs) offer promising automation potential, they face challenges in analytical reasoning, structured data handling, and ambiguity resolution. This paper introduces GateLens, an LLM-based architecture for reliable analysis of complex tabular data. Its key innovation is the use of Relational Algebra (RA) as a formal intermediate representation between natural-language reasoning and executable code, addressing the reasoning-to-code gap that can arise in direct generation approaches. In our automotive instantiation, GateLens translates natural language queries into RA expressions and generates optimized Python code. Unlike traditional multi-agent or planning-based systems that can be slow, opaque, and costly to maintain, GateLens emphasizes speed, transparency, and reliability. We validate the architecture in automotive software release analytics, where experimental results show that GateLens outperforms the existing Chain-of-Thought (CoT) + Self-Consistency (SC) based system on real-world datasets, particularly in handling complex and ambiguous queries. Ablation studies confirm the essential role of the RA layer. Industrial deployment demonstrates over 80% reduction in analysis time while maintaining high accuracy across domain-specific tasks. GateLens operates effectively in zero-shot settings without requiring few-shot examples or agent orchestration. This work advances deployable LLM system design by identifying key architectural features--intermediate formal representations, execution efficiency, and low configuration overhead--crucial for domain-specific analytical applications.
Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions
From autonomous driving to package delivery, ensuring safe yet efficient multi-agent interaction is challenging as the interaction dynamics are influenced by hard-to-model factors such as social norms and contextual cues. Understanding these influences can aid in the design and evaluation of socially-aware autonomous agents whose behaviors are aligned with human values. In this work, we seek to codify factors governing safe multi-agent interactions via the lens of responsibility, i.e., an agent's willingness to deviate from their desired control to accommodate safe interaction with others. Specifically, we propose a data-driven modeling approach based on control barrier functions and differentiable optimization that efficiently learns agents' responsibility allocation from data. We demonstrate on synthetic and real-world datasets that we can obtain an interpretable and quantitative understanding of how much agents adjust their behavior to ensure the safety of others given their current environment.
comment: 8 pages, 7 figures
Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning (Extended Version) ICAPS 2026
Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single-agent MCTS, applying it in multi-agent systems poses unique challenges. CB-MCTS is the first to address this. We analyze CB-MCTS in the simple-regret setting and show in simulations that it outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi-agent planning.
comment: To appear in ICAPS 2026
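As a point of reference for the Boltzmann policy mentioned above, the sketch below shows generic Boltzmann (softmax) action selection over value estimates. It is a minimal illustration with assumed function names and a fixed temperature; it omits the decaying entropy bonus and the multi-agent coordination that CB-MCTS adds, whose exact form the abstract does not give.

```python
import math
import random


def boltzmann_probabilities(values, temperature):
    """Softmax over value estimates; higher temperature means more exploration."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    highest = max(values)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp((value - highest) / temperature) for value in values]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_action(values, temperature, rng):
    """Draw an action index at random according to the Boltzmann distribution."""
    probabilities = boltzmann_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probabilities, k=1)[0]


rng = random.Random(0)
estimates = [0.0, 1.0, 2.0]  # hypothetical child-node value estimates
print(boltzmann_probabilities(estimates, temperature=1.0))
print(sample_action(estimates, temperature=1.0, rng=rng))
```

Unlike a deterministic arg-max rule such as UCT, every action keeps a non-zero selection probability, so a branch with a low early estimate can still be revisited.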
Systems and Control (EESS)
Embedded Model Predictive Control for EMS-type Maglev Vehicles
Current developments of high-speed magnetic levitation technology using the principle of electromagnetic suspension (EMS) focus on reaching vehicle speeds of more than 600 km/h. With increasing vehicle speeds, however, updated control algorithms need to be investigated to reliably stabilize the system and meet the demands in terms of ride comfort. This article examines the modern and popular approach of model predictive control and its application to the magnetic levitation control system. The key aspects investigated are the parameterization of the model predictive controller and its implementation on embedded, resource-constrained hardware. The results reveal that model predictive control is capable of robustly stabilizing the highly nonlinear and constrained system even at very high speeds. Furthermore, processor-in-the-loop studies are carried out to validate the designed control algorithms on a microcontroller.
Constrained finite-time stabilization by model predictive control: an infinite control horizon framework
Existing results on finite-time model predictive control (MPC) often rely on a terminal equality constraint, switching inside a one-step region, or a terminal cost with a short control horizon, leading to limited initial feasibility. This paper proposes an infinite-horizon MPC framework for the constrained finite-time stabilization of discrete-time systems, overcoming these limitations. The proposed framework is built upon a terminal cost strategy, but extends it by replacing the short-horizon terminal cost with the sum of stage costs over an infinite control horizon. This design choice significantly enlarges the initial feasibility region and avoids the need for terminal equality constraints or switching strategies during implementation. It is proved that the proposed finite-time MPC guarantees finite-time stabilization performance once the state trajectory enters the predefined terminal set. The infinite-horizon finite-time MPC is shown to be equivalently implementable as a finite-horizon MPC with a terminal cost, thereby ensuring computational tractability. The proposed finite-time MPC is systematically extended and shown to be applicable to both constrained multi-input linear systems and a class of constrained nonlinear systems that are feedback linearizable.
comment: 10 pages, 5 figures
A Variational Latent Equilibrium for Learning in Cortex
Brains remain unrivaled in their ability to recognize and generate complex spatiotemporal patterns. While AI is able to reproduce some of these capabilities, deep learning algorithms remain largely at odds with our current understanding of brain circuitry and dynamics. This is prominently the case for backpropagation through time (BPTT), the go-to algorithm for learning complex temporal dependencies. In this work we propose a general formalism to approximate BPTT in a controlled, biologically plausible manner. Our approach builds on, unifies and extends several previous approaches to local, time-continuous, phase-free spatiotemporal credit assignment based on principles of energy conservation and extremal action. Our starting point is a prospective energy function of neuronal states, from which we calculate real-time error dynamics for time-continuous neuronal networks. In the general case, this provides a simple and straightforward derivation of the adjoint method result for neuronal networks, the time-continuous equivalent to BPTT. With a few modifications, we can turn this into a fully local (in space and time) set of equations for neuron and synapse dynamics. Our theory provides a rigorous framework for spatiotemporal deep learning in the brain, while simultaneously suggesting a blueprint for physical circuits capable of carrying out these computations. These results reframe and extend the recently proposed Generalized Latent Equilibrium (GLE) model.
Vector-field guided constraint-following control for path following of uncertain mechanical systems
This note proposes a general control approach, called vector-field guided constraint-following control, to solve the dynamics control problem of geometric path-following for a class of uncertain mechanical systems. More specifically, it operates at the dynamics level and can handle both fully-actuated and underactuated mechanical systems, heterogeneous (possibly fast) time-varying uncertainties with unknown bounds, and geometric desired paths that may be self-intersecting. Simulations are conducted to demonstrate the effectiveness of the approach.
Fairness in Robust Unit Commitment Problem Considering Suppression of Renewable Energy
Power company operators make power generation plans one day in advance, in what is known as the Unit Commitment (UC) problem. UC is exposed to uncertainties, such as unknown electricity load and disturbances caused by renewable energy sources, especially PVs. In previous research, we proposed the Renewable Energy Robust Optimization Problem (RE-RP), which addresses these uncertainties by considering suppression. In this paper, we propose a new model called RE-RP with fairness (RE-RPfair), which aims to achieve a fair allocation among PVs. This model is an extension of the original RE-RP, and we demonstrate its effectiveness through simulation. To measure the degree of fairness, we use the Gini Index, which is well known in social science.
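The Gini Index used above as a fairness measure can be computed directly from per-unit quantities. The sketch below uses the standard mean-absolute-difference definition; the function name and the sample suppression amounts are hypothetical, and the abstract does not say which quantity the paper applies the index to.

```python
def gini_index(values):
    """Gini index: 0 means perfectly equal, values near 1 mean highly unequal."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs, normalized by 2 * n^2 * mean.
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * total)


# Hypothetical suppression amounts for four PV units.
equal_suppression = [1.0, 1.0, 1.0, 1.0]
concentrated_suppression = [0.0, 0.0, 0.0, 1.0]

print(gini_index(equal_suppression))
print(gini_index(concentrated_suppression))
```

The double loop is quadratic in the number of units, which is acceptable for a sketch; a sort-based formula would scale better for large fleets.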
System-wide Dynamic Performance Metric for IBR-based Power Networks
In power networks based on Inverter-Based Resources (IBRs), fast controllers cause frequency and voltage dynamics to overlap. Thus, it becomes critical to assess the overall dynamic performance of such networks through a combined system-wide metric. This letter presents a unified metric designed to evaluate dynamic performance in such cases. The proposed metric consists of a weighted sum of local voltage phasor variations at each bus, where the weights are the complex powers injected at the buses. The proposed metric is further decomposed into device-driven and network-driven components, enabling a more comprehensive assessment of grid dynamics. A case study based on a modified version of the IEEE 39-bus system is presented, in which synchronous machines are replaced by inverter-based resources. A sensitivity analysis of the R/X ratio is utilized to evaluate the metric in conventional grids, as well as in those characterized by strong voltage-frequency coupling with complex power flows.
Distributionally robust two-stage model predictive control: adaptive constraint tightening with stability guarantee
Model Predictive Control (MPC) is widely recognized for its ability to explicitly handle system constraints. In practice, system states are often affected by disturbances with unknown distributions. While robust MPC guarantees constraint satisfaction under worst-case scenarios, it tends to be overly conservative. Stochastic MPC balances conservatism and performance but relies on precise knowledge of the disturbance distribution, which is often unavailable. To address this challenge, this paper introduces Distributionally Robust Optimization (DRO) into the MPC framework and proposes a novel Two-Stage Distributionally Robust MPC (TSDR-MPC) scheme. The key innovation lies in formulating constraint violation penalties as a second-stage optimization problem, which, combined with the first-stage quadratic cost, constitutes a two-stage distributionally robust program. This structure enables adaptive constraint tightening against disturbances with unknown time-varying means and covariances. Utilizing a Wasserstein ambiguity set, we derive a tractable reformulation via strong duality and develop a cutting-plane algorithm that converges in a finite number of iterations, suitable for real-time implementation. To ensure closed-loop stability even under non-zero mean disturbances, we introduce a terminal constraint applied solely to the nominal system; this constraint is proportional to the current state and independent of distributional uncertainty, thus preserving overall feasibility. We provide rigorous theoretical guarantees, including recursive feasibility, finite-time algorithm termination, and an asymptotic performance bound on the average closed-loop cost. Numerical simulations validate the adaptability and robustness of the proposed framework under various disturbance scenarios.
Existence and Design of Functional Observers for Time-Delay Systems with Delayed Output Measurements
This paper investigates the problem of functional state estimation for linear time-delay systems in which the delay affecting the state evolution differs from the delay affecting the output measurements. While existing observer designs typically assume instantaneous output availability, practical systems often exhibit measurement delays that are distinct from and not aligned with the intrinsic state delay. We explicitly distinguish between the state delay $τ$ and the measurement delay $h$ and address the problem of estimating a desired functional $z(t)=Fx(t)$ under such mismatched delay conditions. Three functional observer structures are proposed to accommodate different delay configurations, each capable of realizing functional observers of different orders. This flexibility is important since a functional observer whose order equals the number of estimated functionals may not always exist. For each structure, algebraic existence conditions are established together with constructive synthesis procedures. A functional augmentation framework is developed to derive verifiable rank-based conditions for observers of various orders. In addition, the notion of generalized functionals, defined over an augmented delayed state vector, is introduced to provide greater flexibility in satisfying observer existence conditions and facilitating systematic design. Numerical examples illustrate the proposed theory.
comment: Submitted to a journal
Amplitude Dependent Bode Diagrams via Scaled Relative Graphs
Scaled Relative Graphs (SRGs) provide an intuitive graphical frequency-domain method for the analysis of Nonlinear (NL) systems, generalizing the Nyquist diagram. In this paper, we develop a method for computing $L_2$-gain bounds for Lur'e systems over bounded frequency and amplitude ranges. We do this by restricting the input space of the SRG both in frequency and energy content, and combining this restriction with methods from Sobolev theory. The resulting gain bounds over restricted sets of inputs are less conservative than bounds computed over all of $L_2$, and yield a three-dimensional NL generalization of the Bode diagram, plotting $L_2$-gain as a function of both input frequency and energy content. In the zero-energy limit, the Linear Time-Invariant (LTI) Bode diagram is recovered, and in the infinite-energy, zero-frequency limit, we recover the $L_2$-gain. The effectiveness of our method is demonstrated on an example that resembles Phase-Locked Loop dynamics.
comment: Submitted for Conference on Decision and Control 2026
Safe or Slow? The Illusion of Thermal Stability Under Reduced-Velocity Nail Intrusion
This study investigates the effects of nail penetration speed on the safety outcomes of large-format automotive lithium-ion pouch cells. Through six controlled tests varying the speed of nail insertion, we observed that lower penetration speeds did not induce thermal runaway; instead, the cells exhibited self-discharge while the nail remained embedded. These findings suggest that penetration speed is a critical factor in the onset of thermal runaway, providing valuable insights for the development of safer battery systems and more effective safety testing protocols.
Optimization-Based Formation Flight on Libration Point Orbits
A model predictive control (MPC) framework is developed for station-keeping in spacecraft formation flight along libration point orbits. At each control period, the MPC policy solves a multi-vehicle optimal control problem (MVOCP) that tracks a reference trajectory, while enforcing path constraints on the relative motion of the formation. The control policy makes use of a limited set of control nodes consistent with operational constraints that allow only a small number of maneuver opportunities per revolution. To promote recursive feasibility, path constraints are progressively tightened across the prediction horizon. An isoperimetric reformulation of the constraints is used to prevent inter-sample violations. The resulting MVOCP is a nonconvex program, which is solved via sequential convex programming. The proposed approach is evaluated in a high-fidelity ephemeris model under realistic uncertainties for a formation along the near-rectilinear halo orbit (NRHO), subject to path constraints on interspacecraft separation and relative Sun phase angle. The results demonstrate maintenance of a spacecraft formation that satisfies the path constraints with cumulative propellant consumption comparable to that of existing methods.
Differentiable Stochastic Traffic Dynamics: Physics-Informed Generative Modelling in Transportation
Macroscopic traffic flow is stochastic, but the physics-informed deep learning methods currently used in transportation literature embed deterministic PDEs and produce point-valued outputs; the stochasticity of the governing dynamics plays no role in the learned representation. This work develops a framework in which the physics constraint itself is distributional and directly derived from stochastic traffic-flow dynamics. Starting from an Ito-type Lighthill-Whitham-Richards model with Brownian forcing, we derive a one-point forward equation for the marginal traffic density at each spatial location. The spatial coupling induced by the conservation law appears as an explicit conditional drift term, which makes the closure requirement transparent. Based on this formulation, we derive an equivalent deterministic Probability Flow ODE that is pointwise evaluable and differentiable once a closure is specified. Incorporating this as a physics constraint, we then propose a score network with an advection-closure module, trainable by denoising score matching together with a Fokker-Planck residual loss. The resulting model targets a data-conditioned density distribution, from which point estimates, credible intervals, and congestion-risk measures can be computed. The framework provides a basis for distributional traffic-state estimation and for stochastic fundamental-diagram analysis in a physics-informed generative setting.
comment: 29 pages
Dynamic Stability Assessment of Grid-Connected Data Centers Powered by Small Modular Reactors
The accelerating growth of computational demand in modern data centers has further heightened the need for power infrastructures that are highly reliable, environmentally sustainable, and capable of supporting grid stability. Small Modular Reactors (SMRs) as a clean source of energy are particularly attractive for next-generation hyperscale data centers with significant electrical and cooling demands. This paper presents a comprehensive dynamic modeling and stability analysis of a grid-connected Integrated Energy System (IES) designed for data center applications. The proposed IES integrates an SMR and a battery energy storage system to jointly supply electricity for computational and cooling load while providing stability support to the main grid. A coupled computational-thermal load model is developed to capture the real-time power demand of the data center, incorporating CPU utilization, cooling efficiency, and ambient temperature effects. The integrated SMR-powered data center model is implemented in PSSE and tested on the IEEE 118-bus system under various fault scenarios. Simulation results demonstrate that the IES substantially enhances voltage and frequency stability compared to a conventionally grid-connected data center, minimizing disturbance-induced deviations and improving post-fault recovery.
Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges
Emerging generative world models and vision-language-action (VLA) systems are rapidly reshaping automated driving by enabling scalable simulation, long-horizon forecasting, and capability-rich decision making. Across these directions, latent representations serve as the central computational substrate: they compress high-dimensional multi-sensor observations, enable temporally coherent rollouts, and provide interfaces for planning, reasoning, and controllable generation. This paper proposes a unifying latent-space framework that synthesizes recent progress in world models for automated driving. The framework organizes the design space by the target and form of latent representations (latent worlds, latent actions, latent generators; continuous states, discrete tokens, and hybrids) and by structural priors for geometry, topology, and semantics. Building on this taxonomy, the paper articulates five cross-cutting internal mechanics (i.e., structural isomorphism, long-horizon temporal stability, semantic and reasoning alignment, value-aligned objectives and post-training, as well as adaptive computation and deliberation) and connects these design choices to robustness, generalization, and deployability. The work also proposes concrete evaluation prescriptions, including a closed-loop metric suite and a resource-aware deliberation cost, designed to reduce the open-loop / closed-loop mismatch. Finally, the paper identifies actionable research directions toward advancing latent world models for decision-ready, verifiable, and resource-efficient automated driving.
comment: 17 pages, 6 figures, under review by IEEE Transactions on Intelligent Transportation Systems (IEEE-T-ITS)
On the solvability of parameter estimation-based observers for nonlinear systems
Parameter estimation-based observer (PEBO) is a recently developed constructive tool to design state observers for nonlinear systems. It reformulates the state estimation problem as one of online parameter identification, effectively addressing many open estimation challenges in practical applications. The feasibility of a PEBO design relies on two fundamental properties: transformability and identifiability. The former pertains to the existence of an injective solution to a suitable partial differential equation, whereas the latter characterizes the uniqueness of the parameterization induced by the resulting nonlinear regression model. In this paper, we analyze the existence of PEBOs for general nonlinear systems by studying these two properties in detail and by providing sufficient conditions under which they hold.
Avoiding Semi-Infinite Programming in Distributionally Robust Control Based on Mean-Variance Metrics
Conventional stochastic control methods have several limitations. They focus on optimizing the average performance and, in some cases, performance variability; however, their problem settings still require an explicit specification of the probability distributions that determine the system's stochastic behavior. Distributionally robust control (DRC) methods have recently been developed to address these challenges. However, many DRC approaches involve handling infinitely many inequalities. For instance, DRC problems based on the Wasserstein distance are commonly obtained by solving semi-infinite programming (SIP) problems. Our proposed method eliminates the need for SIP when solving discrete-time, discounted, distributionally robust optimal control problems. By introducing a penalty term based on a specific distributional distance, we establish upper bounds, and under appropriate conditions, demonstrate the equivalence between distributionally robust optimization problems and mean-variance minimization problems. This reformulation reduces the original DRC problem to a discounted mean-variance cost optimization problem. In linear-quadratic regulator settings, the corresponding control laws are obtained by solving the Riccati equation. Numerical experiments demonstrate that the theoretical maximum value of the discounted cumulative cost for the proposed method is lower than that for the conventional method.
comment: 6 pages, 1 figure, This paper is submitted to the IEEE for possible publication
Optimal Control Synthesis of Closed-Loop Recommendation Systems over Social Networks
This paper addresses the problem of designing recommendation systems for social networks and e-commerce platforms from a control-theoretic perspective. We treat the design of recommendation systems as a state-feedback infinite-horizon optimal control problem with a performance index that (i) rewards alignment and engagement, (ii) penalizes polarization and large deviations from an uncontrolled baseline, and (iii) regularizes exposure across neighboring users. The recommendation entries are fed to the platform users, who are assumed to follow a networked, multi-topic, continuous-time opinion dynamics. We show that the designed control yields a stabilizing recommendation system under simple algebraic spectral conditions on the weights that encode the platform's preference for engagement, stability of preferences, polarization, and cross-user diversity. Conversely, we show that when ill-posed weights are selected in the optimal control problem (namely, when engagement is excessively rewarded), the closed-loop system can exhibit destabilizing, pathological behaviors that conflict with the design objectives.
High-Fidelity Digital Twin Dataset Generation for Inverter-Based Microgrids Under Multi-Scenario Disturbances
Public power-system datasets often lack electromagnetic transient (EMT) waveforms, inverter control dynamics, and diverse disturbance coverage, which limits their usefulness for training surrogate models and studying cyber-physical behavior in inverter-based microgrids. This paper presents a high-fidelity digital twin dataset generated from a MATLAB/Simulink EMT model of a low-voltage AC microgrid with ten inverter-based distributed generators. The dataset records synchronized three-phase PCC voltages and currents, per-DG active power, reactive power, and frequency, together with embedded scenario labels, producing 38 aligned channels sampled at $Δt = 2~μ$s over $T = 1$~s ($N = 500{,}001$ samples) per scenario. Eleven operating and disturbance scenarios are included: normal operation, load step, voltage sag (temporary three-phase fault), load ramp, frequency ramp, DG trip, tie-line trip, reactive power step, single-line-to-ground faults, measurement noise injection, and communication delay. To ensure numerical stability without altering sequence length, invalid samples (NaN, Inf, and extreme outliers) are repaired using linear interpolation. Each scenario is further validated using system-level evidence from mean frequency, PCC voltage magnitude, total active power, voltage unbalance, and zero-sequence current to confirm physical observability and correct timing. The resulting dataset provides a consistent, labeled EMT benchmark for surrogate modeling, disturbance classification, robustness testing under noise and delay, and cyber-physical resilience analysis in inverter-dominated microgrids. The dataset and processing scripts will be released upon acceptance.
comment: 12 pages
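The sample count quoted in the abstract above can be checked with a short calculation. This is only a sketch: it assumes that both endpoints of the window (t = 0 and t = T) are recorded, which is consistent with N = 500,001 but is not stated in the abstract. The variable names are illustrative.

```python
from fractions import Fraction

# Values quoted in the abstract: 2 microsecond step over a 1 s window, 38 channels.
dt = Fraction(2, 1_000_000)  # sampling step in seconds
T = Fraction(1)              # window length in seconds
n_channels = 38

# Assumption: samples are taken at t = 0, dt, 2*dt, ..., T (both endpoints kept).
n_samples = int(T / dt) + 1

# Implied sampling rate and number of scalar values per scenario.
sample_rate_hz = int(1 / dt)
values_per_scenario = n_channels * n_samples
```

Exact rational arithmetic is used here so that the division of 1 s by 2 μs is not affected by floating-point rounding.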
Over-the-Air Consensus-based Formation Control of Heterogeneous Agents: Communication-Rate and Geometry-Aware Convergence Guarantees
This paper investigates the formation control problem of heterogeneous, autonomous agents that communicate over a wireless multiple access channel. Instead of avoiding interference through orthogonal node-to-node transmissions, we exploit the superposition property of the wireless channel to compute, at each receiver, normalized convex combinations of simultaneously broadcast neighbor signals. At every communication instant, agents update their reference positions from these aggregates, and track the references in continuous time between updates. The only assumption on the agent dynamics is that each agent tracks constant reference positions exponentially, which accommodates a broad class of platforms. Under this assumption, we analyze the resulting jump-flow system under time-varying communication graphs and unknown channel coefficients. We derive a communication-rate based sufficient condition that guarantees convergence to a prescribed formation. We then provide a geometry-aware refinement showing how favorable tracking transients can relax the required condition. Simulations with unicycle agents illustrate the theoretical results and demonstrate a substantial reduction in the number of required orthogonal transmissions compared to interference-avoiding node-to-node communication protocols.
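The idea of computing a normalized convex combination from superimposed broadcasts, as described in the abstract above, can be sketched as follows. The two-broadcast scheme (each neighbor sends its value and a constant pilot through the same unknown positive channel gain) is an assumption made for illustration; the abstract does not specify the transmission scheme, and `ota_convex_combination` is a hypothetical name.

```python
def ota_convex_combination(positions, channel_gains):
    """Return sum_j h_j * x_j / sum_j h_j for neighbor positions x_j.

    Assumption (not from the abstract): every neighbor j broadcasts its
    position and a constant pilot 1, both scaled by the same unknown
    positive real channel gain h_j, and the receiver only sees the sums.
    """
    if not positions or len(positions) != len(channel_gains):
        raise ValueError("need one gain per neighbor position")
    if any(h <= 0 for h in channel_gains):
        raise ValueError("channel gains are assumed strictly positive")

    dim = len(positions[0])
    # Superposition of the value broadcasts, one sum per coordinate.
    received_signal = [
        sum(h * p[k] for h, p in zip(channel_gains, positions))
        for k in range(dim)
    ]
    # Superposition of the pilot broadcasts.
    received_pilot = sum(channel_gains)
    # The ratio has weights h_j / sum(h) that are positive and sum to one,
    # so the receiver never needs to know the individual gains.
    return [s / received_pilot for s in received_signal]


equal_gain = ota_convex_combination(
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]], [1.0, 1.0, 1.0]
)
same_point = ota_convex_combination([[1.0, 2.0], [1.0, 2.0]], [0.3, 0.9])
```

With equal gains the result is the plain average of the neighbor positions; with unequal gains it remains inside their convex hull.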
A neural operator for predicting vibration frequency response curves from limited data
In the design of engineered components, rigorous vibration testing is essential for performance validation and identification of resonant frequencies and amplitudes encountered during operation. Performing this evaluation numerically via machine learning has great potential to accelerate design iteration and make testing workflows more efficient. However, dynamical systems are conventionally difficult to solve via machine learning methods without using physics-based regularizing loss functions. To properly perform this forecasting task, a structure that has an inspectable physical obedience can be devised without the use of regularizing terms from first principles. The method employed in this work is a neural operator integrated with an implicit numerical scheme. This architecture enables operators to learn the underlying state-space dynamics from limited data, allowing generalization to untested driving frequencies and initial conditions. This network can infer the system's global frequency response by training on a small set of input conditions. As a foundational proof of concept, this investigation verifies the machine learning algorithm with a linear, single-degree-of-freedom system, demonstrating implicit obedience of dynamics. This approach demonstrates 99.87% accuracy in predicting the Frequency Response Curve (FRC), forecasting the frequency and amplitude of linear resonance when trained on 7% of the bandwidth of the solution. By training machine learning models to internalize physics information rather than trajectories, better generalization accuracy can be realized, vastly improving the timeframe for vibration studies on engineered components.
Data-Driven Successive Linearization for Optimal Voltage Control
Power distribution systems are increasingly exposed to large voltage fluctuations driven by intermittent solar photovoltaic generation and rapidly varying loads (e.g., electric vehicles and storage). To address this challenge, a number of advanced controllers have been proposed for voltage regulation. However, these controllers typically rely on fixed linear approximations of voltage dynamics. As a result, the solutions may become infeasible when applied to the actual voltage behavior governed by nonlinear power flow equations, particularly under heavy power injection from distributed energy resources. This paper proposes a data-driven successive linearization approach for voltage control under nonlinear power flow constraints. By leveraging the fact that the deviation between the nonlinear power flow solution and its linearization is bounded by the distance from the operating point, we perform data-driven linearization around the most recent operating point. Convergence of the proposed method to a neighborhood of KKT points is established by exploiting the convexity of the objective function and the structural properties of the nonlinear constraints. Case studies show that the proposed approach achieves fast convergence and adapts quickly to changes in net load.
On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $\pmean \to \qmean$, that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $μ$P scaling~\cite{yang2021tensor} as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that \textrm{Muon} can suffer an $\mathcal{O}(\sqrt{w})$ worst-case growth in the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based only on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.
Towards Flexible Spectrum Access: Data-Driven Insights into Spectrum Demand
In the diverse landscape of 6G networks, where wireless connectivity demands surge and spectrum resources remain limited, flexible spectrum access becomes paramount. The success of crafting such schemes hinges on our ability to accurately characterize spectrum demand patterns across space and time. This paper presents a data-driven methodology for estimating spectrum demand variations over space and identifying key drivers of these variations in the mobile broadband landscape. By leveraging geospatial analytics and machine learning, the methodology is applied to a case study in Canada to estimate spectrum demand dynamics in urban regions. Our proposed model captures 70% of the variability in spectrum demand when trained on one urban area and tested on another. These insights empower regulators to navigate the complexities of 6G networks and devise effective policies to meet future network demands.
comment: 7 pages, 5 figures. Presented at IEEE VTC 2024, Washington, DC. Published in the IEEE conference proceedings
Emergency Locator Transmitters in the Era of More Electric Aircraft: A Comprehensive Review of Energy, Integration and Safety Challenges
The progressive electrification of aircraft systems under the more electric aircraft (MEA) paradigm is reshaping the design and qualification constraints of safety-critical avionics. Emergency locator transmitters (ELTs), which are essential for post-accident localization and search and rescue (SAR) operations, have evolved from legacy 121.5/243 MHz beacons to digitally encoded 406 MHz systems, typically retaining 121.5 MHz as a homing signal in combined units. In parallel, the modernization of the Cospas-Sarsat infrastructure, especially MEOSAR, together with multi-constellation global navigation satellite system (GNSS) integration and second-generation beacon capabilities, is reducing detection latency and enabling richer distress messaging. However, MEA platforms impose stricter constraints on available power, thermal management, wiring density, and electromagnetic compatibility (EMC). As a result, ELT performance increasingly depends not only on the device itself, but also on its installation conditions and on the aircraft's overall electrical environment. This review summarizes the ELT architectures and activation/operational cycles, outlines key technological milestones, and consolidates the main integration challenges for MEA, with emphasis on energy autonomy, battery qualification frameworks, EMC and installation practices, and survivability-driven failure modes (e.g., antenna/feedline damage, mounting, and post-impact shielding). Finally, emerging trends, including ELT for distress tracking (DT), energy-based designs, advanced health monitoring, and certification-ready pathways for next-generation SAR services, are discussed, highlighting research directions that can deliver demonstrable, certifiable gains in reliability, energy efficiency, and robust integration for future electrified aircraft.
AI-Enabled Data-driven Intelligence for Spectrum Demand Estimation
Accurately forecasting spectrum demand is a key component for efficient spectrum resource allocation and management. With the rapid growth in demand for wireless services, mobile network operators and regulators face increasing challenges in ensuring adequate spectrum availability. This paper presents a data-driven approach leveraging artificial intelligence (AI) and machine learning (ML) to estimate and manage spectrum demand. The approach uses multiple proxies of spectrum demand, drawn from site license data and derived from crowdsourced data. These proxies are validated against real-world mobile network traffic data to ensure reliability, achieving an R$^2$ value of 0.89 for an enhanced proxy. The proposed ML models are tested and validated across five major Canadian cities, demonstrating their generalizability and robustness. These contributions assist spectrum regulators in dynamic spectrum planning, enabling better resource allocation and policy adjustments to meet future network demands.
comment: Presented at an IEEE ICC 2025 Workshop and published in the conference proceedings
NanoBench: A Multi-Task Benchmark Dataset for Nano-Quadrotor System Identification, Control, and State Estimation
Existing aerial-robotics benchmarks target vehicles from hundreds of grams to several kilograms and typically expose only high-level state data. They omit the actuator-level signals required to study nano-scale quadrotors, where low-Reynolds number aerodynamics, coreless DC motor nonlinearities, and severe computational constraints invalidate models and controllers developed for larger vehicles. We introduce NanoBench, an open-source multi-task benchmark collected on the commercially available Crazyflie 2.1 nano-quadrotor (takeoff weight 27 g) in a Vicon motion capture arena. The dataset contains over 170 flight trajectories spanning hover, multi-frequency excitation, standard tracking, and aggressive maneuvers across multiple speed regimes. Each trajectory provides synchronized Vicon ground truth, raw IMU data, onboard extended Kalman filter estimates, PID controller internals, and motor PWM commands at 100 Hz, alongside battery telemetry at 10 Hz, aligned with sub-0.5 ms consistency. NanoBench defines standardized evaluation protocols, train/test splits, and open-source baselines for three tasks: nonlinear system identification, closed-loop controller benchmarking, and onboard state estimation assessment. To our knowledge, it is the first public dataset to jointly provide actuator commands, controller internals, and estimator outputs with millimeter-accurate ground truth on a commercially available nano-scale aerial platform.
comment: 9 pages, 6 figures
Dynamic Average Consensus with Privacy Guarantees and Its Application to Battery Energy Storage Systems
A privacy-preserving dynamic average consensus (DAC) algorithm is proposed that achieves consensus while preventing external eavesdroppers from inferring the reference signals and their derivatives. During the initialization phase, each agent generates a set of sinusoidal signals with randomly selected frequencies and exchanges them with its neighboring agents to construct a masking signal. Each agent masks its reference signals using this composite masking signal before executing the DAC update rule. It is shown that the developed scheme preserves the convergence properties of the conventional DAC framework while preventing information leakage to external eavesdroppers. Furthermore, the developed algorithm is applied to state-of-charge (SoC) balancing in a networked battery energy storage system to demonstrate its practical applicability. Simulation results validate the theoretical findings.
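One way a composite masking signal built from exchanged sinusoids can hide individual references without changing their average is sketched below. The specific construction (agent i adds every sinusoid it generated and subtracts every sinusoid it received, so the masks cancel across the network) is an assumption made for illustration and is not taken from the abstract. The name `make_masks`, the amplitude and frequency ranges, and the three-agent ring are all hypothetical.

```python
import math
import random


def make_masks(edges, rng):
    """Build a mask function mask(i, t) for agents on an undirected graph.

    Assumed construction: for each edge, both endpoints draw a private
    sinusoid (amplitude, frequency, phase) and send it to the other endpoint.
    """
    params = {}
    for i, j in edges:
        for sender, receiver in ((i, j), (j, i)):
            params[(sender, receiver)] = (
                rng.uniform(0.5, 2.0),          # amplitude
                rng.uniform(0.1, 5.0),          # frequency in Hz
                rng.uniform(0.0, 2 * math.pi),  # phase in rad
            )

    def sinusoid(key, t):
        amp, freq, phase = params[key]
        return amp * math.sin(2 * math.pi * freq * t + phase)

    def mask(i, t):
        # Own generated signals minus the signals received from neighbors.
        total = 0.0
        for sender, receiver in params:
            if sender == i:
                total += sinusoid((sender, receiver), t)
                total -= sinusoid((receiver, sender), t)
        return total

    return mask


rng = random.Random(0)
mask = make_masks([(0, 1), (1, 2), (2, 0)], rng)
references = [1.0, 4.0, 7.0]  # illustrative constant reference signals
t = 0.37
masked = [r + mask(i, t) for i, r in enumerate(references)]
```

Because each exchanged sinusoid enters one agent's mask with a plus sign and its neighbor's mask with a minus sign, the masks sum to zero at every time instant, so the average of the masked references equals the average of the true references.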
A Survey on Cloud-Based 6G Deployments: Current Solutions, Future Directions and Open Challenges
The next generation of cellular networks is designed to provide ubiquitous connectivity to a wide range of devices. As Telecommunication Service Providers (TSPs) increasingly collaborate with public cloud providers to deploy 5G and beyond networks, a fundamental shift is underway, from hardware-bound Physical Network Functions (PNFs) to cloud-native, containerized deployments managed through platforms like Kubernetes. While this transition promises greater scalability, flexibility, and cost efficiency, it also introduces a complex set of technical and operational challenges that must be thoroughly understood before large-scale cellular deployments can take place in cloud environments. In this survey, we present a structured taxonomy that categorizes the design space of cloud-based cellular deployments across four dimensions: deployment architecture, resource management and orchestration, multi-tenancy and isolation, and economic and ownership models. Using this taxonomy as a foundation, we critically analyze six key investigation areas, security and privacy, scalability and elasticity, performance and latency, cost optimization, resilience and fault management, and compliance and sovereignty, examining each through a cloud-native lens. To benchmark the state of industry adoption, we examine the deployment strategies of leading Infrastructure-as-a-Service (IaaS) providers, namely Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Finally, we identify emerging trends such as AI-driven orchestration, quantum-safe protocols for virtualized network functions, and serverless networking for 6G, while articulating the open challenges that remain in realizing robust, scalable cloud-based cellular networks.
comment: 47 pages, 403 citations, 21 figures, journal
Field Free Novel Architecture for Spintronic Flash Analog to Digital Converter
A 3-bit Analog-to-Digital Converter (ADC) is designed using perpendicular Spin Orbit Torque Magnetic Tunnel Junctions (SOT MTJs). A sampled analog input signal is transmitted as a spin orbit torque current (Iin) to a perpendicular SOT MTJ, and deterministic switching is supported by the Voltage Controlled Magnetic Anisotropy (VCMA) and Spin Transfer Torque (STT) switching methods. Analog-to-digital conversion is performed by comparing the input signal with the differing critical currents of the SOT MTJs. The critical current of each SOT MTJ is governed by the width of its Heavy Metal (HM) layer. The 3-bit ADC contains two sets of 7 SOT MTJs for quantizing the input value: a conversion set and a dummy set used for comparing the change in resistance state. As the input signal passes through the conversion set, each SOT MTJ switches from the Parallel (P) to the AntiParallel (AP) state if the input signal exceeds its critical current. The change in state of the conversion set is converted to thermometer codes by a StrongARM latch comparator, which compares the resistance with that of the dummy-set SOT MTJs, all of which are in the P (low-resistance) state. A novel architecture is proposed to increase throughput by using the dummy set as a conversion set and the conversion set as a dummy set, thus eliminating the reset step from analog-to-digital conversion. With improved SOT-MTJ and timing blocks, the resulting field-free spin flash ADC has a power consumption of 476 uW at a conversion rate of 304.1 MHz.
comment: 9 pages including 2 pages of references, 11 figures and 2 tables. Invited and presented at conference (ICMAGMA, 2024)
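The quantization step described in the abstract above (seven devices with different critical currents producing a thermometer code) can be sketched behaviorally. This models only the comparison logic, not the device physics, the StrongARM comparator, or the set-swapping architecture. The function name `flash_adc_3bit` and the uniformly spaced critical currents are hypothetical values chosen for illustration.

```python
def flash_adc_3bit(i_in, critical_currents):
    """Behavioral sketch of a 3-bit flash quantizer.

    Each of the 7 elements "switches" (bit = 1) when the input current
    exceeds its critical current, giving a thermometer code. The 3-bit
    output is the number of switched elements.
    """
    if len(critical_currents) != 7:
        raise ValueError("a 3-bit flash ADC uses 2**3 - 1 = 7 thresholds")

    thresholds = sorted(critical_currents)
    # Thermometer code: ones for every threshold below the input.
    thermometer = [1 if i_in > i_c else 0 for i_c in thresholds]
    # For a valid thermometer code, the count of ones is the output code.
    code = sum(thermometer)
    return thermometer, code


# Hypothetical critical currents in arbitrary units (not from the paper).
i_crit = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
thermo_mid, code_mid = flash_adc_3bit(3.5, i_crit)
thermo_low, code_low = flash_adc_3bit(0.5, i_crit)
thermo_high, code_high = flash_adc_3bit(7.5, i_crit)
```

With these illustrative thresholds, an input of 3.5 switches the first three elements and leaves the remaining four unswitched.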
A Graph-Based Approach to Spectrum Demand Prediction Using Hierarchical Attention Networks
The surge in wireless connectivity demand, coupled with the finite nature of spectrum resources, compels the development of efficient spectrum management approaches. Spectrum sharing presents a promising avenue, although it demands precise characterization of spectrum demand for informed policy-making. This paper introduces HR-GAT, a hierarchical resolution graph attention network model, designed to predict spectrum demand using geospatial data. HR-GAT adeptly handles complex spatial demand patterns and resolves issues of spatial autocorrelation that usually challenge standard machine learning models, often resulting in poor generalization. Tested across five major Canadian cities, HR-GAT improves predictive accuracy of spectrum demand by 21% over eight baseline models, underscoring its superior performance and reliability.
comment: 7 pages, 6 figures. Presented at IEEE GLOBECOM 2025, Taiwan. To appear in the conference proceedings
Learning-Augmented Primal-Dual Control Design for Secondary Frequency Regulation
Frequency stability is fundamental to the secure operation of power systems. With growing uncertainty and volatility introduced by renewable generation, secondary frequency regulation must now deliver enhanced performance not only in the steady state but also during transients. This paper presents a systematic framework to embed learning in the design of a primal-dual controller that provides provable (potentially exponential) stability and steady-state optimality, while simultaneously improving key transient metrics, including frequency nadir and control effort, in a data-driven manner. In particular, we employ the primal-dual dynamics of an optimization problem that encodes steady-state objectives to realize secondary frequency control with an asymptotic stability guarantee. To augment the transient performance of the controller via learning, a change of variables on control inputs, which will be deployed by neural networks, is proposed such that under mild conditions, stability and steady-state optimality are preserved. It further allows us to define a learning goal that accounts for the exponential convergence rate, frequency nadir and accumulated control effort, and use sample trajectories to enhance these metrics. Simulation results validate the theory and demonstrate superior transient performance of the learning-augmented primal-dual controller.
Experimental Characterization of Biological Tissue Dielectric Properties through THz Time-Domain Spectroscopy
Terahertz (THz) radiation provides a non-ionizing, highly sensitive probe of the dielectric properties of biological tissues. In this study, we present a comprehensive experimental characterization of dielectric properties using pork skin tissue, a widely used surrogate for human tissue, as a biological sample. Measurements are conducted employing THz time-domain spectroscopy in the 0.1-11 THz frequency range with photoconductive antennas for both signal generation and detection. Frequency-dependent refractive indices, absorption, and complex permittivity are extracted from transmitted time-domain signals. Our results confirm strong absorption and low transmittance at low THz frequencies due to water content, while highlighting frequency-dependent dispersion and narrowband transmission features at higher frequencies. This work provides one of the first extended-frequency datasets of biological tissue dielectric properties, supporting realistic channel modeling for the design and development of intra-body nanosensor networks in the THz band.
comment: To be published in EAI BODYNETS 2025
Lightweight 3D LiDAR-Based UAV Tracking: An Adaptive Extended Kalman Filtering Approach
Accurate relative positioning is crucial for swarm aerial robotics, enabling coordinated flight and collision avoidance. Although vision-based tracking has been extensively studied, 3D LiDAR-based methods remain underutilized despite their robustness under varying lighting conditions. Existing systems often rely on bulky, power-intensive sensors, making them impractical for small UAVs with strict payload and energy constraints. This paper presents a lightweight LiDAR-based UAV tracking system incorporating an Adaptive Extended Kalman Filter (AEKF) framework. Our approach effectively addresses the challenges posed by sparse, noisy, and nonuniform point cloud data generated by non-repetitive scanning 3D LiDARs, ensuring reliable tracking while remaining suitable for small drones with strict payload constraints. Unlike conventional filtering techniques, the proposed method dynamically adjusts the noise covariance matrices using innovation and residual statistics, thereby enhancing tracking accuracy under real-world conditions. Additionally, a recovery mechanism ensures continuity of tracking during temporary detection failures caused by scattered LiDAR returns or occlusions. Experimental validation was performed using a Livox Mid-360 LiDAR mounted on a DJI F550 UAV in real-world flight scenarios. The proposed method demonstrated robust UAV tracking performance under sparse LiDAR returns and intermittent detections, consistently outperforming both standard Kalman filtering and particle filtering approaches during aggressive maneuvers. These results confirm that the framework enables reliable relative positioning in GPS-denied environments without the need for multi-sensor arrays or external infrastructure.
comment: Presented at the 19th International Conference on Intelligent Autonomous Systems, IAS-19, Genoa, Italy, June 30 to July 4, 2025. To appear in the Springer post-proceedings of the conference
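To make the idea of adjusting noise covariances from innovation and residual statistics concrete, here is a minimal sketch. It is an assumption-laden illustration, not the paper's AEKF: it uses a linear constant-velocity model (so the EKF Jacobians reduce to constant matrices), a commonly used exponential-forgetting update, and hypothetical names and values (`adaptive_kf`, `alpha`, the initial `Q` and `R`).

```python
import numpy as np

def adaptive_kf(zs, dt=0.1, alpha=0.9):
    """Kalman filter whose Q and R are re-estimated online (illustrative)."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
    H = np.array([[1.0, 0.0]])              # position-only measurement
    x = np.zeros(2)                         # state: [position, velocity]
    P = np.eye(2)
    Q = 0.01 * np.eye(2)                    # assumed initial process noise
    R = np.array([[1.0]])                   # assumed initial measurement noise
    for z in zs:
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update
        d = np.atleast_1d(z) - H @ x        # innovation (before the update)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ d
        P = (np.eye(2) - K @ H) @ P
        # Adapt: R from the post-update residual, Q from the innovation
        eps = np.atleast_1d(z) - H @ x      # residual (after the update)
        R = alpha * R + (1 - alpha) * (np.outer(eps, eps) + H @ P @ H.T)
        Q = alpha * Q + (1 - alpha) * (K @ np.outer(d, d) @ K.T)
    return x, P, Q, R

# Noise-free positions of a target moving at 1 unit/s, sampled every 0.1 s
zs = [0.1 * k for k in range(1, 101)]
x_est, P_est, Q_est, R_est = adaptive_kf(zs)
```

The forgetting factor `alpha` trades responsiveness against smoothness of the covariance estimates; values closer to 1 adapt more slowly.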
Efficient and robust control with spikes that constrain free energy
Animal brains exhibit remarkable efficiency in perception and action, while being robust to both external and internal perturbations. The means by which brains accomplish this remains, for now, poorly understood, hindering our understanding of animal and human cognition, as well as our own implementation of efficient algorithms for control of dynamical systems. A potential candidate for a robust mechanism of state estimation and action computation is the free energy principle, but existing implementations of this principle have largely relied on conventional, biologically implausible approaches without spikes. We propose a novel, efficient, and robust spiking control framework with realistic biological characteristics. The resulting networks function as free energy constrainers, in which neurons only fire if they reduce the free energy of their internal representation. The networks offer efficient operation through highly sparse activity while matching performance with other similar spiking frameworks, and have high resilience against both external (e.g. sensory noise or collisions) and internal perturbations (e.g. synaptic noise and delays or neuron silencing) that such a network would be faced with when deployed by either an organism or an engineer. Overall, our work provides a novel mathematical account for spiking control through constraining free energy, providing both better insight into how brain networks might leverage their spiking substrate and a new route for implementing efficient control algorithms in neuromorphic hardware.
The Epistemic Support-Point Filter: Jaynesian Maximum Entropy Meets Popperian Falsification
The Epistemic Support-Point Filter (ESPF) was designed around a single epistemological commitment: be quick to embrace ignorance and slow to assert certainty. This paper proves that this commitment has a precise mathematical form and that the ESPF is the unique optimal filter implementing it within the class of epistemically admissible evidence-only filters. The ESPF synthesizes two complementary principles acting at different phases of the recursion. In propagation, it enacts Jaynesian maximum entropy: the support spreads as widely as the dynamics allow, assuming maximal ignorance consistent with known constraints. In the measurement update, it enacts Popperian falsification: hypotheses are eliminated by evidence alone. Any rule incorporating prior possibility is strictly suboptimal and risks race-to-bottom bias. The optimality criterion is possibilistic minimax entropy: among all evidence-only selection rules, minimum-q selection minimizes log det(MVEE), the worst-case possibilistic entropy. Three lemmas establish the result: the Possibilistic Entropy Lemma identifies the ignorance functional; the Possibilistic Cramér-Rao Lemma bounds entropy reduction per measurement; the Evidence-Optimality Lemma proves minimum-q selection is the unique minimizer. The ESPF differs from Bayesian filters by minimizing worst-case epistemic ignorance rather than expected uncertainty. The Kalman filter is recovered in the Gaussian limit. Numerical validation over a 2-day 877-step Smolyak Level-3 orbital tracking run confirms the regime structure under both nominal and stress conditions.
TATIC: Task-Aware Temporal Learning for Human Intent Inference from Physical Corrections in Human-Robot Collaboration
In human-robot collaboration (HRC), robots must adapt online to dynamic task constraints and evolving human intent. While physical corrections provide a natural, low-latency channel for operators to convey motion-level adjustments, extracting task-level semantic intent from such brief interactions remains challenging. Existing foundation-model-based approaches primarily rely on vision and language inputs and lack mechanisms to interpret physical feedback. Meanwhile, traditional physical human-robot interaction (pHRI) methods leverage physical corrections for trajectory guidance but struggle to infer task-level semantics. To bridge this gap, we propose TATIC, a unified framework that utilizes torque-based contact force estimation and a task-aware Temporal Convolutional Network (TCN) to jointly infer discrete task-level intent and estimate continuous motion-level parameters from brief physical corrections. Task-aligned feature canonicalization ensures robust generalization across diverse layouts, while an intent-driven adaptation scheme translates inferred human intent into robot motion adaptations. Experiments achieve a 0.904 Macro-F1 score in intent recognition and demonstrate successful hardware validation in collaborative disassembly (see experimental video at https://youtu.be/xF8A52qwEc8).
DRAFTO: Decoupled Reduced-space and Adaptive Feasibility-repair Trajectory Optimization for Robotic Manipulators
This paper introduces a new algorithm for trajectory optimization, Decoupled Reduced-space and Adaptive Feasibility-repair Trajectory Optimization (DRAFTO). It first constructs a constrained objective that accounts for smoothness, safety, joint limits, and task requirements. Then, it optimizes the coefficients, which are the coordinates of a set of basis functions for trajectory parameterization. To reduce the number of repeated constrained optimizations while handling joint-limit feasibility, the optimization is decoupled into a reduced-space Gauss-Newton (GN) descent for the main iterations and constrained quadratic programming for initialization and terminal feasibility repair. The two-phase acceptance rule with a non-monotone policy is applied to the GN model, which uses a hinge-squared penalty for inequality constraints, to ensure globalizability. The results of our benchmark tests against optimization-based planners, such as CHOMP, TrajOpt, GPMP2, and FACTO, and sampling-based planners, such as RRT-Connect, RRT*, and PRM, validate the high efficiency and reliability across diverse scenarios and tasks. The experiment involving grabbing an object from a drawer further demonstrates the potential for implementation in complex manipulation tasks. The supplemental video is available at https://youtu.be/XisFI37YyTQ.
From Demonstrations to Safe Deployment: Path-Consistent Safety Filtering for Diffusion Policies ICRA 2026
Diffusion policies (DPs) achieve state-of-the-art performance on complex manipulation tasks by learning from large-scale demonstration datasets, often spanning multiple embodiments and environments. However, they cannot guarantee safe behavior, requiring external safety mechanisms. These, however, alter actions in ways unseen during training, causing unpredictable behavior and performance degradation. To address these problems, we propose path-consistent safety filtering (PACS) for DPs. Our approach performs path-consistent braking on a trajectory computed from the sequence of generated actions. In this way, we keep the execution consistent with the training distribution of the policy, maintaining the learned, task-completing behavior. To enable real-time deployment and handle uncertainties, we verify safety using set-based reachability analysis. Our experimental evaluation in simulation and on three challenging real-world human-robot interaction tasks shows that PACS (a) provides formal safety guarantees in dynamic environments, (b) preserves task success rates, and (c) outperforms reactive safety approaches, such as control barrier functions, by up to 68 % in terms of task success. Videos are available at our project website: https://tum-lsy.github.io/pacs.
comment: Accepted to IEEE ICRA 2026. Project page: https://tum-lsy.github.io/pacs/. 8 pages, 4 figures
Continual uncertainty learning
Robust control of mechanical systems with multiple uncertainties remains a fundamental challenge, particularly when nonlinear dynamics and operating-condition variations are intricately intertwined. Although deep reinforcement learning (DRL) combined with domain randomization has shown promise in mitigating the sim-to-real gap, simultaneously handling all the sources of uncertainty often leads to sub-optimal policies and poor learning efficiency. This study formulates a new curriculum-based continual learning framework for robust control problems involving nonlinear dynamical systems in which multiple sources of uncertainty are simultaneously superimposed. The key idea is to decompose a complex control problem with multiple uncertainties into a sequence of continual learning tasks, in which the strategies for handling each uncertainty are acquired sequentially. The original system is extended into a finite set of plants whose dynamic uncertainties are gradually expanded and diversified as learning progresses. The policy is stably updated across the entire plant sets associated with tasks defined by different uncertainty configurations without catastrophic forgetting. To ensure high learning efficiency, we jointly incorporate a model-based controller (MBC), which guarantees a shared baseline performance across the plant sets, into the learning process in order to accelerate the convergence. This residual learning scheme facilitates task-specific optimization of the DRL agent for each uncertainty, thereby enhancing sample efficiency. Finally, this study adopts the proposed method to design an active vibration controller for automotive powertrains as a practical industrial application. We verify that the resulting controller is robust against structural nonlinearities and dynamic variations; thus, it can realize successful sim-to-real transfer.
Hardware test and validation of the angular droop control: Analysis and experiments
We present a hardware-based validation of angular droop control for grid-forming DC/AC converters, a control strategy that establishes active power-to-angle droop. Angular droop control enables exact frequency regulation at steady state, thereby combining primary and secondary control into a single layer. We provide traceable analysis and suggest solutions to the main implementation challenges with angular droop control, specifically addressing the challenges concerning discretization and clock drift in hardware experiments. This is illustrated in two different scenarios. Experimental results from the single converter to load scenario demonstrate black start capability and power-to-angle droop behavior for two different implementation schemes. A multi-converter setup validates frequency synchronization and power-sharing properties, proving the ancillary services that angular droop control provides in the real-world experimental setup.
EMFusion: Conditional Diffusion Framework for Trustworthy Frequency Selective EMF Forecasting in Wireless Networks
The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors (e.g., time of day, season, and holidays) while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates calibrated probabilistic prediction intervals directly from the learned conditional distribution, providing explicit uncertainty quantification essential for trustworthy decision-making. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS) and by 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
comment: Submission for possible publication
On finite-horizon approximation of a feedback Nash equilibrium in LQ games
Dynamic games provide a fundamental framework for multi-agent decision-making over time, yet computing feedback Nash equilibria (FNEs) in infinite-horizon discrete-time linear-quadratic (LQ) settings remains computationally challenging. Motivated by the need for tractable and implementable strategies, this paper studies a finite-horizon strategy for approximating a certain infinite-horizon equilibrium. Specifically, at each stage, each player solves a T-stage game and implements only the first-stage control, thereby avoiding the direct solution of coupled infinite-horizon Riccati equations. We first analyze the finite-horizon game and characterize the structure of the associated coupled generalized discrete Riccati difference equations. Based on this analysis, we establish a sufficient condition for uniqueness of the FNE and propose an efficient algorithm that computes it via a sequence of linear equations. We then consider the infinite-horizon game in which players adopt the finite-horizon strategies with heterogeneous prediction horizons and show that, under suitable conditions, the total cost under the finite-horizon strategies converges to the cost under the limiting infinite-horizon FNE. Moreover, we derive an explicit upper bound on this cost gap in terms of the distance between the corresponding strategy matrices. These results provide theoretical justification and quantitative performance guarantees for finite-horizon strategies in infinite-horizon LQ dynamic games. A nonscalar numerical example illustrates the effectiveness of the proposed framework.
comment: 10 pages, 2 figures
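The receding-horizon idea above (solve a T-stage game, apply only the first-stage control) can be sketched with the textbook backward recursion for a two-player discrete-time LQ game, in which each stage reduces to one linear system in the stacked feedback gains. This is an illustrative sketch under assumed conventions (stage cost x'Q_i x + u_i'R_i u_i, terminal cost x'Q_i x, feedback u_i = -K_i x); the function name `fne_gains` is hypothetical and the code is not claimed to be the paper's algorithm.

```python
import numpy as np

def fne_gains(A, B1, B2, Q1, Q2, R1, R2, T):
    """Backward recursion for feedback Nash gains of a T-stage LQ game."""
    m1 = B1.shape[1]
    P1, P2 = Q1.copy(), Q2.copy()          # terminal cost-to-go matrices
    gains = []
    for _ in range(T):
        # Both players' stationarity conditions stacked into one linear system
        M = np.block([
            [R1 + B1.T @ P1 @ B1, B1.T @ P1 @ B2],
            [B2.T @ P2 @ B1, R2 + B2.T @ P2 @ B2],
        ])
        rhs = np.vstack([B1.T @ P1 @ A, B2.T @ P2 @ A])
        K = np.linalg.solve(M, rhs)
        K1, K2 = K[:m1], K[m1:]
        Acl = A - B1 @ K1 - B2 @ K2        # closed-loop dynamics at this stage
        P1 = Q1 + K1.T @ R1 @ K1 + Acl.T @ P1 @ Acl
        P2 = Q2 + K2.T @ R2 @ K2 + Acl.T @ P2 @ Acl
        gains.append((K1, K2))
    return gains[::-1]                     # gains[0] is the first-stage pair

# Scalar toy game (assumed data)
I1 = np.array([[1.0]])
gains = fne_gains(I1, I1, I1, I1, I1, I1, I1, T=5)
K1_first, K2_first = gains[0]              # the only gains actually applied
```

In a receding-horizon implementation, each player would apply `-K_i_first @ x` at the current state and repeat the computation at the next stage.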
Score Matching Diffusion Based Feedback Control and Planning of Nonlinear Systems
In this paper, we propose a deterministic diffusion-based framework for controlling the probability density of nonlinear control-affine systems, with theoretical guarantees for drift-free and linear time-invariant (LTI) dynamics. The central idea is to first excite the system with white noise so that a forward diffusion process explores the reachable regions of state space, and then to design a deterministic feedback law that acts as a denoising mechanism driving the system back toward a desired target distribution supported on the target set. This denoising phase provides a feedback controller that steers the control system to the target set. In this framework, control synthesis reduces to constructing a deterministic reverse process that reproduces the desired evolution of state densities. We derive existence conditions ensuring such deterministic realizations of time-reversals for controllable drift-free and LTI systems, and show that the resulting feedback laws provide a tractable alternative to nonlinear control by viewing density control as a relaxation of controlling a system to target sets. Numerical studies on a unicycle model with obstacles, a five-dimensional driftless system, and a four-dimensional LTI system demonstrate reliable diffusion-inspired density control.
Automated Layout and Control Co-Design of Robust Multi-UAV Transportation Systems
The joint optimization of physical parameters and controllers in robotic systems is challenging. This is due to the difficulties of predicting the effect that changes in physical parameters have on final performance. At the same time, physical and morphological modifications can improve robot capabilities, perhaps completely unlocking new skills and tasks. We present a novel approach to co-optimize the physical layout and the control of a cooperative aerial transportation system. The goal is to achieve the most precise and robust flight when carrying a payload. We assume the agents are connected to the payload through rigid attachments, essentially transforming the whole system into a larger flying object with "thrust modules" at the attachment locations of the quadcopters. We investigate the optimal arrangement of the thrust modules around the payload, so that the resulting system achieves the best disturbance rejection capabilities. We propose a novel metric of robustness inspired by H2 control, and propose an algorithm to optimize the layout of the vehicles around the object and their controller altogether. We experimentally validate the effectiveness of our approach using fleets of three and four quadcopters and payloads of diverse shapes.
comment: 7 pages, 7 figures, journal paper (IEEE RA-L)
Synthesizing Interpretable Control Policies through Large Language Model Guided Search
The combination of Large Language Models (LLMs), systematic evaluation, and evolutionary algorithms has enabled breakthroughs in combinatorial optimization and scientific discovery. We propose to extend this powerful combination to the control of dynamical systems, generating interpretable control policies capable of complex behaviors. With our novel method, we represent control policies as programs in standard languages like Python. We evaluate candidate controllers in simulation and evolve them using a pre-trained LLM. Unlike conventional learning-based control techniques, which rely on black-box neural networks to encode control policies, our approach enhances transparency and interpretability. We still take advantage of the power of large AI models, but only at the policy design phase, ensuring that all system components remain interpretable and easily verifiable at runtime. Additionally, the use of standard programming languages makes it straightforward for humans to finetune or adapt the controllers based on their expertise and intuition. We illustrate our method through its application to the synthesis of an interpretable control policy for the pendulum swing-up and the ball in cup tasks. We make the code available at https://github.com/muellerlab/synthesizing_interpretable_control_policies.git.
comment: 8 pages, 7 figures, conference paper
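To give a sense of what a control policy represented as a Python program might look like, here is a hand-written, hypothetical example for a pendulum swing-up. It was not produced by the method above and is not taken from the linked repository; the function name, gains, thresholds, and the unit-mass, unit-length pendulum model are all untuned assumptions.

```python
import math

def swing_up_policy(theta, theta_dot, max_torque=2.0):
    """Readable swing-up rule; theta = 0 is upright, angles in radians."""
    if abs(theta) < 0.3:
        # Near upright: plain PD stabilization (gains are assumed values)
        u = -10.0 * theta - 2.0 * theta_dot
    else:
        # Far from upright: pump energy toward the upright energy level.
        # Energy is measured relative to the upright rest position.
        energy = 0.5 * theta_dot ** 2 + 9.81 * (math.cos(theta) - 1.0)
        direction = 1.0 if theta_dot >= 0.0 else -1.0
        u = -1.0 * energy * direction
    # Respect the actuator limit
    return max(-max_torque, min(max_torque, u))

u_hanging = swing_up_policy(math.pi, 0.0)   # pendulum hanging at rest
u_upright = swing_up_policy(0.0, 0.0)       # pendulum balanced upright
```

Because every branch and constant is visible, a human can inspect or retune such a program directly, which is the kind of interpretability the abstract refers to.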
Max-Consensus with Deterministic Convergence in Directed Graphs with Unreliable Communication Links
We present DMaC, a novel distributed, finite-time algorithm that guarantees max-consensus in directed networks with unreliable communication links experiencing packet drops. Unlike existing methods, DMaC ensures all nodes compute the exact maximum state under arbitrary packet loss patterns. It incorporates a fully distributed termination mechanism, enabling nodes to autonomously determine whether convergence has occurred. Our algorithm leverages narrowband error-free feedback channels to acknowledge successful (single-bit) transmissions with minimal communication overhead. We analyze our algorithm's operation, and we provide a convergence proof establishing explicit bounds on the required time steps. We validate its correctness in a wireless sensor network for environmental monitoring, and finally, we compare against existing approaches highlighting our algorithm's operational advantages.
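For readers unfamiliar with max-consensus, the basic iteration over a directed graph with lossy links can be sketched as follows. This is a plain illustrative baseline, not DMaC: it has no acknowledgment feedback channel and no distributed termination mechanism, and the names (`max_consensus`, `drop_prob`, `steps`) and the random-drop model are assumptions.

```python
import random

def max_consensus(states, edges, drop_prob=0.3, steps=200, seed=0):
    """Each node keeps the largest value it has received so far.
    edges is a list of (sender, receiver) pairs of a directed graph."""
    rng = random.Random(seed)
    x = list(states)
    for _ in range(steps):
        nxt = list(x)
        for sender, receiver in edges:
            # A packet is delivered only if it is not dropped this round
            if rng.random() >= drop_prob:
                nxt[receiver] = max(nxt[receiver], x[sender])
        x = nxt
    return x

# Directed ring of four nodes (strongly connected), assumed toy data
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
final = max_consensus([3, 7, 1, 5], ring)
```

In this baseline a node cannot tell when it is safe to stop, which is the gap a distributed termination mechanism is meant to close.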
A Predictive Flexibility Aggregation Method for Low Voltage Distribution System Control
This paper presents a method for predictive aggregation of the available flexibility at the residential unit level into a flexibility chart that represents the admissible active and reactive powers, along with the associated flexibility value. The method is also combined with centralized optimization to design a predictive privacy-preserving control scheme to manage low-voltage distribution systems in real-time. Similarly to hierarchical control strategies, this approach divides the optimization horizon into a real-time stage, responsible for decisions in the current market period, and an operational planning stage, which deals with decisions outside of this interval. First, a multiparametric optimization problem is solved offline at the residential unit level. Then, an operational planning problem, also formulated as a parametric optimization problem, is solved to account for the forecasts. The method generates the desired flexibility chart by combining the results of these two problems with measurements. The resulting approach is compatible with real-time control requirements, as heavy computations are performed offline in a decentralized manner. By linking real-time flexibility assessment with energy scheduling, our approach enables efficient and cost-effective management of low-voltage distribution systems. We validate this method on a low-voltage network of 43 buses by comparing it with a fully centralized optimization formulation with perfect foresight and a future-agnostic aggregation method.
comment: 10 pages, 10 figures
Safe and Optimal Learning from Preferences via Weighted Temporal Logic with Applications in Robotics and Formula 1
Autonomous systems increasingly rely on human feedback to align their behavior, expressed as pairwise comparisons, rankings, or demonstrations. While existing methods can adapt behaviors, they often fail to guarantee safety in safety-critical domains. We propose a safety-guaranteed, optimal, and efficient approach for solving the learning problem from preferences, rankings, or demonstrations using Weighted Signal Temporal Logic (WSTL). WSTL learning problems, when implemented naively, lead to multi-linear constraints in the weights to be learned. By introducing structural pruning and log-transform procedures, we reduce the problem size and recast it as a Mixed-Integer Linear Program while preserving safety guarantees. Experiments on robotic navigation and real-world Formula 1 data demonstrate that the method captures nuanced preferences and models complex task objectives.
comment: 8 pages, 2 figures
Analysis and Synthesis of Switched Optimization Algorithms
Deployment of optimization algorithms over communication networks faces challenges associated with time delays and corruptions. Fixed time delays can destabilize popular gradient-based algorithms, and this degradation is exacerbated by time-varying delays that may arise from packet drops. This work concentrates on the analysis and synthesis of discrete-time optimization algorithms with certified exponential convergence rates that are robust against switched network dynamics between the optimizer and the gradient oracle. Analysis is accomplished by solving linear matrix inequalities under bisection in the exponential convergence rate, searching over Zames-Falb filter coefficients that can certify convergence. Synthesis is performed by alternating between a search over filter coefficients for a fixed controller, and a search over controllers for a fixed filter. Effectiveness is demonstrated by the synthesis of convergent optimization algorithms over networks with time-varying delays, and networks with unstable channel dynamics.
Safety-Critical Control with Guaranteed Lipschitz Continuity via Filtered Control Barrier Functions
In safety-critical control systems, ensuring both system safety and smooth control input is essential for practical deployment. Existing Control Barrier Function (CBF) frameworks, especially High-Order CBFs (HOCBFs), effectively enforce safety constraints, but also raise concerns about the smoothness of the resulting control inputs. While smoothness typically refers to continuity and differentiability, it does not by itself ensure bounded input variation. In contrast, Lipschitz continuity is a stronger form of continuity that not only is necessary for the theoretical guarantee of safety, but also bounds the rate of variation and eliminates abrupt changes in the control input. Such abrupt changes can degrade system performance or even violate actuator limitations, yet current CBF-based methods do not provide Lipschitz continuity guarantees. This paper introduces Filtered Control Barrier Functions (FCBFs), which extend HOCBFs by incorporating an auxiliary dynamic system, referred to as an input regularization filter, to produce Lipschitz continuous control inputs. The proposed framework ensures safety, control bounds, and Lipschitz continuity of the control inputs simultaneously by integrating FCBFs and HOCBFs within a unified quadratic program (QP). Theoretical guarantees are provided and simulations on a unicycle model demonstrate the effectiveness of the proposed method compared to standard and smoothness-penalized HOCBF approaches.
comment: 8 pages, 4 figures
$K$-Lorentzian Polynomials, Semipositive Cones, and Cone-Stable EVI Systems
Lorentzian and completely log-concave polynomials have recently emerged as a unifying framework for negative dependence, log-concavity, and convexity in combinatorics and probability. We extend this theory to variational analysis and cone-constrained dynamics by studying $K$-Lorentzian and $K$-completely log-concave polynomials over a proper convex cone $K\subset\mathbb{R}^n$. For a $K$-Lorentzian form $f$ and $v\in\operatorname{int}K$, we define an open cone $K^\circ(f,v)$ and a closed cone $K(f,v)$ via directional derivatives along $v$, recovering the usual hyperbolicity cone when $f$ is hyperbolic. We prove that $K^\circ(f,v)$ is a proper cone and equals $\operatorname{int}K(f,v)$. If $f$ is $K(f,v)$-Lorentzian, then $K(f,v)$ is convex and maximal among convex cones on which $f$ is Lorentzian. Using the Rayleigh matrix $M_f(x)=\nabla f(x)\nabla f(x)^T - f(x)\nabla^2 f(x)$, we obtain cone-restricted Rayleigh inequalities and show that two-direction Rayleigh inequalities on $K$ are equivalent to an acuteness condition for the bilinear form $v^T M_f(x) w$. This yields a cone-restricted negative-dependence interpretation linking the curvature of $\log f$ to covariance properties of associated Gibbs measures. For determinantal generating polynomials, we identify the intersection of the hyperbolicity cone with the nonnegative orthant as the classical semipositive cone, and we extend this construction to general proper cones via $K$-semipositive cones. Finally, for linear evolution variational inequality (LEVI) systems, we show that if $q(x)=x^T A x$ is (strictly) $K$-Lorentzian, then $A$ is (strictly) $K$-copositive and yields Lyapunov (semi-)stability on $K$, giving new Lyapunov criteria for cone-constrained dynamics.
comment: 23 pages, 5 figures
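The Rayleigh matrix $M_f(x)=\nabla f(x)\nabla f(x)^T - f(x)\nabla^2 f(x)$ from the abstract above can be evaluated numerically for a quadratic form $q(x)=x^T A x$. The sketch below is a small worked check under stated assumptions: `A` is symmetric, the function name `rayleigh_matrix_quadratic` is hypothetical, and the example form $q(x)=2x_1x_2$ with the nonnegative orthant as the cone is chosen only for illustration.

```python
import numpy as np

def rayleigh_matrix_quadratic(A, x):
    """M_f(x) = grad f grad f^T - f * Hess f for f(x) = x^T A x, A symmetric."""
    f = x @ A @ x
    grad = 2.0 * A @ x
    hess = 2.0 * A
    return np.outer(grad, grad) - f * hess

# q(x) = 2 * x1 * x2, evaluated at a point in the interior of the orthant
A = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([1.0, 2.0])
M = rayleigh_matrix_quadratic(A, x)
eigenvalues = np.linalg.eigvalsh(M)
```

At this point the matrix is diagonal with entries 16 and 4, so it is positive semidefinite, which matches the log-concavity of $q$ at $x$ since $\nabla^2 \log f = -M_f/f^2$.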
Relative Localization System Design for SnailBot: A Modular Self-reconfigurable Robot
This paper presents the design and implementation of a relative localization system for SnailBot, a modular self-reconfigurable robot. The system integrates ArUco marker recognition, optical flow analysis, and IMU data processing into a unified fusion framework, enabling robust and accurate relative positioning for collaborative robotic tasks. Experimental validation demonstrates the effectiveness of the system in real-time operation, with a rule-based fusion strategy ensuring reliability across dynamic scenarios. The results highlight the potential for scalable deployment in modular robotic systems.
comment: The paper contains factual errors and logic flaws, which need to be repaired before submission
Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions
From autonomous driving to package delivery, ensuring safe yet efficient multi-agent interaction is challenging as the interaction dynamics are influenced by hard-to-model factors such as social norms and contextual cues. Understanding these influences can aid in the design and evaluation of socially-aware autonomous agents whose behaviors are aligned with human values. In this work, we seek to codify factors governing safe multi-agent interactions via the lens of responsibility, i.e., an agent's willingness to deviate from their desired control to accommodate safe interaction with others. Specifically, we propose a data-driven modeling approach based on control barrier functions and differentiable optimization that efficiently learns agents' responsibility allocation from data. We demonstrate on synthetic and real-world datasets that we can obtain an interpretable and quantitative understanding of how much agents adjust their behavior to ensure the safety of others given their current environment.
comment: 8 pages, 7 figures
Improved Robustness of Deep Reinforcement Learning for Control of Time-Varying Systems by Bounded Extremum Seeking
In this paper, we study the use of robust, model-independent bounded extremum seeking (ES) feedback control to improve the robustness of deep reinforcement learning (DRL) controllers for a class of nonlinear time-varying systems. DRL has the potential to learn from large datasets to quickly control or optimize the outputs of many-parameter systems, but its performance degrades catastrophically when the system model changes rapidly over time. Bounded ES can handle time-varying systems with unknown control directions, but its convergence speed slows down as the number of tuned parameters increases and, like all local adaptive methods, it can get stuck in local minima. We demonstrate that together, DRL and bounded ES result in a hybrid controller whose performance exceeds the sum of its parts, with DRL taking advantage of historical data to learn how to quickly control a many-parameter system to a desired setpoint while bounded ES ensures its robustness to time variations. We present a numerical study of a general time-varying system and a combined ES-DRL controller for automatic tuning of the Low Energy Beam Transport section at the Los Alamos Neutron Science Center linear particle accelerator.
Dampening parameter distributional shifts under robust control and gain scheduling
Many traditional robust control approaches assume linearity of the system and independence between the system state-input and the parameters of its approximant (possibly lower-order) model. This assumption implies that the application of robust control design to the underlying system introduces no distributional shifts in the parameters of its approximant model. This is generally not true when the underlying system is nonlinear, which may require different approximant models with different parameter distributions when operated at different regions of the state-input space. Therefore, a robust controller has to be robust under the approximant model with parameter distribution that will be experienced in the future data, after applying this control, not the parameter distribution seen in the learning data or assumed in the design. In this paper, we seek a solution to this problem by restricting the newly designed closed-loop system to be consistent with the learning data and slowing down any distributional shifts in the state-input space of the underlying system, and therefore, in the parameter space of its approximant model. In computational terms, the objective of dampening the shifts in the parameter distribution is formulated as a convex semi-definite program that can be solved efficiently by standard software packages. We evaluate the proposed approach on a simple yet telling gain-scheduling problem, which can be equivalently posed as a robust control problem.
Reactive Slip Control in Multifingered Grasping: Hybrid Tactile Sensing and Internal-Force Optimization ICRA
We present a hybrid learning and model-based approach for reactive internal-force adaptation to halt in-hand slip in a multifingered robotic gripper. A multimodal tactile stack combines piezoelectric (PzE) sensing for fast slip cues with piezoresistive (PzR) arrays for contact localization, enabling online construction of the grasp matrix. Upon slip detection, internal forces are updated in the null space of the grasp through a quadratic program that reinforces normal forces while preserving the object wrench. We demonstrate reactive stabilization of multifingered grasps under external perturbations. Augmenting analytic force control with learned tactile cues enables fast and reliable closed-loop stabilization in the evaluated grasp scenarios. The pipeline yields a theoretical sensing-to-command latency of 35-40 ms, including 5 ms for PzR-based grasp geometry updates and approximately 4 ms for solving the quadratic program. In controlled trials, slip onset is detected after ~20 ms. The analysis supports the feasibility of sub-50 ms integrated closed-loop stabilization.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), 2026
Active Learning-Based Input Design for Angle-Only Initial Relative Orbit Determination
Accurate relative orbit determination is a significant challenge in modern space operations, particularly when relying only on angular measurements. The inherent observability limitations of this approach make initial state estimation difficult, directly impacting mission safety and performance. This work proposes a hybrid estimation and control strategy for autonomous rendezvous. An active learning (AL) based algorithm designs the initial input control sequence by maximizing the exploration of the output space, thereby enhancing the observability of the initial relative state for the angle-only initial relative orbit determination (IROD) problem. The IROD solution provides a batch estimate of the initial relative state and its analytical covariance, which quantifies the estimation quality and determines the transition point to recursive filtering. Once the uncertainty is sufficiently low, an Extended Kalman Filter (EKF) is initialized with the IROD solution and takes over for sequential estimation, providing state estimates to a Model Predictive Controller (MPC) to complete the rendezvous. The proposed framework is validated through numerical simulations, demonstrating its ability to reliably resolve the scale ambiguity, outperform baseline excitation strategies, and successfully execute an end-to-end rendezvous from initial estimation to final approach.
Do Spatial Descriptors Improve Multi-DoF Finger Movement Decoding from HD sEMG?
Restoring hand function requires simultaneous and proportional control (SPC) of multiple degrees of freedom (DoFs). This study evaluated the multichannel linear descriptors-based block field method (MLD-BFM) against conventional feature extraction approaches for continuous decoding of five finger-joint DoFs using high-density surface electromyography (HD sEMG). Twenty-one healthy participants performed dynamic sinusoidal finger movements while HD sEMG signals were recorded from the proximal forearm. MLD-BFM extracted spatial descriptors including effective field strength ($Σ$), field-strength variation rate ($Φ$), and spatial complexity ($Ω$). Performance was optimized (block size: $2\times2$; window: 0.15 s) and compared with conventional time-domain features, root mean square (RMS) and mean absolute value plus waveform length (MAV-WL), as well as dimensionality reduction methods (PCA and NMF), using multi-output regression models. MLD-BFM achieved the highest mean variance-weighted coefficient of determination ($\mathrm{R}^2_\mathrm{vw}$) across all models, with the multilayer perceptron yielding the best result ($86.68 \pm 0.33 \%$). However, the improvement was not statistically significant relative to time-domain features, suggesting that dense multichannel recordings already encode spatial information through amplitude-based descriptors. MLD-BFM significantly outperformed dimensionality reduction approaches, indicating that preserving the spatial resolution of HD sEMG is critical for accurate multi-DoF finger movement regression.
comment: 14 pages, 12 figures, 1 table
Geometric SSM: LTI State Space Models for Selective Tasks
A key claim in recent work on Selective State Space Models is that selectivity, the ability to focus on relevant information while filtering irrelevant inputs, requires breaking the Linear Time-Invariant (LTI) property through time-varying dynamics. We challenge this claim by demonstrating that LTI systems can achieve selectivity when designed using principles from geometric control. We introduce the Geometric SSM, in which different input patterns excite distinct invariant subspaces of the dynamics. Unlike Mamba's memoryless selection mechanism, our approach employs a dynamic residual generator that maintains temporal memory, enabling recognition of multi-token patterns without time-varying system matrices. The Geometric SSM achieves near-perfect performance on a novel extended induction head task where Mamba fails, while preserving efficient FFT-based training. Our results demonstrate that geometric control theory can inform the design of novel selective sequence models that combine theoretical rigor with practical efficiency.
comment: 10 pages, 5 figures
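The Geometric SSM abstract above mentions that keeping the model LTI preserves efficient FFT-based training. A minimal sketch of why, assuming a generic diagonal LTI state space model (not the paper's architecture): the recurrence is equivalent to a causal convolution with a fixed kernel, and a fixed kernel is what an FFT can accelerate. The function names and the direct-sum convolution are illustrative assumptions; a practical implementation would replace `causal_conv` with an FFT-based convolution.

```python
def ssm_recurrence(a, b, c, u):
    """Step a diagonal LTI SSM: x_t = a * x_{t-1} + b * u_t, y_t = sum(c * x_t)."""
    x = [0.0] * len(a)
    ys = []
    for u_t in u:
        x = [a_i * x_i + b_i * u_t for a_i, x_i, b_i in zip(a, x, b)]
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c, x)))
    return ys


def ssm_kernel(a, b, c, length):
    """Impulse response K[k] = sum_i c_i * a_i**k * b_i of the same system."""
    return [
        sum(c_i * (a_i ** k) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for k in range(length)
    ]


def causal_conv(kernel, u):
    """Direct causal convolution y_t = sum_{k<=t} K[k] * u[t-k] (O(T^2), for clarity)."""
    return [
        sum(kernel[k] * u[t - k] for k in range(t + 1))
        for t in range(len(u))
    ]


# Hypothetical parameters: because a, b, c do not depend on t, one kernel serves all steps.
a, b, c = [0.9, 0.5], [1.0, 2.0], [0.3, -0.7]
u = [1.0, 0.0, 2.0, -1.0, 0.5]
y_rec = ssm_recurrence(a, b, c, u)
y_conv = causal_conv(ssm_kernel(a, b, c, len(u)), u)
```

With time-varying system matrices the kernel would change at every step, so this precomputed-convolution shortcut would no longer apply.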
Robotics
Exp-Force: Experience-Conditioned Pre-Grasp Force Selection with Vision-Language Models
Accurate pre-contact grasp force selection is critical for safe and reliable robotic manipulation. Adaptive controllers regulate force after contact but still require a reasonable initial estimate. Starting a grasp with too little force requires reactive adjustment, while starting a grasp with too high a force risks damaging fragile objects. This trade-off is particularly challenging for compliant grippers, whose contact mechanics are difficult to model analytically. We propose Exp-Force, an experience-conditioned framework that predicts the minimum feasible grasping force from a single RGB image. The method retrieves a small set of relevant prior grasping experiences and conditions a vision-language model on these examples for in-context inference, without analytic contact models or manually designed heuristics. On 129 object instances, Exp-Force achieves a best-case MAE of 0.43 N, reducing error by 72% over zero-shot inference. In real-world tests on 30 unseen objects, it improves appropriate force selection rate from 63% to 87%. These results demonstrate that Exp-Force enables reliable and generalizable pre-grasp force selection by leveraging prior interaction experiences. http://expforcesubmission.github.io/Exp-Force-Website/
Embedding Classical Balance Control Principles in Reinforcement Learning for Humanoid Recovery
Humanoid robots remain vulnerable to falls and unrecoverable failure states, limiting their practical utility in unstructured environments. While reinforcement learning has demonstrated stand-up behaviors, existing approaches treat recovery as a pure task-reward problem without an explicit representation of the balance state. We present a unified RL policy that addresses this limitation by embedding classical balance metrics (capture point, center-of-mass state, and centroidal momentum) as privileged critic inputs and by shaping rewards directly around these quantities during training, while the actor relies solely on proprioception for zero-shot hardware transfer. Without reference trajectories or scripted contacts, a single policy spans the full recovery spectrum: ankle and hip strategies for small disturbances, corrective stepping under large pushes, and compliant falling with multi-contact stand-up using the hands, elbows, and knees. Trained on the Unitree H1-2 in Isaac Lab, the policy achieves a 93.4% recovery rate across randomized initial poses and unscripted fall configurations. An ablation study shows that removing the balance-informed structure causes stand-up learning to fail entirely, confirming that these metrics provide a meaningful learning signal rather than incidental structure. Sim-to-sim transfer to MuJoCo and preliminary hardware experiments further demonstrate cross-environment generalization. These results show that embedding interpretable balance structure into the learning framework substantially reduces time spent in failure states and broadens the envelope of autonomous recovery.
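One of the classical balance metrics named in the abstract above, the capture point, has a simple closed form under the standard linear inverted pendulum model. The sketch below shows that textbook 1D formula only; how the paper computes the metric or turns it into critic inputs and shaping rewards is not stated in the abstract, and the function names and the support-interval check are assumptions made for illustration.

```python
import math


def capture_point(com_pos, com_vel, com_height, gravity=9.81):
    """1D capture point of a linear inverted pendulum: xi = x + x_dot / omega."""
    omega = math.sqrt(gravity / com_height)  # natural frequency of the pendulum
    return com_pos + com_vel / omega


def is_capturable(com_pos, com_vel, com_height, support_min, support_max):
    """Hypothetical check: the capture point lies inside the 1D support interval."""
    cp = capture_point(com_pos, com_vel, com_height)
    return support_min <= cp <= support_max


# Illustrative numbers: CoM above the origin, moving forward at 1 m/s, height 0.981 m.
cp = capture_point(0.0, 1.0, 0.981)
needs_step = not is_capturable(0.0, 1.0, 0.981, -0.1, 0.2)
```

When the capture point leaves the support region, the model predicts that ankle or hip strategies alone cannot stop the motion and a step is needed, which is the kind of distinction a balance-aware signal can expose.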
Diff-Muscle: Efficient Learning for Musculoskeletal Robotic Table Tennis
Musculoskeletal robots provide superior advantages in flexibility and dexterity, positioning them as a promising frontier towards embodied intelligence. However, current research is largely confined to relatively simple tasks, restricting the exploration of their full potential in multi-segment coordination. Furthermore, efficient learning remains a challenge, primarily due to the high-dimensional action space and inherent overactuated structures. To address these challenges, we propose Diff-Muscle, a musculoskeletal robot control algorithm that leverages differential flatness to reformulate policy learning from the redundant muscle-activation space into a significantly lower-dimensional joint space. Furthermore, we utilize the highly dynamic robotic table tennis task to evaluate our algorithm. Specifically, we propose a hierarchical reinforcement learning framework that integrates a Kinematics-based Muscle Actuation Controller (K-MAC) with high-level trajectory planning, enabling a musculoskeletal robot to perform dexterous and precise rallies. Experimental results demonstrate that Diff-Muscle significantly outperforms state-of-the-art baselines in success rates while maintaining minimal muscle activation. Notably, the proposed framework successfully enables the musculoskeletal robots to achieve continuous rallies in a challenging dual-robot setting.
comment: 8 pages, 7 figures
FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection
In order to navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction worker) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first multi-modal 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and refines them with attention especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multi-modal fusion designs leads to large gains for long-tailed 3D detection. Project website is at https://waabi.ai/fomo3d/.
comment: Published at 9th Annual Conference on Robot Learning (CoRL 2025)
Bilevel Planning with Learned Symbolic Abstractions from Interaction Data
Intelligent agents must reason over both continuous dynamics and discrete representations to generate effective plans in complex environments. Previous studies have shown that symbolic abstractions can emerge from neural effect predictors trained with a robot's unsupervised exploration. However, these methods rely on deterministic symbolic domains, lack mechanisms to verify the generated symbolic plans, and operate only at the abstract level, often failing to capture the continuous dynamics of the environment. To overcome these limitations, we propose a bilevel neuro-symbolic framework in which learned probabilistic symbolic rules generate candidate plans rapidly at the high level, and learned continuous effect models verify these plans and perform forward search when necessary at the low level. Our experiments on multi-object manipulation tasks demonstrate that the proposed bilevel method outperforms symbolic-only approaches, reliably identifying failing plans through verification, and achieves planning performance statistically comparable to continuous forward search while resolving most problems via efficient symbolic reasoning.
MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation
Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of specialized expert policies (Specialized Expert Policies, SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that ensure natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
comment: 8 figures, https://syt2004.github.io/metaworldX/
CONTACT: CONtact-aware TACTile Learning for Robotic Disassembly IROS 2026
Robotic disassembly involves contact-rich interactions in which successful manipulation depends not only on geometric alignment but also on force-dependent state transitions. While vision-based policies perform well in structured settings, their reliability often degrades in tight-tolerance, contact-dominated, or deformable scenarios. In this work, we systematically investigate the role of tactile sensing in robotic disassembly through both simulation and real-world experiments. We construct five rigid-body disassembly tasks in simulation with increasing geometric constraints and extraction difficulty. We further design five real-world tasks, including three rigid and two deformable scenarios, to evaluate contact-dependent manipulation. Within a unified learning framework, we compare three sensing configurations: Vision Only, Vision + tactile RGB (TacRGB), and Vision + tactile force field (TacFF). Across both simulation and real-world experiments, TacFF-based policies consistently achieve the highest success rates, with particularly notable gains in contact-dependent and deformable settings. Notably, naive fusion of TacRGB and TacFF underperforms either modality alone, indicating that simple concatenation can dilute task-relevant force information. Our results show that tactile sensing plays a critical, task-dependent role in robotic disassembly, with structured force-field representations being particularly effective in contact-dominated scenarios.
comment: Submitted to IROS 2026, 8 pages, 6 figures
Interactive World Simulator for Robot Policy Training and Evaluation
Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
comment: Project Page: https://yixuanwang.me/interactive_world_sim
The Neural Compass: Probabilistic Relative Feature Fields for Robotic Search IROS 2026
Object co-occurrences provide a key cue for finding objects successfully and efficiently in unfamiliar environments. Typically, one looks for cups in kitchens and views fridges as evidence of being in a kitchen. Such priors have also been exploited in artificial agents, but they are typically learned from explicitly labeled data or queried from language models. It is still unclear whether these relations can be learned implicitly from unlabeled observations alone. In this work, we address this problem and propose ProReFF, a feature field model trained to predict relative distributions of features obtained from pre-trained vision language models. In addition, we introduce a learning-based strategy that enables training from unlabeled and potentially contradictory data by aligning inconsistent observations into a coherent relative distribution. For the downstream object search task, we propose an agent that leverages predicted feature distributions as a semantic prior to guide exploration toward regions with a high likelihood of containing the object. We present extensive evaluations demonstrating that ProReFF captures meaningful relative feature distributions in natural scenes and provides insight into the impact of our proposed alignment step. We further evaluate the performance of our search agent in 100 challenges in the Matterport3D simulator, comparing with feature-based baselines and human participants. The proposed agent is 20% more efficient than the strongest baseline and achieves up to 80% of human performance.
comment: 9 pages, 7 figures, 2 tables, submitted to IROS 2026
EquiBim: Learning Symmetry-Equivariant Policy for Bimanual Manipulation IROS 2026
Robotic imitation learning has achieved impressive success in learning complex manipulation behaviors from demonstrations. However, many existing robot learning methods do not explicitly account for the physical symmetries of robotic systems, often resulting in asymmetric or inconsistent behaviors under symmetric observations. This limitation is particularly pronounced in dual-arm manipulation, where bilateral symmetry is inherent to both the robot morphology and the structure of many tasks. In this paper, we introduce EquiBim, a symmetry-equivariant policy learning framework for bimanual manipulation that enforces bilateral equivariance between observations and actions during training. Our approach formulates physical symmetry as a group action on both observation and action spaces, and imposes an equivariance constraint on policy predictions under symmetric transformations. The framework is model-agnostic and can be seamlessly integrated into a wide range of imitation learning pipelines with diverse observation modalities and action representations, including point cloud-based and image-based policies, as well as both end-effector-space and joint-space parameterizations. We evaluate EquiBim in simulation on RoboTwin, a dual-arm robotic platform with symmetric kinematics, across diverse observation and action configurations. We further validate the approach on a real-world dual-arm system. Across both simulation and physical experiments, our method consistently improves performance and robustness under distribution shifts. These results suggest that explicitly enforcing physical symmetry provides a simple yet effective inductive bias for bimanual robot learning.
comment: Submitted to IROS 2026. 8 pages, 6 figures
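The EquiBim abstract above describes an equivariance constraint under a group action on observations and actions. A minimal sketch of what such a constraint can look like, assuming a toy setting where observations and actions are both `(left_arm, right_arm)` pairs of 3D vectors and the bilateral reflection swaps the arms and negates the y-coordinate. The representation, the `mirror` action, and the squared-error penalty are all illustrative assumptions, not the paper's formulation.

```python
def flip_y(vec):
    """Reflect a 3D vector across the robot's sagittal plane (assumed to be y = 0)."""
    return [vec[0], -vec[1], vec[2]]


def mirror(pair):
    """Hypothetical bilateral group action: swap the two arms and reflect each vector."""
    left, right = pair
    return (flip_y(right), flip_y(left))


def equivariance_loss(policy, obs):
    """Squared error between policy(g . obs) and g . policy(obs); zero if equivariant."""
    lhs = policy(mirror(obs))
    rhs = mirror(policy(obs))
    return sum(
        (p - q) ** 2
        for arm_l, arm_r in zip(lhs, rhs)
        for p, q in zip(arm_l, arm_r)
    )


def symmetric_policy(obs):
    """Toy policy that treats both arms identically, so it commutes with mirror."""
    left, right = obs
    return ([0.5 * v for v in left], [0.5 * v for v in right])


def biased_policy(obs):
    """Toy policy with a left-arm-only offset, which breaks the symmetry."""
    left, right = obs
    return ([v + 0.1 for v in left], [0.5 * v for v in right])


obs = ([0.2, 0.3, 0.5], [0.1, -0.4, 0.6])
loss_symmetric = equivariance_loss(symmetric_policy, obs)
loss_biased = equivariance_loss(biased_policy, obs)
```

In a training pipeline a penalty of this shape would typically be added to the imitation loss, which is consistent with the abstract's claim that the constraint is model-agnostic.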
CRED: Counterfactual Reasoning and Environment Design for Active Preference Learning ICRA
As a robot's operational environment and tasks to perform within it grow in complexity, the explicit specification and balancing of optimization objectives to achieve a preferred behavior profile moves increasingly farther out of reach. These systems benefit strongly by being able to align their behavior to reflect human preferences and respond to corrections, but manually encoding this feedback is infeasible. Active preference learning (APL) learns human reward functions by presenting trajectories for ranking. However, existing methods sample from fixed trajectory sets or replay buffers that limit query diversity and often fail to identify informative comparisons. We propose CRED, a novel trajectory generation method for APL that improves reward inference by jointly optimizing environment design and trajectory selection to efficiently query and extract preferences from users. CRED "imagines" new scenarios through environment design and leverages counterfactual reasoning -- by sampling possible rewards from its current belief and asking "What if this were the true preference?" -- to generate trajectory pairs that expose differences between competing reward functions. Comprehensive experiments and a user study show that CRED significantly outperforms state-of-the-art methods in reward accuracy and sample efficiency and receives higher user ratings.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2026
OccTrack360: 4D Panoptic Occupancy Tracking from Surround-View Fisheye Cameras
Understanding dynamic 3D environments in a spatially continuous and temporally consistent manner is fundamental for robotics and autonomous driving. While recent advances in occupancy prediction provide a unified representation of scene geometry and semantics, progress in 4D panoptic occupancy tracking remains limited by the lack of benchmarks that support surround-view fisheye sensing, long temporal sequences, and instance-level voxel tracking. To address this gap, we present OccTrack360, a new benchmark for 4D panoptic occupancy tracking from surround-view fisheye cameras. OccTrack360 provides substantially longer and more diverse sequences (174 to 2234 frames) than prior benchmarks, together with principled voxel visibility annotations, including an all-direction occlusion mask and an MEI-based fisheye field-of-view mask. To establish a strong fisheye-oriented baseline, we further propose Focus on Sphere Occ (FoSOcc), a framework that addresses two core challenges in fisheye occupancy tracking: distorted spherical projection and inaccurate voxel-space localization. FoSOcc includes a Center Focusing Module (CFM) to enhance instance-aware spatial localization through supervised focus guidance, and a Spherical Lift Module (SLM) that extends perspective lifting to fisheye imaging under the Unified Projection Model. Extensive experiments on Occ3D-Waymo and OccTrack360 show that our method improves occupancy tracking quality with notable gains on geometrically regular categories, and establishes a strong baseline for future research on surround-view fisheye 4D occupancy tracking. The benchmark and source code will be made publicly available at https://github.com/YouthZest-Lin/OccTrack360.
comment: The benchmark and source code will be made publicly available at https://github.com/YouthZest-Lin/OccTrack360
AtomVLA: Scalable Post-Training for Robotic Manipulation via Predictive Latent World Models
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The execution of complex multi-step behaviors in VLA models can be improved by robust instruction grounding, a critical component for effective control. However, current paradigms predominantly rely on coarse, high-level task instructions during supervised fine-tuning. This instruction grounding gap leaves models without explicit intermediate guidance, leading to severe compounding errors in long-horizon tasks. Therefore, bridging this instruction gap and providing scalable post-training for VLA models is urgent. To tackle this problem, we propose AtomVLA, the first subtask-aware VLA framework integrated with a scalable offline post-training pipeline. Our framework leverages a large language model to decompose high-level demonstrations into fine-grained atomic subtasks. This approach utilizes a pretrained predictive world model to score candidate action chunks against subtask goals in the latent space, mitigating error accumulation while significantly improving long-horizon robustness. Furthermore, this approach enables highly efficient Group Relative Policy Optimization without the prohibitive expenses associated with online rollouts on physical robots. Extensive simulations validate that our AtomVLA maintains strong robustness under perturbations. When evaluated against fundamental baseline models, it achieves an average success rate of 97.0% on the LIBERO benchmark and 48.0% on the LIBERO-PRO benchmark. Finally, experiments conducted in the real world using the Galaxea R1 Lite platform confirm its broad applicability across diverse tasks, especially long-horizon tasks. All datasets, checkpoints, and code will be released to the public domain following the acceptance of this work for future research.
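The AtomVLA abstract above pairs world-model scoring of candidate action chunks with Group Relative Policy Optimization. A minimal sketch of the group-relative advantage step that gives the method its name, assuming the scores for one group of candidates are already available as plain floats. The reward values are hypothetical stand-ins for world-model scores, and the use of the population standard deviation and the `eps` constant are assumptions; the abstract does not specify these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: A_i = (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for four candidate action chunks sampled for the same subtask.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
best_candidate = max(range(len(scores)), key=lambda i: advantages[i])
```

Because each advantage is measured against the other candidates in the same group, no separate value network is needed, and the scores can come from an offline model rather than physical rollouts.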
Rethinking the semantic classification of indoor places by mobile robots
A significant challenge for service robots is the semantic understanding of their surrounding areas. Traditional approaches addressed this problem by segmenting the floor plan into regions corresponding to full rooms that are assigned labels consistent with human perception, e.g. office or kitchen. However, different areas inside the same room can be used in different ways: Could the table and the chair in my kitchen become my office? What is the category of that area now? Office or kitchen? To adapt to these circumstances we propose a new paradigm where we intentionally relax the resulting labeling of semantic classifiers by allowing confusions inside rooms. Our hypothesis is that those confusions can be beneficial to a service robot. We present a proof of concept in the task of searching for objects.
comment: Presented at the Workshop on Semantic Scene Understanding for Human Robot Interaction, in the ACM/IEEE International Conference on Human-Robot Interaction (HRI), Stockholm, Sweden, 2023
Spherical-GOF: Geometry-Aware Panoramic Gaussian Opacity Fields for 3D Scene Reconstruction
Omnidirectional images are increasingly used in robotics and vision due to their wide field of view. However, extending 3D Gaussian Splatting (3DGS) to panoramic camera models remains challenging, as existing formulations are designed for perspective projections and naive adaptations often introduce distortion and geometric inconsistencies. We present Spherical-GOF, an omnidirectional Gaussian rendering framework built upon Gaussian Opacity Fields (GOF). Unlike projection-based rasterization, Spherical-GOF performs GOF ray sampling directly on the unit sphere in spherical ray space, enabling consistent ray-Gaussian interactions for panoramic rendering. To make the spherical ray casting efficient and robust, we derive a conservative spherical bounding rule for fast ray-Gaussian culling and introduce a spherical filtering scheme that adapts Gaussian footprints to distortion-varying panoramic pixel sampling. Extensive experiments on standard panoramic benchmarks (OmniBlender and OmniPhotos) demonstrate competitive photometric quality and substantially improved geometric consistency. Compared with the strongest baseline, Spherical-GOF reduces depth reprojection error by 57% and improves cycle inlier ratio by 21%. Qualitative results show cleaner depth and more coherent normal maps, with strong robustness to global panorama rotations. We further validate generalization on OmniRob, a real-world robotic omnidirectional dataset introduced in this work, featuring UAV and quadruped platforms. The source code and the OmniRob dataset will be released at https://github.com/1170632760/Spherical-GOF.
comment: The source code and dataset will be released at https://github.com/1170632760/Spherical-GOF
An Open-Source Robotics Research Platform for Autonomous Laparoscopic Surgery
Autonomous robot-assisted surgery demands reliable, high-precision platforms that strictly adhere to the safety and kinematic constraints of minimally invasive procedures. Existing research platforms, primarily based on the da Vinci Research Kit, suffer from cable-driven mechanical limitations that degrade state-space consistency and hinder the downstream training of reliable autonomous policies. We present an open-source, robot-agnostic Remote Center of Motion (RCM) controller based on a closed-form analytical velocity solver that enforces the trocar constraint deterministically without iterative optimization. The controller operates in Cartesian space, enabling any industrial manipulator to function as a surgical robot. We provide implementations for the UR5e and Franka Emika Panda manipulators, and integrate stereoscopic 3D perception. We integrate the robot control into a full-stack ROS-based surgical robotics platform supporting teleoperation, demonstration recording, and deployment of learned policies via a decoupled server-client architecture. We validate the system on a bowel grasping and retraction task across phantom, ex vivo, and in vivo porcine laparoscopic procedures. RCM deviations remain sub-millimeter across all conditions, and trajectory smoothness metrics (SPARC, LDLJ) are comparable to expert demonstrations from the JIGSAWS benchmark recorded on the da Vinci system. These results demonstrate that the platform provides the precision and robustness required for teleoperation, data collection and autonomous policy deployment in realistic surgical scenarios.
comment: Submitted to IROS 2026
3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos
Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap (differences in kinematics and strategies between robots and humans), often requiring restrictive and carefully choreographed human motions. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. We use a Perceiver IO architecture to extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct thorough evaluation on simulated and real-world tasks, and find that 3PoinTr achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms the baselines, including behavior cloning methods, as well as prior methods for pretraining from human videos. We also provide evaluations of 3PoinTr's 3D point track predictions compared to an existing point track prediction baseline. We find that 3PoinTr produces more accurate and higher quality point tracks due to a lightweight yet expressive architecture built on a single transformer, in addition to a training formulation that preserves supervision of partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.
STRIDE: Structured Lagrangian and Stochastic Residual Dynamics via Flow Matching
Robotic systems operating in unstructured environments must operate under significant uncertainty arising from intermittent contacts, frictional variability, and unmodeled compliance. While recent model-free approaches have demonstrated impressive performance, many deployment settings still require predictive models that support planning, constraint handling, and online adaptation. Analytical rigid-body models provide strong physical structure but often fail to capture complex interaction effects, whereas purely data-driven models may violate physical consistency, exhibit data bias, and accumulate long-horizon drift. In this work, we propose STRIDE, a dynamics learning framework that explicitly separates conservative rigid-body mechanics from uncertain, effectively stochastic non-conservative interaction effects. The structured component is modeled using a Lagrangian Neural Network (LNN) to preserve energy-consistent inertial dynamics, while residual interaction forces are represented using Conditional Flow Matching (CFM) to capture multi-modal interaction phenomena. The two components are trained jointly end-to-end, enabling the model to retain physical structure while representing complex stochastic behavior. We evaluate STRIDE on systems of increasing complexity, including a pendulum, the Unitree Go1 quadruped, and the Unitree G1 humanoid. Results show 20% reduction in long-horizon prediction error and 30% reduction in contact force prediction error compared to deterministic residual baselines, supporting more reliable model-based control in uncertain robotic environments.
comment: 9 pages, 7 figures
LAR-MoE: Latent-Aligned Routing for Mixture of Experts in Robotic Imitation Learning
Imitation learning enables robots to acquire manipulation skills from demonstrations, yet deploying a policy across tasks with heterogeneous dynamics remains challenging, as models tend to average over distinct behavioral modes present in the demonstrations. Mixture-of-Experts (MoE) architectures address this by activating specialized subnetworks, but require meaningful skill decompositions for expert routing. We introduce Latent-Aligned Routing for Mixture of Experts (LAR-MoE), a two-stage framework that decouples unsupervised skill discovery from policy learning. In pre-training, we learn a joint latent representation between observations and future actions through student-teacher co-training. In a post-training stage, the expert routing is regularized to follow the structure of the learned latent space, preventing expert collapse while maintaining parameter efficiency. We evaluate LAR-MoE in simulation and on hardware. On the LIBERO benchmark, our method achieves a 95.2% average success rate with 150M parameters. On a surgical bowel grasping and retraction task, LAR-MoE matches a supervised MoE baseline without requiring any phase annotations, and transfers zero-shot to ex vivo porcine tissue. Our findings suggest that latent-aligned routing provides a principled alternative to supervised skill decomposition, enabling structured expert specialization from unlabeled demonstrations.
comment: Submitted to IROS 2026
R2F: Repurposing Ray Frontiers for LLM-free Object Navigation
Zero-shot open-vocabulary object navigation has progressed rapidly with the emergence of large Vision-Language Models (VLMs) and Large Language Models (LLMs), now widely used as high-level decision-makers instead of end-to-end policies. Although effective, such systems often rely on iterative large-model queries at inference time, introducing latency and computational overhead that limit real-time deployment. To address this problem, we repurpose ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation. While ray frontiers were originally used to bias exploration using semantic cues carried along rays, we reinterpret frontier regions as explicit, direction-conditioned semantic hypotheses that serve as navigation goals. Language-aligned features accumulated along out-of-range rays are stored sparsely at frontiers, where each region maintains multiple directional embeddings encoding plausible unseen content. In this way, navigation reduces to embedding-based frontier scoring and goal tracking within a classical mapping and planning pipeline, eliminating iterative large-model reasoning. We further introduce R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components. Experiments in Habitat-sim and on a real robotic platform demonstrate competitive state-of-the-art zero-shot performance with real-time execution, achieving up to 6 times faster runtime than VLM-based alternatives.
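Embedding-based frontier scoring of the kind described above might look like the following sketch. It assumes cosine similarity between a query embedding and each frontier's directional embeddings, with the best direction deciding the frontier's score; the scoring rule, the data layout, and the names `cosine` and `score_frontiers` are illustrative assumptions rather than the R2F implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_frontiers(query, frontiers):
    """Pick the frontier whose best directional embedding matches the query.

    frontiers: dict mapping a frontier id to a list of directional
    embeddings (hypothetical layout). Returns (best_id, best_score),
    or (None, -inf) when there are no embeddings at all.
    """
    best_id, best_score = None, -math.inf
    for frontier_id, embeddings in frontiers.items():
        score = max((cosine(query, e) for e in embeddings), default=-math.inf)
        if score > best_score:
            best_id, best_score = frontier_id, score
    return best_id, best_score
```

Taking the maximum over directions lets a single promising viewing direction make a frontier attractive even when its other directions are uninformative; averaging would be an equally plausible choice under different assumptions.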
Adaptive Entropy-Driven Sensor Selection in a Camera-LiDAR Particle Filter for Single-Vessel Tracking
Robust single-vessel tracking from fixed coastal platforms is hindered by modality-specific degradations: cameras suffer from illumination and visual clutter, while LiDAR performance drops with range and intermittent returns. We present a heterogeneous multi-sensor fusion particle-filter tracker that incorporates an information-gain (entropy-reduction) adaptive sensing policy to select the most informative configuration at each fusion time bin. The approach is validated in a real maritime deployment at the CMMI Smart Marina Testbed (Ayia Napa Marina, Cyprus), using a shore-mounted 3D LiDAR and an elevated fixed camera to track a rigid inflatable boat with onboard GNSS ground truth. We compare LiDAR-only, camera-only, all-sensors, and adaptive configurations. Results show LiDAR dominates near-field accuracy, the camera sustains longer-range coverage when LiDAR becomes unavailable, and the adaptive policy achieves a favorable accuracy-continuity trade-off by switching modalities based on information gain. By avoiding continuous multi-stream processing, the adaptive configuration provides a practical baseline for resilient and resource-aware maritime surveillance.
comment: 8 pages, 5 figures, submitted to FUSION 2026 conference proceedings
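An entropy-reduction selection rule over particle weights can be sketched as follows. The sketch assumes a 1-D state, Gaussian measurement likelihoods, and selection by the realized drop in weight entropy for each candidate configuration; all names (`select_configuration`, the `measurements` layout) are hypothetical, and the paper's actual policy may differ.

```python
import math

def entropy(weights):
    """Shannon entropy (nats) of normalized particle weights."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def normalize(weights):
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("all particle weights vanished")
    return [w / total for w in weights]

def gaussian_likelihood(z, x, sigma):
    """Unnormalized Gaussian likelihood of measurement z given state x."""
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)

def select_configuration(particles, weights, measurements):
    """Return (name, gain) of the configuration with the largest entropy drop.

    measurements: dict mapping a configuration name to a list of
    (z, sigma) observations available in the current time bin
    (assumed layout).
    """
    prior_entropy = entropy(weights)
    best_name, best_gain = None, -math.inf
    for name, observations in measurements.items():
        posterior = list(weights)
        for z, sigma in observations:
            posterior = [w * gaussian_likelihood(z, x, sigma)
                         for w, x in zip(posterior, particles)]
        gain = prior_entropy - entropy(normalize(posterior))
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain
```

With a uniform prior over five particles, a precise observation concentrates the weights more than a noisy one, so the precise sensor is selected.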
FoMo: A Multi-Season Dataset for Robot Navigation in Forêt Montmorency
The Forêt Montmorency (FoMo) dataset is a comprehensive multi-season data collection, recorded over the span of one year in a boreal forest. Featuring a unique combination of on- and off-pavement environments with significant environmental changes, the dataset challenges established odometry and SLAM pipelines. Some highlights of the data include the accumulation of snow exceeding 1 m, significant vegetation growth in front of sensors, and operations at the traction limits of the platform. In total, the FoMo dataset includes over 64 km of six diverse trajectories, repeated during 12 deployments throughout the year. The dataset features data from one rotating and one hybrid solid-state lidar, a Frequency Modulated Continuous Wave (FMCW) radar, full-HD images from a stereo camera and a wide lens monocular camera, as well as data from two IMUs. Ground truth is calculated by post-processing three GNSS receivers mounted on the Uncrewed Ground Vehicle (UGV) and a static GNSS base station. Additional metadata, such as one measurement per minute from an on-site weather station, camera calibration intrinsics, and vehicle power consumption, is available for all sequences. To highlight the relevance of the dataset, we performed a preliminary evaluation of the robustness of lidar-inertial, radar-gyro, and visual-inertial localization and mapping techniques to seasonal changes. We show that seasonal changes have serious effects on the re-localization capabilities of the state-of-the-art methods. The dataset and development kit are available at https://fomo.norlab.ulaval.ca.
Tactile Recognition of Both Shapes and Materials with Automatic Feature Optimization-Enabled Meta Learning ICRA 2026
Tactile perception is indispensable for robots to implement various manipulations dexterously, especially in contact-rich scenarios. However, as deep learning techniques have been adopted, tactile perception has come to suffer from training data scarcity and a time-consuming learning process in practical applications, since collecting a large amount of tactile data is costly and sometimes even impossible. Hence, we propose an automatic feature optimization-enabled prototypical network to realize meta-learning, i.e., the AFOP-ML framework. As a "learn to learn" network, it not only adapts rapidly to new unseen classes in few-shot settings, but also learns how to determine the optimal feature space automatically. Based on the four-channel signals acquired from a tactile finger, both shapes and materials are recognized. On a 36-category benchmark, it outperforms several existing approaches by attaining an accuracy of 96.08% in the 5-way-1-shot scenario, where only 1 example is available for training. Accuracy remains at 88.7% in the extreme 36-way-1-shot case. The generalization ability is further validated through three groups of experiments involving unseen shapes, materials and force/speed perturbations. More insights are additionally provided by this work for the interpretation of recognition tasks and improved design of tactile sensors.
comment: 7 pages, 7 figures, conference paper accepted by ICRA 2026
Human-Aware Robot Behaviour in Self-Driving Labs
Self-driving laboratories (SDLs) are rapidly transforming research in chemistry and materials science to accelerate new discoveries. Mobile robot chemists (MRCs) play a pivotal role by autonomously navigating the lab to transport samples, effectively connecting synthesis, analysis, and characterisation equipment. The instruments within an SDL are typically designed or retrofitted to be accessed by both human and robotic chemists, ensuring operational flexibility and integration between manual and automated workflows. In many scenarios, human and robotic chemists may need to use the same equipment simultaneously. Currently, MRCs rely on simple LiDAR-based obstruction detection, which forces the robot to passively wait if a human is present. This lack of situational awareness leads to unnecessary delays and inefficient coordination in time-critical automated workflows in human-robot shared labs. To address this, we present an initial study of an embodied, AI-driven perception method that facilitates proactive human-robot interaction in shared-access scenarios. Our method features a hierarchical human intention prediction model that allows the robot to distinguish between preparatory actions (waiting) and transient interactions (accessing the instrument). Our results demonstrate that the proposed approach enhances efficiency by enabling proactive human-robot interaction, streamlining coordination, and potentially increasing the efficiency of autonomous scientific labs.
A Recipe for Stable Offline Multi-agent Reinforcement Learning
Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap comes from the instability of non-linear value decomposition, leading prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) with value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that they induce value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.
comment: Preprint
StructBiHOI: Structured Articulation Modeling for Long-Horizon Bimanual Hand-Object Interaction Generation
Recent progress in 3D hand-object interaction (HOI) generation has primarily focused on single-hand grasp synthesis, while bimanual manipulation remains significantly more challenging. Long-horizon planning instability, fine-grained joint articulation, and complex cross-hand coordination make coherent bimanual generation difficult, especially under multimodal conditions. Existing approaches often struggle to simultaneously ensure temporal consistency, physical plausibility, and semantic alignment over extended sequences. We propose StructBiHOI, a Structured articulation modeling framework for long-horizon Bimanual HOI generation. Our key insight is to structurally disentangle temporal joint planning from frame-level manipulation refinement. Specifically, a jointVAE models long-term joint evolution conditioned on object geometry and task semantics, while a maniVAE refines fine-grained hand poses at the single-frame level. To enable stable and efficient long-sequence generation, we incorporate a state-space-inspired diffusion denoiser based on Mamba, which models long-range dependencies with linear complexity. This hierarchical design facilitates coherent dual-hand coordination and articulated object interaction. Extensive experiments on bimanual manipulation and single-hand grasping benchmarks demonstrate that our method achieves superior long-horizon stability, motion realism, and computational efficiency compared to strong baselines.
MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation
Indoor mobile manipulation (MoMA) enables robots to translate natural language instructions into physical actions, yet long-horizon execution remains challenging due to cascading errors and limited generalization across diverse environments. Learning-based approaches often fail to maintain logical consistency over extended horizons, while methods relying on explicit scene representations impose rigid structural assumptions that reduce adaptability in dynamic settings. To address these limitations, we propose MoMaStage, a structured vision-language framework for long-horizon MoMA that eliminates the need for explicit scene mapping. MoMaStage grounds a Vision-Language Model (VLM) within a Hierarchical Skill Library and a topology-aware Skill-State Graph, constraining task decomposition and skill composition within a feasible transition space. This structured grounding ensures that generated plans remain logically consistent and topologically valid with respect to the agent's evolving physical state. To enhance robustness, MoMaStage incorporates a closed-loop execution mechanism that monitors proprioceptive feedback and triggers graph-constrained semantic replanning when deviations are detected, maintaining alignment between planned skills and physical outcomes. Extensive experiments in physics-rich simulations and real-world environments demonstrate that MoMaStage outperforms state-of-the-art baselines, achieving substantially higher planning success, reducing token overhead, and significantly improving overall task success rates in long-horizon mobile manipulation. Video demonstrations are available on the project website: https://chenxuli-cxli.github.io/MoMaStage/.
comment: 8 pages
Perception-Aware Communication-Free Multi-UAV Coordination in the Wild
We present a communication-free method for safe multi-robot coordination in complex environments such as forests with dense canopy cover, where GNSS is unavailable. Our approach relies on an onboard anisotropic 3D LiDAR sensor used for SLAM as well as for detecting obstacles and neighboring robots. We develop a novel perception-aware 3D navigation framework that enables robots to safely and effectively progress toward a goal region despite limited sensor field-of-view. The approach is evaluated through extensive simulations across diverse scenarios and validated in real-world field experiments, demonstrating its scalability, robustness, and reliability.
PhaForce: Phase-Scheduled Visual-Force Policy Learning with Slow Planning and Fast Correction for Contact-Rich Manipulation
Contact-rich manipulation requires not only vision-dominant task semantics but also closed-loop reactions to force/torque (F/T) transients. Yet, generative visuomotor policies are typically constrained to low-frequency updates due to inference latency and action chunking, underutilizing F/T for control-rate feedback. Furthermore, existing force-aware methods often inject force continuously and indiscriminately, lacking an explicit mechanism to schedule when / how much / where to apply force across different task phases. We propose PhaForce, a phase-scheduled visual-force policy that coordinates low-rate chunk-level planning and high-rate residual correction via a unified contact/phase schedule. PhaForce comprises (i) a contact-aware phase predictor (CAP) that estimates contact probability and phase belief, (ii) a Slow diffusion planner that performs dual-gated visual-force fusion with orthogonal residual injection to preserve vision semantics while conditioning on force, and (iii) a Fast corrector that applies control-rate phase-routed residuals in interpretable corrective subspaces for within-chunk micro-adjustments. Across multiple real-robot contact-rich tasks, PhaForce achieves an average success rate of 86% (+40 pp over baselines), while also substantially improving contact quality by regulating interaction forces and exhibiting robust adaptability to OOD geometric shifts.
Hierarchical Multi-Modal Planning for Fixed-Altitude Sparse Target Search and Sampling
Efficient monitoring of sparse benthic phenomena, such as coral colonies, presents a great challenge for Autonomous Underwater Vehicles. Traditional exhaustive coverage strategies are energy-inefficient, while recent adaptive sampling approaches rely on costly vertical maneuvers. To address these limitations, we propose HIMoS (Hierarchical Informative Multi-Modal Search), a fixed-altitude framework for sparse coral search-and-sample missions. The system integrates a heterogeneous sensor suite within a two-layer planning architecture. At the strategic level, a Global Planner optimizes topological routes to maximize potential discovery. At the tactical level, a receding-horizon Local Planner leverages differentiable belief propagation to generate kinematically feasible trajectories that balance acoustic substrate exploration, visual coral search, and close-range sampling. Validated in high-fidelity simulations derived from real-world coral reef benthic surveys, our approach demonstrates superior mission efficiency compared to state-of-the-art baselines.
comment: 8 pages, 9 figures, conference
EndoSERV: A Vision-based Endoluminal Robot Navigation System
Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, i.e., SEgment-to-structure and Real-to-Virtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To cater for label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.
Less is More: Robust Zero-Communication 3D Pursuit-Evasion via Representational Parsimony
Asymmetric 3D pursuit-evasion in cluttered voxel environments is difficult under communication latency, partial observability, and nonholonomic maneuver limits. While many MARL methods rely on richer inter-agent coupling or centralized signals, these dependencies can become fragility sources when communication is delayed or noisy. Building on an inherited path-guided decentralized pursuit scaffold, we study a robustness-oriented question: can representational parsimony improve communication-free coordination? We instantiate this principle with (i) a parsimonious actor observation interface that removes team-coupled channels (83-D to 50-D), and (ii) Contribution-Gated Credit Assignment (CGCA), a locality-aware credit structure for communication-denied cooperation. In Stage-5 evaluation (4 pursuers vs. 1 evader), our configuration reaches 0.753 +/- 0.091 success and 0.223 +/- 0.066 collision, outperforming the 83-D FULL OBS counterpart (0.721 +/- 0.071, 0.253 +/- 0.089). It further shows graceful degradation under speed/yaw/noise/delay stress tests and resilient zero-shot transfer on urban-canyon maps (about 61% success at density 0.24). These results support a practical paradigm shift: explicitly severing redundant cross-agent channels can suppress compounding error cascades and improve robustness in latency-prone deployment.
comment: 7 pages, 10 figures. This work has been submitted to the IEEE for possible publication
SAIL: Test-Time Scaling for In-Context Imitation Learning with VLM
In-context imitation learning allows robots to acquire skills from demonstrations, yet one-shot trajectory generation remains fragile under environmental variation. We propose SAIL, a framework that reframes robot imitation as an iterative refinement problem capable of scaling with test-time compute. SAIL utilizes Monte Carlo Tree Search, where each node is a complete trajectory and edges correspond to trajectory refinements. The process is guided by three core components: an automated archive of successful trajectories for contextually relevant retrieval, a vision language model-based scoring mechanism for trajectory evaluation, and a step-level feedback that provides trajectory-aligned scores for iterative refinement. Experiments across six diverse manipulation tasks in simulation and real-world validation clearly demonstrate that increasing test-time compute consistently improves success rates, achieving up to 95% on complex tasks. Our results suggest that trajectory-level test-time scaling is a robust path toward more generalizable robotic agents.
comment: 8 pages, 3 figures
Seed2Scale: A Self-Evolving Data Engine for Embodied AI via Small to Large Model Synergy and Multimodal Evaluation
Existing data generation methods suffer from exploration limits, embodiment gaps, and low signal-to-noise ratios, leading to performance degradation during self-iteration. To address these challenges, we propose Seed2Scale, a self-evolving data engine that overcomes the data bottleneck through a heterogeneous synergy of "small-model collection, large-model evaluation, and target-model learning". Starting with as few as four seed demonstrations, the engine employs the lightweight Vision-Language-Action model, SuperTiny, as a dedicated collector, leveraging its strong inductive bias for robust exploration in parallel environments. Concurrently, a pre-trained Vision-Language Model is integrated as a Verifier to autonomously perform success/failure judgment and quality scoring for the massive generated trajectories. Seed2Scale effectively mitigates model collapse, ensuring the stability of the self-evolution process. Experimental results demonstrate that Seed2Scale exhibits significant scaling potential: as iterations progress, the success rate of the target model shows a robust upward trend, achieving a performance improvement of 131.2%. Furthermore, Seed2Scale significantly outperforms existing data augmentation methods, providing a scalable and cost-effective pathway for the large-scale development of Generalist Embodied AI. Project page: https://terminators2025.github.io/Seed2Scale.github.io
FlowTouch: View-Invariant Visuo-Tactile Prediction
Tactile sensation is essential for contact-rich manipulation tasks. It provides direct feedback on object geometry, surface properties, and interaction forces, enhancing perception and enabling fine-grained control. An inherent limitation of tactile sensors is that readings are available only when an object is touched. This precludes their use during planning and the initial execution phase of a task. Predicting tactile information from visual information can bridge this gap. A common approach is to learn a direct mapping from camera images to the output of vision-based tactile sensors. However, the resulting model will depend strongly on the specific setup and on how well the camera can capture the area where an object is touched. In this work, we introduce FlowTouch, a novel model for view-invariant visuo-tactile prediction. Our key idea is to use an object's local 3D mesh to encode rich information for predicting tactile patterns while abstracting away from scene-dependent details. FlowTouch integrates scene reconstruction and Flow Matching-based models for image generation. Our results show that FlowTouch is able to bridge the sim-to-real gap and generalize to new sensor instances. We further show that the resulting tactile images can be used for downstream grasp stability prediction. Our code, datasets and videos are available at https://flowtouch.github.io/
A General Lie-Group Framework for Continuum Soft Robot Modeling
This paper introduces a general Lie group framework for modeling continuum soft robots, employing Cosserat rod theory combined with cumulative parameterization on the Lie group SE(3). This novel approach addresses limitations present in current strain-based and configuration-based methods by providing geometric local control and eliminating unit quaternion constraints. The paper derives unified analytical expressions for kinematics, statics, and dynamics, including recursive Jacobian computations and an energy-conserving integrator suitable for real-time simulation and control. Additionally, the framework is extended to handle complex robotic structures, including segmented, branched, nested, and rigid-soft composite configurations, facilitating a modular and unified modeling strategy. The effectiveness, generality, and computational efficiency of the proposed methodology are demonstrated through various scenarios, including large-deformation rods, concentric tube robots, parallel robots, cable-driven robots, and articulated fingers. This work enhances modeling flexibility and numerical performance, providing an improved toolset for designing, simulating, and controlling soft robotic systems.
Fusion-Poly: A Polyhedral Framework Based on Spatial-Temporal Fusion for 3D Multi-Object Tracking
LiDAR-camera 3D multi-object tracking (MOT) combines rich visual semantics with accurate depth cues to improve trajectory consistency and tracking reliability. In practice, however, LiDAR and cameras operate at different sampling rates. To maintain temporal alignment, existing data pipelines usually synchronize heterogeneous sensor streams and annotate them at a reduced shared frequency, forcing most prior methods to perform spatial fusion only at synchronized timestamps through projection-based or learnable cross-sensor association. As a result, abundant asynchronous observations remain underexploited, despite their potential to support more frequent association and more robust trajectory estimation over short temporal intervals. To address this limitation, we propose Fusion-Poly, a spatial-temporal fusion framework for 3D MOT that integrates asynchronous LiDAR and camera data. Fusion-Poly associates trajectories with multi-modal observations at synchronized timestamps and with single-modal observations at asynchronous timestamps, enabling higher-frequency updates of motion and existence states. The framework contains three key components: a frequency-aware cascade matching module that adapts to synchronized and asynchronous frames according to available detection modalities; a frequency-aware trajectory estimation module that maintains trajectories through high-frequency motion prediction, differential updates, and confidence-calibrated lifecycle management; and a full-state observation alignment module that improves cross-modal consistency at synchronized timestamps by optimizing image-projection errors. On the nuScenes test set, Fusion-Poly achieves 76.5% AMOTA, establishing a new state of the art among tracking-by-detection 3D MOT methods. Extensive ablation studies further validate the effectiveness of each component. Code will be released.
Edged USLAM: Edge-Aware Event-Based SLAM with Learning-Based Depth Priors ICRA 2026
Conventional visual simultaneous localization and mapping (SLAM) algorithms often fail under rapid motion, low illumination, or abrupt lighting transitions due to motion blur and limited dynamic range. Event cameras mitigate these issues with high temporal resolution and high dynamic range (HDR), but their sparse, asynchronous outputs complicate feature extraction and integration with other sensors, e.g., inertial measurement units (IMUs) and standard cameras. We present Edged USLAM, a hybrid visual-inertial system that extends Ultimate SLAM (USLAM) with an edge-aware front-end and a lightweight depth module. The front-end enhances event frames for robust feature tracking and nonlinear motion compensation, while the depth module provides coarse, region-of-interest (ROI)-based scene depth to improve motion compensation and scale consistency. Evaluations across public benchmarks and real-world unmanned air vehicle (UAV) flights demonstrate that performance varies significantly by scenario. For instance, event-only methods like point-line event-based visual-inertial odometry (PL-EVIO) or learning-based pipelines such as deep event-based visual odometry (DEVO) excel in highly aggressive or extreme HDR conditions. In contrast, Edged USLAM provides superior stability and minimal drift in slow or structured trajectories, ensuring consistently accurate localization on real flights under challenging illumination. These findings highlight the complementary strengths of event-only, learning-based, and hybrid approaches, while positioning Edged USLAM as a robust solution for diverse aerial navigation tasks.
comment: 8 pages, 7 figures, 3 tables. Accepted to ICRA 2026. Project code and datasets available at https://github.com/sebnem-byte/Edged-USLAM
Multifingered force-aware control for humanoid robots ICRA 2026
In this paper, we address force-aware control and force distribution in robotic platforms with multi-fingered hands. Given a target goal and force estimates from tactile sensors, we design a controller that adapts the motion of the torso, arm, wrist, and fingers, redistributing forces to maintain stable contact with objects of varying mass distribution or unstable contacts. To estimate forces, we collect a dataset of tactile signals and ground-truth force measurements using five Xela magnetic sensors interacting with indenters, and train force estimators. We then introduce a model-based control scheme that minimizes the distance between the Center of Pressure (CoP) and the centroid of the fingertip contact polygon. Since our method relies on estimated forces rather than raw tactile signals, it has the potential to be applied to any sensor capable of force estimation. We validate our framework on a balancing task with five objects, achieving an $82.7\%$ success rate, and further evaluate it in multi-object scenarios, achieving $80\%$ accuracy. Code and data are available at https://github.com/hsp-iit/multifingered-force-aware-control.
comment: This work has been accepted for publication in ICRA 2026
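The CoP-to-centroid objective described in the abstract above can be illustrated with a minimal sketch. It is not the authors' controller: it assumes planar (2D) fingertip contact points, scalar normal-force estimates, and uses the mean of the contact points as a stand-in for the contact-polygon centroid. All function names are hypothetical.

```python
from math import hypot


def center_of_pressure(points, forces):
    """Force-weighted mean of 2D contact points (assumes scalar normal forces)."""
    total = sum(forces)
    if total <= 0:
        raise ValueError("total contact force must be positive")
    x = sum(f * p[0] for p, f in zip(points, forces)) / total
    y = sum(f * p[1] for p, f in zip(points, forces)) / total
    return (x, y)


def vertex_centroid(points):
    """Mean of the contact points (assumption: stand-in for the polygon centroid)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def cop_centroid_distance(points, forces):
    """Distance that a controller of this kind would try to drive toward zero."""
    cx, cy = center_of_pressure(points, forces)
    gx, gy = vertex_centroid(points)
    return hypot(cx - gx, cy - gy)


# Hypothetical fingertip layout: four contacts on a square.
contacts = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
balanced = cop_centroid_distance(contacts, [1.0, 1.0, 1.0, 1.0])
skewed = cop_centroid_distance(contacts, [3.0, 1.0, 1.0, 3.0])
```

With equal forces the CoP coincides with the centroid; loading two fingertips more heavily shifts the CoP toward them, which is the imbalance that force redistribution would correct.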
POIROT: Investigating Direct Tangible vs. Digitally Mediated Interaction and Attitude Moderation in Multi-party Murder Mystery Games
As social robots take on increasingly complex roles like game masters (GMs) in multi-party games, the expectation that physicality universally enhances user experience remains debated. This study challenges the "one-size-fits-all" view of tangible interaction by identifying a critical boundary condition: users' Negative Attitudes towards Robots (NARS). In a between-subjects experiment (N = 67), a custom-built robot GM facilitated a multi-party murder mystery game (MMG) by delivering clues either through direct tangible interaction or a digitally mediated interface. Baseline multivariate analysis (MANOVA) showed no significant main effect of delivery modality, confirming that tangibility alone does not guarantee superior engagement. However, primary analysis using multilevel linear models (MLM) revealed a reliable moderation: participants high in NARS experienced markedly lower narrative immersion under tangible delivery, whereas those with low NARS scores showed no such decrement. Qualitative findings further illuminate this divergence: tangibility provides novelty and engagement for some but imposes excessive proxemic friction for anxious users, for whom the digital interface acts as a protective social buffer. These results advance a conditional model of HRI and emphasize the necessity for adaptive systems that can tailor interaction modalities to user predispositions.
comment: 16 pages, 7 figures. Accepted to the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI 2026)
UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing
Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1\%/34.1\% Acc@0.25/0.5 on ScanRefer and 28.7\% Acc@0.25 on EmbodiedScan, establishing a new state of the art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing that training-free reasoning generalizes robustly beyond curated benchmarks.
comment: 14 pages, 6 figures, 3 tables
TRIAGE: Type-Routed Interventions via Aleatoric-Epistemic Gated Estimation in Robotic Manipulation and Adaptive Perception -- Don't Treat All Uncertainty the Same
Most uncertainty-aware robotic systems collapse prediction uncertainty into a single scalar score and use it to trigger uniform corrective responses. This aggregation obscures whether uncertainty arises from corrupted observations or from mismatch between the learned model and the true system dynamics. As a result, corrective actions may be applied to the wrong component of the closed loop, degrading performance relative to leaving the policy unchanged. We introduce a lightweight post hoc framework that decomposes uncertainty into aleatoric and epistemic components and uses these signals to regulate system responses at inference time. Aleatoric uncertainty is estimated from deviations in the observation distribution using a Mahalanobis density model, while epistemic uncertainty is detected using a noise-robust forward dynamics ensemble that isolates model mismatch from measurement corruption. The two signals remain empirically near-orthogonal during closed-loop execution and enable type-specific responses. High aleatoric uncertainty triggers observation recovery, while high epistemic uncertainty moderates control actions. The same signals also regulate adaptive perception by guiding model capacity selection during tracking inference. Experiments demonstrate consistent improvements across both control and perception tasks. In robotic manipulation, the decomposed controller improves task success from 59.4% to 80.4% under compound perturbations and outperforms a combined uncertainty baseline by up to 21.0%. In adaptive tracking inference on MOT17, uncertainty-guided model selection reduces average compute by 58.2% relative to a fixed high-capacity detector while preserving detection quality within 0.4%. Code and demo videos are available at https://divake.github.io/uncertainty-decomposition/.
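The type-routed idea in the abstract above can be sketched roughly as follows. This is an illustrative simplification, not the released implementation: it assumes a single Gaussian fit for the Mahalanobis score, uses prediction variance across ensemble members as the epistemic signal, and checks the aleatoric signal first when both are high (the abstract does not specify a priority). Function names and thresholds are hypothetical.

```python
import numpy as np


def fit_gaussian(observations):
    """Fit mean and inverse covariance to clean observations (n_samples, dim >= 2)."""
    obs = np.asarray(observations, dtype=float)
    mean = obs.mean(axis=0)
    # Small ridge term keeps the covariance invertible (assumption).
    cov = np.cov(obs, rowvar=False) + 1e-6 * np.eye(obs.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis(x, mean, cov_inv):
    """Aleatoric proxy: distance of an observation from the fitted distribution."""
    d = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(d @ cov_inv @ d))


def ensemble_disagreement(predictions):
    """Epistemic proxy: mean per-dimension variance across ensemble predictions."""
    preds = np.asarray(predictions, dtype=float)  # (n_models, state_dim)
    return float(preds.var(axis=0).mean())


def route(aleatoric, epistemic, tau_a, tau_e):
    """Pick a type-specific response; thresholds tau_a, tau_e are hypothetical."""
    if aleatoric > tau_a:
        return "recover_observation"
    if epistemic > tau_e:
        return "moderate_action"
    return "nominal"


rng = np.random.default_rng(0)
mean, cov_inv = fit_gaussian(rng.normal(size=(500, 3)))
near = mahalanobis(mean + 0.1, mean, cov_inv)
far = mahalanobis(mean + 10.0, mean, cov_inv)
```

Keeping the two scores separate until the routing step is what allows a corrupted observation and a dynamics mismatch to trigger different responses.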
SaiVLA-0: Cerebrum--Pons--Cerebellum Tripartite Architecture for Compute-Aware Vision-Language-Action
We revisit Vision-Language-Action through a neuroscience-inspired triad. Biologically, the Cerebrum provides stable high-level multimodal priors and remains frozen; the Pons Adapter integrates these cortical features with real-time proprioceptive inputs and compiles intent into execution-ready tokens; and the Cerebellum (ParaCAT) performs fast, parallel categorical decoding for online control, with hysteresis/EMA/temperature/entropy for stability. A fixed-ratio schedule and two-stage feature caching make the system compute-aware and reproducible. Inspired by active, foveated vision, our wrist ROIs are geometrically tied to the end-effector via calibrated projection, providing a movement-stabilized, high-resolution view that is sensitive to fine-grained pose changes and complements the global context of the main view. The design is modular: upgrading the Cerebrum only retrains the Pons; changing robots only trains the Cerebellum; cerebellum-only RL can further refine control without touching high-level semantics. As a concept-and-protocol paper with preliminary evidence, we outline a timing protocol under matched conditions (GPU, resolution, batch) to verify anticipated efficiency gains. We also report preliminary LIBERO evidence showing that split feature caching reduces training time (7.5h to 4.5h) and improves average success (86.5% to 92.5%) under official N1.5 head-only training, and that SaiVLA-0 reaches 99.0% mean success.
comment: 14 pages, 3 figures
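The abstract above mentions hysteresis and EMA as stabilizers for categorical decoding without giving details. A minimal sketch of how such stabilizers are commonly combined is shown below. It is an assumption-laden illustration, not the ParaCAT decoder: the class name, the `alpha` smoothing factor, and the `margin` switching threshold are all hypothetical.

```python
def ema(prev, new, alpha):
    """Exponential moving average of two equally sized probability vectors."""
    return [alpha * n + (1.0 - alpha) * p for p, n in zip(prev, new)]


class HysteresisSelector:
    """Keeps the current category unless a rival leads by more than `margin`."""

    def __init__(self, alpha=0.5, margin=0.1):
        self.alpha = alpha      # weight of the newest prediction
        self.margin = margin    # lead required before switching category
        self.smoothed = None
        self.current = None

    def update(self, probs):
        if self.smoothed is None:
            # First call: adopt the raw prediction as the initial state.
            self.smoothed = list(probs)
            self.current = max(range(len(probs)), key=lambda i: probs[i])
            return self.current
        self.smoothed = ema(self.smoothed, probs, self.alpha)
        best = max(range(len(self.smoothed)), key=lambda i: self.smoothed[i])
        lead = self.smoothed[best] - self.smoothed[self.current]
        if best != self.current and lead > self.margin:
            self.current = best
        return self.current


selector = HysteresisSelector(alpha=0.5, margin=0.1)
choices = [selector.update(p) for p in
           ([0.6, 0.4], [0.45, 0.55], [0.4, 0.6], [0.2, 0.8])]
```

In this toy run the selector ignores the brief, low-margin preference for the second category and switches only once the smoothed lead is decisive, which is the kind of chatter suppression such stabilizers target in online control.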
Towards Human-Like Manipulation through RL-Augmented Teleoperation and Mixture-of-Dexterous-Experts VLA
While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation, specifically contact-rich in-hand operations, introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of reinforcement learning-trained atomic skills that plays a dual role: it acts as a shared-autonomy assistant to simplify teleoperation data collection, and it serves as a callable low-level execution primitive for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. By utilizing a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating doubled success rate improvement over the baseline in dexterous contact-rich tasks.
comment: Project Homepage: https://sites.google.com/view/mode-vla
DeReCo: Decoupling Representation and Coordination Learning for Object-Adaptive Decentralized Multi-Robot Cooperative Transport
Generalizing decentralized multi-robot cooperative transport across objects with diverse shapes and physical properties remains a fundamental challenge. Under decentralized execution, two key challenges arise: object-dependent representation learning under partial observability and coordination learning in multi-agent reinforcement learning (MARL) under non-stationarity. A typical approach jointly optimizes object-dependent representations and coordinated policies in an end-to-end manner while randomizing object shapes and physical properties during training. However, this joint optimization tightly couples representation and coordination learning, introducing bidirectional interference: inaccurate representations under partial observability destabilize coordination learning, while non-stationarity in MARL further degrades representation learning, resulting in sample-inefficient training. To address this structural coupling, we propose DeReCo, a novel MARL framework that decouples representation and coordination learning for object-adaptive multi-robot cooperative transport, improving sample efficiency and generalization across objects and transport scenarios. DeReCo adopts a three-stage training strategy: (1) centralized coordination learning with privileged object information, (2) reconstruction of object-dependent representations from local observations, and (3) progressive removal of privileged information for decentralized execution. This decoupling mitigates interference between representation and coordination learning and enables stable and sample-efficient training. Experimental results show that DeReCo outperforms baselines in simulation on three training objects, generalizes to six unseen objects with varying masses and friction coefficients, and achieves superior performance on two unseen objects in real-robot experiments.
comment: 9 pages, 7 figures
Adaptive Vision-Based Control of Redundant Robots with Null-Space Interaction for Human-Robot Collaboration
Human-robot collaboration aims to extend human ability through cooperation with robots. This technology is currently helping people with physical disabilities, has transformed the manufacturing process of companies, improved surgical performance, and will likely revolutionize the daily lives of everyone in the future. Being able to enhance the performance of both sides, such that human-robot collaboration outperforms a single robot/human, remains an open issue. For safer and more effective collaboration, a new control scheme has been proposed for redundant robots in this paper, consisting of an adaptive vision-based control term in task space and an interactive control term in null space. Such a formulation allows the robot to autonomously carry out tasks in an unknown environment without prior calibration while also interacting with humans to deal with unforeseen changes (e.g., potential collision, temporary needs) under the redundant configuration. The decoupling between task space and null space helps to explore the collaboration safely and effectively without affecting the main task of the robot end-effector. The stability of the closed-loop system has been rigorously proved with Lyapunov methods, and both the convergence of the position error in task space and that of the damping model in null space are guaranteed. The experimental results of a robot manipulator guided with the technology of augmented reality (AR) are presented to illustrate the performance of the control scheme.
MRDrive: An Open Source Mixed Reality Driving Simulator for Automotive User Research
Designing and evaluating in-vehicle interfaces requires experimental platforms that combine ecological validity with experimental control. Driving simulators are widely used for this purpose. However, they face a fundamental trade-off: high-fidelity physical simulators are costly and difficult to adapt, while virtual reality simulators provide flexibility at the expense of physical interaction with the vehicle. In this work, we present MRDrive, an open mixed-reality driving simulator designed to support HCI research on in-vehicle interaction, attention, and explainability in manual and automated driving contexts. MRDrive enables drivers and passengers to interact with a real vehicle cabin while being fully immersed in a virtual driving environment. We demonstrate the capabilities of MRDrive through a small pilot study that illustrates how the simulator can be used to collect and analyze eye-tracking and touch interaction data in an automated driving scenario. MRDrive is available at: https://github.com/ciao-group/mrdrive
comment: This version has been accepted at CHI 2026
See and Switch: Vision-Based Branching for Interactive Robot-Skill Programming
Programming robots by demonstration (PbD) is an intuitive concept, but scaling it to real-world variability remains a challenge for most current teaching frameworks. Conditional task graphs are very expressive and can be defined incrementally, which fits very well with the PbD idea. However, acting using conditional task graphs requires reliable perception-grounded online branch selection. In this paper, we present See & Switch, an interactive teaching-and-execution framework that represents tasks as user-extendable graphs of skill parts connected via decision states (DS), enabling conditional branching during replay. Unlike prior approaches that rely on manual branching or low-dimensional signals (e.g., proprioception), our vision-based Switcher uses high-dimensional eye-in-hand images to select among competing successor skill parts and to detect out-of-distribution contexts that require new demonstrations. We integrate kinesthetic teaching, joystick control, and hand gestures via an input-modality-abstraction layer and demonstrate that our proposed method is independent of the teaching modality, enabling efficient in-situ recovery demonstrations. The system is validated in experiments on three challenging dexterous manipulation tasks. We evaluate our method under diverse conditions and furthermore conduct user studies with 8 participants. We show that the proposed method reliably performs branch selection and anomaly detection for novice users, achieving 90.7% and 87.9% accuracy, respectively, across 576 real-robot rollouts. We provide all code and data required to reproduce our experiments at http://imitrob.ciirc.cvut.cz/publications/seeandswitch.
comment: 8 pages, 11 figures
Trajectory Tracking Control Design for Autonomous Helicopters with Guaranteed Error Bounds
This paper presents a systematic framework for computing formally guaranteed trajectory tracking error bounds for autonomous helicopters based on Robust Positive Invariant (RPI) sets. The approach focuses on establishing a closed-loop translational error dynamics which is cast into polytopic linear parameter-varying form with bounded additive and state-dependent disturbances. Ellipsoidal RPI sets are computed, yielding explicit position error bounds suitable as certified buffer zones in upper-level trajectory planning. Three controller architectures are compared with respect to the conservatism of their error bounds and tracking performance. Simulation results on a nonlinear helicopter model demonstrate that all architectures respect the derived bounds, while highlighting trade-offs between dynamical fidelity and conservatism in invariant set computation.
comment: Submitted to the 2026 International Conference on Unmanned Aircraft Systems (ICUAS)
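As a worked illustration of how an ellipsoidal invariant set yields explicit position error bounds, consider the following standard identity. The notation is an assumption and is not taken from the paper: $e$ denotes the tracking error, $P \succ 0$ the shape matrix of the set, and $c$ a direction of interest.

```latex
\mathcal{E} = \{\, e : e^\top P e \le 1 \,\}, \qquad
\max_{e \in \mathcal{E}} \lvert c^\top e \rvert
  = \sqrt{c^\top P^{-1} c}
\quad\Longrightarrow\quad
\lvert e_i \rvert \le \sqrt{\left(P^{-1}\right)_{ii}}
\quad \text{for all } e \in \mathcal{E}.
```

The identity follows from the Cauchy-Schwarz inequality applied to $c^\top e = (P^{-1/2} c)^\top (P^{1/2} e)$. Taking $c$ as the unit vector along a position axis gives a per-axis bound of the kind that could serve as a buffer zone in planning.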
AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
Generating human grasping poses that accurately reflect both object geometry and user-specified interaction semantics is essential for natural hand-object interactions in AR/VR and embodied AI. However, existing semantic grasping approaches struggle with the large modality gap between 3D object representations and textual instructions, and often lack explicit spatial or semantic constraints, leading to physically invalid or semantically inconsistent grasps. In this work, we present AffordGrasp, a diffusion-based framework that produces physically stable and semantically faithful human grasps with high precision. We first introduce a scalable annotation pipeline that automatically enriches hand-object interaction datasets with fine-grained structured language labels capturing interaction intent. Building upon these annotations, AffordGrasp integrates an affordance-aware latent representation of hand poses with a dual-conditioning diffusion process, enabling the model to jointly reason over object geometry, spatial affordances, and instruction semantics. A distribution adjustment module further enforces physical contact consistency and semantic alignment. We evaluate AffordGrasp across four instruction-augmented benchmarks derived from HO-3D, OakInk, GRAB, and AffordPose, and observe substantial improvements over state-of-the-art methods in grasp quality, semantic accuracy, and diversity.
Vector Field Augmented Differentiable Policy Learning for Vision-Based Drone Racing
Autonomous drone racing in complex environments requires agile, high-speed flight while maintaining reliable obstacle avoidance. Differentiable-physics-based policy learning has recently demonstrated high sample efficiency and remarkable performance across various tasks, including agile drone flight and quadruped locomotion. However, applying such methods to drone racing remains difficult, as key objectives such as gate traversal are inherently hard to express as smooth, differentiable losses. To address these challenges, we propose DiffRacing, a novel vector field-augmented differentiable policy learning framework. DiffRacing integrates differentiable losses and vector fields into the training process to provide continuous and stable gradient signals, balancing obstacle avoidance and high-speed gate traversal. In addition, a differentiable Delta Action Model compensates for dynamics mismatch, enabling efficient sim-to-real transfer without explicit system identification. Extensive simulation and real-world experiments demonstrate that DiffRacing achieves superior sample efficiency, faster convergence, and robust flight performance, thereby demonstrating that vector fields can augment traditional gradient-based policy learning with a task-specific geometric prior.
comment: 8 pages, 7 figures, RAL 2026 March
Dual-Horizon Hybrid Internal Model for Low-Gravity Quadrupedal Jumping with Hardware-in-the-Loop Validation
Locomotion under reduced gravity is commonly realized through jumping, yet continuous pronking in lunar gravity remains challenging due to prolonged flight phases and sparse ground contact. The extended aerial duration increases landing impact sensitivity and makes stable attitude regulation over rough planetary terrain difficult. Existing approaches primarily address single jumps on flat surfaces and lack both continuous-terrain solutions and realistic hardware validation. This work presents a Dual-Horizon Hybrid Internal Model for continuous quadrupedal jumping under lunar gravity using proprioceptive sensing only. Two temporal encoders capture complementary time scales: a short-horizon branch models rapid vertical dynamics with explicit vertical velocity estimation, while a long-horizon branch models horizontal motion trends and center-of-mass height evolution across the jump cycle. The fused representation enables stable and continuous jumping under extended aerial phases characteristic of lunar gravity. To provide hardware-in-the-loop validation, we develop the MATRIX (Mixed-reality Adaptive Testbed for Robotic Integrated eXploration) platform, a digital-twin-driven system that offloads gravity through a pulley-counterweight mechanism and maps Unreal Engine lunar terrain to a motion platform and treadmill in real time. Using MATRIX, we demonstrate continuous jumping of a quadruped robot under lunar-gravity emulation across cratered lunar-like terrain.
Aero-Promptness: Drag-Aware Aerodynamic Manipulability for Propeller-driven Vehicles
This work introduces the Drag-Aware Aerodynamic Manipulability (DAAM), a geometric framework for control allocation in redundant multirotors. By equipping the propeller spin-rate space with a Riemannian metric based on the remaining symmetric acceleration capacity of each motor, the formulation explicitly accounts for motor torque limits and aerodynamic drag. Mapping this metric through the nonlinear thrust law to the generalized force space yields a state-dependent manipulability volume. The log-determinant of this volume acts as a natural barrier function, strictly penalizing drag-induced saturation and low-spin thrust loss. Optimizing this volume along the allocation fibers provides a redundancy resolution strategy inherently invariant to arbitrary coordinate scaling in the generalized-force space. Analytically, we prove that the resulting optimal allocations locally form smooth embedded manifolds, and we geometrically characterize the global jump discontinuities that inevitably arise from physical actuator limits and spin-rate sign transitions.
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size CVPR 2026
Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.
comment: CVPR 2026. Project page: https://splionar.github.io/TeamHOI/ Code: https://github.com/sail-sg/TeamHOI
VORL-EXPLORE: A Hybrid Learning Planning Approach to Multi-Robot Exploration in Dynamic Environments
Hierarchical multi-robot exploration commonly decouples frontier allocation from local navigation, which can make the system brittle in dense and dynamic environments. Because the allocator lacks direct awareness of execution difficulty, robots may cluster at bottlenecks, trigger oscillatory replanning, and generate redundant coverage. We propose VORL-EXPLORE, a hybrid learning and planning framework that addresses this limitation through execution fidelity, a shared estimate of local navigability that couples task allocation with motion execution. This fidelity signal is incorporated into a fidelity-coupled Voronoi objective with inter-robot repulsion to reduce contention before it emerges. It also drives a risk-aware adaptive arbitration mechanism between global A* guidance and a reactive reinforcement learning policy, balancing long-range efficiency with safe interaction in confined spaces. The framework further supports online self-supervised recalibration of the fidelity model using pseudo-labels derived from recent progress and safety outcomes, enabling adaptation to non-stationary obstacles without manual risk tuning. We evaluate this capability separately in a dedicated severe-traffic ablation. Extensive experiments in randomized grids and a Gazebo factory scenario show high success rates, shorter path length, lower overlap, and robust collision avoidance. The source code will be made publicly available upon acceptance.
RAPID: Redundancy-Aware and Compatibility-Optimal Edge-Cloud Partitioned Inference for Diverse VLA models
Vision-Language-Action (VLA) models are mainstream in embodied intelligence but face high inference costs. Edge-Cloud Collaborative (ECC) inference offers an effective fix by easing edge-device computing pressure to meet real-time needs. However, existing ECC frameworks are suboptimal for VLA models due to two challenges: (1) mainstream environment-oriented edge-cloud partitioning methods are susceptible to interference from visual noise; (2) existing edge-cloud partitioning methods overlook the step-wise redundancy unique to embodied tasks, thereby disrupting the physical continuity of motion. To address these issues, we propose a novel ECC inference framework, termed RAPID. Specifically, we developed an implementation tailored to the proposed framework. Experiments demonstrate that this achieves a speedup of up to 1.73x with only 5% to 7% overhead.
Unified Structural-Hydrodynamic Modeling of Underwater Underactuated Mechanisms and Soft Robots
Underwater robots are widely deployed for ocean exploration and manipulation. Underactuated mechanisms are particularly advantageous in aquatic environments, as reducing actuator count lowers the risk of motor leakage while introducing inherent mechanical compliance. However, accurate modeling of underwater underactuated and soft robotic systems remains challenging because it requires identifying a high-dimensional set of internal structural and external hydrodynamic parameters. In this work, we propose a trajectory-driven global optimization framework for unified structural-hydrodynamic modeling of underwater multibody systems. Inspired by the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), the proposed approach simultaneously identifies coupled internal elastic, damping, and distributed hydrodynamic parameters through trajectory-level matching between simulation and experimental motion. This enables high-fidelity reproduction of both underactuated mechanisms and compliant soft robotic systems in underwater environments. We first validate the framework on a link-by-link underactuated multibody mechanism, demonstrating accurate identification of distributed hydrodynamic coefficients, with a normalized end effector position error below 5% across multiple trajectories, varying initial conditions, and both active-passive and fully passive configurations. The identified modeling strategy is then transferred to a single octopus-inspired soft arm, showing strong real-to-sim consistency without manual retuning. Finally, eight identified arms are assembled into a swimming octopus robot, where the unified parameter set enables realistic whole body behavior without additional parameter calibration. These results demonstrate the scalability and transferability of the proposed structural-hydrodynamic modeling framework across underwater underactuated and soft robotic systems.
comment: The first two listed authors contributed equally. Yiyuan Zhang is the corresponding author
Omnidirectional Humanoid Locomotion on Stairs via Unsafe Stepping Penalty and Sparse LiDAR Elevation Mapping
Humanoid robots, characterized by numerous degrees of freedom and a high center of gravity, are inherently unstable. Safe omnidirectional locomotion on stairs requires both omnidirectional terrain perception and reliable foothold selection. Existing methods often rely on forward-facing depth cameras, which create blind zones that restrict omnidirectional mobility. Furthermore, sparse post-contact unsafe stepping penalties lead to low learning efficiency and suboptimal strategies. To realize safe stair-traversal gaits, this paper introduces a single-stage training framework incorporating a dense unsafe stepping penalty that provides continuous feedback as the foot approaches a hazardous placement. To obtain stable and reliable elevation maps, we build a rolling point-cloud mapping system with spatiotemporal confidence decay and a self-protection zone mechanism, producing temporally consistent local maps. These maps are further refined by an Edge-Guided Asymmetric U-Net (EGAU), which mitigates reconstruction distortion caused by sparse LiDAR returns on stair risers. Simulation and real-robot experiments show that the proposed method achieves a near-100\% safe stepping rate on stair terrains in simulation, while maintaining a remarkably high safe stepping rate in real-world deployments. Furthermore, it completes a continuous long-distance walking test on complex outdoor terrains, demonstrating reliable sim-to-real transfer and long-term stability.
Long-Short Term Agents for Pure-Vision Bronchoscopy Robotic Autonomy
Accurate intraoperative navigation is essential for robot-assisted endoluminal intervention, but remains difficult because of limited endoscopic field of view and dynamic artifacts. Existing navigation platforms often rely on external localization technologies, such as electromagnetic tracking or shape sensing, which increase hardware complexity and remain vulnerable to intraoperative anatomical mismatch. We present a vision-only autonomy framework that performs long-horizon bronchoscopic navigation using preoperative CT-derived virtual targets and live endoscopic video, without external tracking during navigation. The framework uses hierarchical long-short agents: a short-term reactive agent for continuous low-latency motion control, and a long-term strategic agent for decision support at anatomically ambiguous points. When their recommendations conflict, a world-model critic predicts future visual states for candidate actions and selects the action whose predicted state best matches the target view. We evaluated the system in a high-fidelity airway phantom, three ex vivo porcine lungs, and a live porcine model. The system reached all planned segmental targets in the phantom, maintained 80\% success to the eighth generation ex vivo, and achieved in vivo navigation performance comparable to the expert bronchoscopist. These results support the preclinical feasibility of sensor-free autonomous bronchoscopic navigation.
DyQ-VLA: Temporal-Dynamic-Aware Quantization for Embodied Vision-Language-Action Models
Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by inference overheads. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remains unsolved. To address these challenges, we propose DyQ-VLA, a dynamic quantization framework for VLAs. Specifically, a sensitivity-aware switching strategy leverages real-time kinematic proxies to trigger the bit-width switch, while a kinematic-guided module dynamically allocates the optimal bit-width. Experiments show that DyQ-VLA requires only 30.9% of the original memory footprint while maintaining 99.5% of its original performance, achieving 1.49x simulation and up to 1.43x real-world speedups.
NaviDriveVLM: Decoupling High-Level Reasoning and Motion Planning for Autonomous Driving
Vision-language models (VLMs) have emerged as a promising direction for end-to-end autonomous driving (AD) by jointly modeling visual observations, driving context, and language-based reasoning. However, existing VLM-based systems face a trade-off between high-level reasoning and motion planning: large models offer strong semantic understanding but are costly to adapt for precise control, whereas small VLM models can be fine-tuned efficiently but often exhibit weaker reasoning. We propose NaviDriveVLM, a decoupled framework that separates reasoning from action generation using a large-scale Navigator and a lightweight trainable Driver. This design preserves reasoning ability, reduces training cost, and provides an explicit interpretable intermediate representation for downstream planning. Experiments on the nuScenes benchmark show that NaviDriveVLM outperforms large VLM baselines in end-to-end motion planning.
RoboRouter: Training-Free Policy Routing for Robotic Manipulation
Research on robotic manipulation has developed a diverse set of policy paradigms, including vision-language-action (VLA) models, vision-action (VA) policies, and code-based compositional approaches. Concrete policies typically attain high success rates on specific task distributions but generalize poorly beyond them. Rather than proposing another monolithic policy, we propose to leverage the complementary strengths of existing approaches through intelligent policy routing. We introduce RoboRouter, a training-free framework that maintains a pool of heterogeneous policies and learns to select the best-performing policy for each task through accumulated execution experience. Given a new task, RoboRouter constructs a semantic task representation, retrieves historical records of similar tasks, predicts the optimal policy choice without requiring trial-and-error, and incorporates structured feedback to refine subsequent routing decisions. Integrating a new policy into the system requires only lightweight evaluation and incurs no training overhead. Across simulation benchmarks and real-world evaluations, RoboRouter consistently outperforms individual policies, improving average success rate by more than 3% in simulation and over 13% in real-world settings, while preserving execution efficiency. Our results demonstrate that intelligent routing across heterogeneous, off-the-shelf policies provides a practical and scalable pathway toward building more capable robotic systems.
Identifying Influential Actions in Human-Robot Interactions
Human-robot interaction combines robotics, cognitive science, and human factors to study collaborative systems. This paper introduces a method for identifying influential robot actions using transfer entropy, a statistic that measures directed information transfer between time series. TE is effective for capturing complex, nonlinear interactions. We apply this method to analyze how robot actions affect human behavior during a conversation with a remotely controlled robot avatar. By focusing on the impact of proximity, our approach demonstrates TE's capability to identify key actions influencing human responses, highlighting its potential to improve the design and adaptability of robotic systems.
comment: Presented at the 30th International Symposium on Artificial Life and Robotics (AROB 30th). Beppu, Japan, January 2025
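Transfer entropy, the statistic used in the entry above, can be illustrated with a small plug-in estimator for discrete time series. This is only a sketch and not the authors' estimator: the function name, the history length of one step, and the assumption that both series are already discretized into symbols are illustrative choices. It computes TE(X -> Y) as the sum over observed triples of p(y_next, y_now, x_now) * log2[ p(y_next | y_now, x_now) / p(y_next | y_now) ].

```python
import random
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1."""
    if len(source) != len(target):
        raise ValueError("series must have equal length")
    n = len(target) - 1  # number of (next, now) transitions
    triples = Counter()   # counts of (y_next, y_now, x_now)
    pairs_yx = Counter()  # counts of (y_now, x_now)
    pairs_yy = Counter()  # counts of (y_next, y_now)
    singles = Counter()   # counts of y_now
    for t in range(n):
        y_next, y_now, x_now = target[t + 1], target[t], source[t]
        triples[(y_next, y_now, x_now)] += 1
        pairs_yx[(y_now, x_now)] += 1
        pairs_yy[(y_next, y_now)] += 1
        singles[y_now] += 1
    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y_now, x_now)]
        p_given_self = pairs_yy[(y_next, y_now)] / singles[y_now]
        te += p_joint * log2(p_given_both / p_given_self)
    return te


# Hypothetical toy data: the "human" series copies the "robot" series
# with a one-step delay, so information flows from robot to human only.
rng = random.Random(0)
robot = [rng.randint(0, 1) for _ in range(2000)]
human = [0] + robot[:-1]
te_robot_to_human = transfer_entropy(robot, human)
te_human_to_robot = transfer_entropy(human, robot)
```

In this toy setup the robot-to-human estimate should come out close to one bit and the reverse direction close to zero, which is the asymmetry that makes the statistic useful for identifying influential actions. Real behavioral signals are continuous and would need binning or a continuous estimator, which this sketch does not cover.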
Choose What to Observe: Task-Aware Semantic-Geometric Representations for Visuomotor Policy
Visuomotor policies learned from demonstrations often overfit to nuisance visual factors in raw RGB observations, resulting in brittle behavior under appearance shifts such as background changes and object recoloring. We propose a task-aware observation interface that canonicalizes visual input into a shared representation, improving robustness to out-of-distribution (OOD) appearance changes without modifying or fine-tuning the policy. Given an RGB image and an open-vocabulary specification of task-relevant entities, we use SAM3 to segment the target object and robot/gripper. We construct an L0 observation by repainting segmented entities with predefined semantic colors on a constant background. For tasks requiring stronger geometric cues, we further inject monocular depth from Depth Anything 3 into the segmented regions via depth-guided overwrite, yielding a unified semantic-geometric observation (L1) that remains a standard 3-channel, image-like input. We evaluate on RoboMimic (Lift), ManiSkill YCB grasping under clutter, four RLBench tasks under controlled appearance shifts, and two real-world Franka tasks (ReachX and CloseCabinet). Across benchmarks and policy backbones (Flow Matching Policy and SmolVLA), our interface preserves in-distribution performance while substantially improving robustness under OOD visual shifts.
Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations
Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
PlayWorld: Learning Robot World Models from Autonomous Play
Action-conditioned video models offer a promising path to building general-purpose robot simulators that can improve directly from data. Yet, despite training on large-scale robot datasets, current state-of-the-art video models still struggle to predict physically consistent robot-object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high-fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success-biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self-play, enabling naturally scalable data collection while capturing complex, long-tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high-quality, physically consistent predictions for contact-rich interactions that are not captured by world models trained on human-collected data. We further demonstrate the versatility of PlayWorld in enabling fine-grained failure prediction and policy evaluation, with up to 40% improvements over human-collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.
comment: https://robot-playworld.github.io/
Improving through Interaction: Searching Behavioral Representation Spaces with CMA-ES-IG
Robots that interact with humans must adapt to individual users' preferences to operate effectively in human-centered environments. An intuitive and effective technique to learn non-expert users' preferences is through rankings of robot behaviors, e.g., trajectories, gestures, or voices. Existing techniques primarily focus on generating queries that optimize preference learning outcomes, such as sample efficiency or final preference estimation accuracy. However, the focus on outcome overlooks key user expectations in the process of providing these rankings, which can negatively impact users' adoption of robotic systems. This work proposes the Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG) algorithm. CMA-ES-IG explicitly incorporates user experience considerations into the preference learning process by suggesting perceptually distinct and informative trajectories for users to rank. We demonstrate these benefits through both simulated studies and real-robot experiments. CMA-ES-IG, compared to state-of-the-art alternatives, (1) scales more effectively to higher-dimensional preference spaces, (2) maintains computational tractability for high-dimensional problems, (3) is robust to noisy or inconsistent user feedback, and (4) is preferred by non-expert users in identifying their preferred robot behaviors. This project's code is available at github.com/interaction-lab/CMA-ES-IG
comment: Under submission to IJRR
Characterization, Analytical Planning, and Hybrid Force Control for the Inspire RH56DFX Hand
Commercially accessible dexterous robot hands are increasingly prevalent, but many remain difficult to use as scientific instruments. For example, the Inspire RH56DFX hand exposes only uncalibrated proprioceptive information and shows unreliable contact behavior at high speed (up to 1618% force limit overshoot). Furthermore, its underactuated, coupled finger linkages make antipodal grasps non-trivial. We contribute three improvements to the Inspire RH56DFX to transform it from a black-box device to a research tool: (1) hardware characterization (force calibration, latency, and overshoot), (2) a sim2real validated MuJoCo model for analytical width-to-grasp planning, and (3) a hybrid, closed-loop speed-force grasp controller. We validate these components on peg-in-hole insertion, achieving 65% success and outperforming a wrist-force-only baseline of 10%, and on 300 grasps across 15 physically diverse objects, achieving 87% success and outperforming plan-free grasps and learned grasps. Our approach is modular, designed for compatibility with external object detectors and vision-language models for width and force estimation and high-level planning, and provides an interpretable and immediately deployable interface for dexterous manipulation with the Inspire RH56DFX hand, open-sourced at https://correlllab.github.io/rh56dfx.html.
SurgCalib: Gaussian Splatting-Based Hand-Eye Calibration for Robot-Assisted Minimally Invasive Surgery
We present a Gaussian Splatting-based framework for hand-eye calibration of the da Vinci surgical robot. In a vision-guided robotic system, accurate estimation of the rigid transformation between the robot base and the camera frame is essential for reliable closed-loop control. For cable-driven surgical robots, this task faces unique challenges. The encoders of surgical instruments often produce inaccurate proprioceptive measurements due to cable stretch and backlash. Conventional hand-eye calibration approaches typically rely on known fiducial patterns and solve the AX = XB formulation. While effective, introducing additional markers into the operating room (OR) environment can violate sterility protocols and disrupt surgical workflows. In this study, we propose SurgCalib, an automatic, markerless framework that has the potential to be used in the OR. SurgCalib first initializes the pose of the surgical instrument using raw kinematic measurements and subsequently refines this pose through a two-phase optimization procedure under the RCM constraint within a Gaussian Splatting-based differentiable rendering pipeline. We evaluate the proposed method on the public dVRK benchmark, SurgPose. The results demonstrate average 2D tool-tip reprojection errors of 12.24 px (2.06 mm) and 11.33 px (1.9 mm), and 3D tool-tip Euclidean distance errors of 5.98 mm and 4.75 mm, for the left and right instruments, respectively.
comment: 9 pages, 7 figures
FAME: Force-Adaptive RL for Expanding the Manipulation Envelope of a Full-Scale Humanoid
Maintaining balance under external hand forces is critical for humanoid bimanual manipulation, where interaction forces propagate through the kinematic chain and constrain the feasible manipulation envelope. We propose FAME, a force-adaptive reinforcement learning framework that conditions a standing policy on a learned latent context encoding upper-body joint configuration and bimanual interaction forces. During training, we apply diverse, spherically sampled 3D forces on each hand to inject disturbances in simulation together with an upper-body pose curriculum, exposing the policy to manipulation-induced perturbations across continuously varying arm configurations. At deployment, interaction forces are estimated from the robot dynamics and fed to the same encoder, enabling online adaptation without wrist force/torque sensors. In simulation across five fixed arm configurations with randomized hand forces and commanded base heights, FAME improves mean standing success to 73.84%, compared to 51.40% for the curriculum-only baseline and 29.44% for the base policy. We further deploy the learned policy on a full-scale Unitree H12 humanoid and evaluate robustness in representative load-interaction scenarios, including asymmetric single-arm load and symmetric bimanual load. Code and videos are available at https://fame10.github.io/Fame/
Formation-Aware Adaptive Conformalized Perception for Safe Leader-Follower Multi-Robot Systems
This paper considers the perception safety problem in distributed vision-based leader-follower formations, where each robot uses onboard perception to estimate relative states, track desired setpoints, and keep the leader within its camera field of view (FOV). Safety is challenging due to heteroscedastic perception errors and the coupling between formation maneuvers and visibility constraints. We propose a distributed, formation-aware adaptive conformal prediction method based on Risk-Aware Mondrian CP to produce formation-conditioned uncertainty quantiles. The resulting bounds tighten in high-risk configurations (near FOV limits) and relax in safer regions. We integrate these bounds into a Formation-Aware Conformal CBF-QP with a smooth margin to enforce visibility while maintaining feasibility and tracking performance. Gazebo simulations show improved formation success rates and tracking accuracy over non-adaptive (global) CP baselines that ignore formation-dependent visibility risk, while preserving finite-sample probabilistic safety guarantees. The experimental videos are available on the project website: https://nail-uh.github.io/iros2026.github.io/.
comment: 8 pages, 8 figures
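The idea of group-conditioned conformal quantiles in the entry above can be sketched with standard split (Mondrian) conformal prediction, where a separate quantile is calibrated for each bin of calibration scores. This is a generic illustration, not the paper's Risk-Aware variant: the bin names, the score values, and the function names are hypothetical. For a bin with n calibration scores and miscoverage level alpha, the bound is the ceil((n + 1)(1 - alpha))-th smallest score, or infinity when the bin is too small to support that level.

```python
from collections import defaultdict
from math import ceil, inf


def conformal_quantile(scores, alpha):
    """Split-conformal quantile with the finite-sample (n + 1) correction."""
    n = len(scores)
    k = ceil((n + 1) * (1 - alpha))
    if k > n:
        return inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]


def mondrian_quantiles(scores, groups, alpha):
    """Calibrate one quantile per group label (Mondrian conformal prediction)."""
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {g: conformal_quantile(s, alpha) for g, s in by_group.items()}


# Hypothetical calibration errors (e.g. perception error magnitudes),
# binned by a made-up formation label.
scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
          0.10, 0.20, 0.30, 0.40,
          0.50, 0.60]
groups = ["centered"] * 9 + ["near_fov_limit"] * 4 + ["rare"] * 2
bounds = mondrian_quantiles(scores, groups, alpha=0.25)
```

Each bin gets its own bound, so coverage holds per bin rather than only on average. How bins are defined from the formation state and how risk enters the calibration are specific to the paper and are not reproduced here.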
Fly, Track, Land: Infrastructure-less Magnetic Localization for Heterogeneous UAV-UGV Teaming
We present a complete infrastructure-less magneto-inductive (MI) localization system enabling a lightweight UAV to autonomously hover, track, and land with centimeter precision on a mobile quadruped robot acting as a dynamic docking pad. This work advances the vision of heterogeneous robot collaboration, where ultra-lightweight flying robots serve as mobile perception agents for Unmanned Ground Vehicles (UGVs). By extending the sensing horizon and providing complementary viewpoints, the UAVs enhance exploration efficiency and improve the quality of data collection in large-scale, unknown environments. The proposed system aims to complement traditional localization modalities with a compact, embedded, and infrastructure-less magnetic sensing approach, providing accurate short-range relative positioning to bridge the gap between coarse navigation and precise UAV docking. A single lightweight receive coil and a fully embedded estimation pipeline on the UAV deliver 20 Hz relative pose estimates in the UGV's frame, achieving a 3D position root-mean-square error (RMSE) of 5 cm. The system uses real-time estimation and a warm-started solver to estimate the 3D position, which is then fused with inertial and optical-flow measurements in the onboard extended Kalman filter. Real-world experiments validate the effectiveness of the framework, demonstrating significant improvements in UAV-UGV teaming in infrastructure-less scenarios compared to state-of-the-art methods, requiring no external anchors or global positioning. In dynamic scenarios, the UAV tracks and docks with a moving UGV while maintaining a 7.2 cm RMSE and achieving successful autonomous landings.
comment: Submitted to IEEE Transactions on Robotics (T-RO). Supplementary video available
Proprioceptive Safe Active Navigation and Exploration for Planetary Environments
Deformable granular terrains introduce significant locomotion and immobilization risks in planetary exploration and are difficult to detect via remote sensing (e.g., vision). Legged robots can sense terrain properties through leg-terrain interactions during locomotion, offering a direct means to assess traversability in deformable environments. How to systematically exploit this interaction-derived information for navigation planning, however, remains underexplored. We address this gap by presenting PSANE, a Proprioceptive Safe Active Navigation and Exploration framework that leverages leg-terrain interaction measurements for safe navigation and exploration in unknown deformable environments. PSANE learns a traversability model via Gaussian Process regression to estimate and certify safe regions and identify exploration frontiers online, and integrates these estimates with a reactive controller for real-time navigation. Frontier selection is formulated as a multi-objective optimization that balances safe-set expansion probability and goal-directed cost, with subgoals selected via scalarization over the Pareto-optimal frontier set. PSANE safely explores unknown granular terrain and reaches specified goals using only proprioceptively estimated traversability, while achieving performance improvements over baseline methods.
comment: 9 pages, 7 figures
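The frontier-selection step described in the entry above (balancing safe-set expansion probability against goal-directed cost, then scalarizing over the Pareto-optimal set) can be sketched in a few lines. This is a generic weighted-sum scalarization under stated assumptions, not the PSANE implementation: the candidate tuples, the cost normalization, and the `weight` parameter are all hypothetical.

```python
def pareto_front(candidates):
    """Keep candidates not dominated in (higher probability, lower cost).

    Each candidate is a tuple (name, expansion_probability, goal_cost).
    """
    front = []
    for a in candidates:
        dominated = any(
            b[1] >= a[1] and b[2] <= a[2] and (b[1] > a[1] or b[2] < a[2])
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return front


def select_subgoal(candidates, weight=0.5):
    """Pick one Pareto-optimal candidate by weighted-sum scalarization.

    `weight` in [0, 1] trades exploration (expansion probability)
    against progress toward the goal (normalized cost).
    """
    front = pareto_front(candidates)
    max_cost = max(c[2] for c in front) or 1.0  # avoid division by zero

    def score(c):
        return weight * c[1] - (1.0 - weight) * c[2] / max_cost

    return max(front, key=score)


# Hypothetical frontier candidates: (name, expansion probability, goal cost).
frontiers = [
    ("A", 0.9, 10.0),
    ("B", 0.5, 4.0),
    ("C", 0.4, 6.0),  # dominated by B: lower probability and higher cost
    ("D", 0.1, 1.0),
]
front_names = [c[0] for c in pareto_front(frontiers)]
balanced = select_subgoal(frontiers, weight=0.5)[0]
explore = select_subgoal(frontiers, weight=0.9)[0]
exploit = select_subgoal(frontiers, weight=0.1)[0]
```

Restricting the choice to the non-dominated set first means the scalarization weight only moves the subgoal along sensible trade-offs and never picks a candidate that is worse on both objectives.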
Why Channel-Centric Models are not Enough to Predict End-to-End Performance in Private 5G: A Measurement Campaign and Case Study
Communication-aware robot planning requires accurate predictions of wireless network performance. Current approaches rely on channel-level metrics such as received signal strength and signal-to-noise ratio, assuming these translate reliably into end-to-end throughput. We challenge this assumption through a measurement campaign in a private 5G industrial environment. We evaluate throughput predictions from a commercial ray-tracing simulator as well as data-driven Gaussian process regression models against measurements collected using a mobile robot. The study uses off-the-shelf user equipment in an underground, radio-shielded facility with detailed 3D modeling, representing a best-case scenario for prediction accuracy. The ray-tracing simulator captures the spatial structure of indoor propagation and predicts channel-level metrics with reasonable fidelity. However, it systematically over-predicts throughput, even in line-of-sight regions. The dominant error source is shown to be over-estimation of sustainable MIMO spatial layers: the simulator assumes near-uniform four-layer transmission while measurements reveal substantial adaptation between one and three layers. This mismatch inflates predicted throughput even when channel metrics appear accurate. In contrast, a Gaussian process model with a rational quadratic kernel achieves approximately two-thirds reduction in prediction error with near-zero bias by learning end-to-end throughput directly from measurements. These findings demonstrate that favorable channel conditions do not guarantee high throughput; communication-aware planners relying solely on channel-centric predictions risk overly optimistic trajectories that violate reliability requirements. Accurate throughput prediction for 5G systems requires either extensive calibration of link-layer models or data-driven approaches that capture real system behavior.
Adaptive SINDy: Residual Force System Identification Based UAV Disturbance Rejection
The stability and control of Unmanned Aerial Vehicles (UAVs) in a turbulent environment is a matter of great concern. Devising a robust control algorithm to reject disturbances is challenging due to the highly nonlinear nature of wind dynamics, and modeling the dynamics using analytical techniques is not straightforward. While traditional techniques using disturbance observers and classical adaptive control have shown some progress, they are mostly limited to relatively non-complex environments. On the other hand, learning-based approaches are increasingly being used for modeling of residual forces and disturbance rejection; however, their generalization and interpretability remain a concern. To this end, we propose a novel integration of data-driven system identification using Sparse Identification of Non-Linear Dynamics (SINDy) with Recursive Least Squares (RLS) adaptive control to adapt to and reject wind disturbances in a turbulent environment. We tested and validated our approach in the Gazebo Harmonic environment and on real flights with wind speeds of up to 2 m/s from four directions, creating a highly dynamic and turbulent environment. Adaptive SINDy outperformed the baseline PID and INDI controllers on several trajectory tracking error metrics without crashing. A root mean square error (RMSE) of up to 12.2 cm and 17.6 cm, and a mean absolute error (MAE) of 13.7 cm and 10.5 cm were achieved on circular and lemniscate trajectories, respectively. The validation was performed on a very lightweight Crazyflie drone under a highly dynamic environment for complex trajectory tracking.
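The Recursive Least Squares adaptation named in the abstract above follows a standard update that can be written compactly. The sketch below is the textbook RLS recursion with a forgetting factor, not the authors' controller: treating the regressor `phi` as a vector of SINDy-style library features and the output `y` as a measured residual force is an assumption made for illustration, as are the function name and the toy data.

```python
def rls_update(theta, P, phi, y, forgetting=1.0):
    """One recursive least squares step for the linear model y ~ phi . theta.

    theta: current parameter estimate (list of floats)
    P:     current covariance-like matrix (list of lists, symmetric)
    phi:   regressor / feature vector for this sample
    y:     measured output for this sample
    """
    n = len(theta)
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = forgetting + sum(phi[i] * P_phi[i] for i in range(n))
    gain = [v / denom for v in P_phi]
    error = y - sum(phi[i] * theta[i] for i in range(n))  # prediction error
    new_theta = [theta[i] + gain[i] * error for i in range(n)]
    # P is symmetric, so phi^T P equals (P phi)^T.
    new_P = [
        [(P[i][j] - gain[i] * P_phi[j]) / forgetting for j in range(n)]
        for i in range(n)
    ]
    return new_theta, new_P


# Hypothetical noise-free data generated from y = 2*f1 - 3*f2.
theta = [0.0, 0.0]
P = [[1e4, 0.0], [0.0, 1e4]]  # large initial P means a weak prior on theta
for t in range(1, 21):
    phi = [float(t % 3 + 1), float(t % 5 - 2)]
    y = 2.0 * phi[0] - 3.0 * phi[1]
    theta, P = rls_update(theta, P, phi, y)
```

With a forgetting factor below one, older samples are discounted, which is what lets such an estimator track coefficients that drift as wind conditions change. The value to use is a tuning choice that the abstract does not specify.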
APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model
Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses parameter tuning but struggles with precise control in constrained spaces. To this end, recent robot learning approaches automate parameter tuning while retaining classical systems' safety, yet still face challenges in generalizing to unseen environments. Recently, Vision-Language-Action (VLA) models have shown promise by leveraging foundation models' scene understanding capabilities, but still struggle with precise control and inference latency in navigation tasks. In this paper, we propose Adaptive Planner Parameter Learning from Vision-Language-Action Model (APPLV). Unlike traditional VLA models that directly output actions, APPLV leverages pre-trained vision-language models with a regression head to predict planner parameters that configure classical planners. We develop two training strategies: supervised learning fine-tuning from collected navigation trajectories and reinforcement learning fine-tuning to further optimize navigation performance. We evaluate APPLV across multiple motion planners on the simulated Benchmark Autonomous Robot Navigation (BARN) dataset and in physical robot experiments. Results demonstrate that APPLV outperforms existing methods in both navigation performance and generalization to unseen environments.
SEP-NMPC: Safety Enhanced Passivity-Based Nonlinear Model Predictive Control for a UAV Slung Payload System ICRA 2026
Model Predictive Control (MPC) is widely adopted for agile multirotor vehicles, yet achieving both stability and obstacle-free flight is particularly challenging when a payload is suspended beneath the airframe. This paper introduces a Safety Enhanced Passivity-Based Nonlinear MPC (SEP-NMPC) that provides formal guarantees of stability and safety for a quadrotor transporting a slung payload through cluttered environments. Stability is enforced by embedding a strict passivity inequality, which is derived from a shaped energy storage function with adaptive damping, directly into the NMPC. This formulation dissipates excess energy and ensures asymptotic convergence despite payload swings. Safety is guaranteed through high-order control barrier functions (HOCBFs) that render user-defined clearance sets forward-invariant, obliging both the quadrotor and the swinging payload to maintain separation while interacting with static and dynamic obstacles. The optimization remains quadratic-program compatible and is solved online at each sampling time without gain scheduling or heuristic switching. Extensive simulations and real-world experiments confirm stable payload transport, collision-free trajectories, and real-time feasibility across all tested scenarios. The SEP-NMPC framework therefore unifies passivity-based closed-loop stability with HOCBF-based safety guarantees for UAV slung-payload transportation.
comment: Accepted at ICRA 2026
Predictive Control with Indirect Adaptive Laws for Payload Transportation by Quadrupedal Robots
This paper formally develops a novel hierarchical planning and control framework for robust payload transportation by quadrupedal robots, integrating a model predictive control (MPC) algorithm with a gradient-descent-based adaptive updating law. At the framework's high level, an indirect adaptive law estimates the unknown parameters of the reduced-order (template) locomotion model under varying payloads. These estimated parameters feed into an MPC algorithm for real-time trajectory planning, incorporating a convex stability criterion within the MPC constraints to ensure the stability of the template model's estimation error. The optimal reduced-order trajectories generated by the high-level adaptive MPC (AMPC) are then passed to a low-level nonlinear whole-body controller (WBC) for tracking. Extensive numerical investigations validate the framework's capabilities, showcasing the robot's proficiency in transporting unmodeled, unknown static payloads of up to 109% of its mass on flat terrains and 91% on rough terrains. The robot also successfully manages dynamic payloads with 73% of its mass on rough terrains. Performance comparisons with a normal MPC and an L1 MPC indicate a significant improvement. Furthermore, comprehensive hardware experiments conducted in indoor and outdoor environments confirm the method's efficacy on rough terrains despite uncertainties such as payload variations, push disturbances, and obstacles.
comment: 8 pages, 6 figures. Published in IEEE Robotics and Automation Letters
Impact of Different Failures on a Robot's Perceived Reliability ICRA 2026
Robots fail, potentially leading to a loss in the robot's perceived reliability (PR), a measure correlated with trustworthiness. In this study we examine how various kinds of failures affect the PR of the robot differently, and how this measure recovers without explicit social repair actions by the robot. In a preregistered and controlled online video study, participants were asked to predict a robot's success in a pick-and-place task. We examined manipulation failures (slips), freezing (lapses), and three types of incorrect picked objects or place goals (mistakes). Participants were shown one of 11 videos -- one of five types of failure, one of five types of failure followed by a successful execution in the same video, or a successful execution video. This was followed by two additional successful execution videos. Participants bet money either on the robot or on a coin toss after each video. People's betting patterns along with a qualitative analysis of their survey responses highlight that mistakes are less damaging to PR than slips or lapses, and some mistakes are even perceived as successes. We also see that successes immediately following a failure have the same effect on PR as successes without a preceding failure. Finally, we show that successful executions recover PR after a failure. Our findings highlight which robot failures are in higher need of repair in a human-robot interaction, and how trust could be recovered by robot successes.
comment: Accepted to ICRA 2026. 8 pages, 6 figures
HMR-1: Hierarchical Massage Robot with Vision-Language-Model for Embodied Healthcare
The rapid advancement of Embodied Intelligence has opened transformative opportunities in healthcare, particularly in physical therapy and rehabilitation. However, critical challenges remain in developing robust embodied healthcare solutions, such as the lack of standardized evaluation benchmarks and the scarcity of open-source multimodal acupoint massage datasets. To address these gaps, we construct MedMassage-12K, a multimodal dataset containing 12,190 images with 174,177 QA pairs, covering diverse lighting conditions and backgrounds. Furthermore, we propose a hierarchical embodied massage framework, which includes a high-level acupoint grounding module and a low-level control module. The high-level acupoint grounding module uses multimodal large language models to understand human language and identify acupoint locations, while the low-level control module provides the planned trajectory. Based on this, we evaluate existing MLLMs and establish a benchmark for embodied massage tasks. Additionally, we fine-tune the Qwen-VL model, demonstrating the framework's effectiveness. Physical experiments further confirm the practical applicability of the framework. Our dataset and code are publicly available at https://github.com/Xiaofeng-Han-Res/HMR-1.
Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams
Long-horizon task planning for heterogeneous multi-robot systems is essential for deploying collaborative teams in real-world environments; yet, it remains challenging due to the large volume of perceptual information, much of which is irrelevant to task objectives and burdens planning. Traditional symbolic planners rely on manually constructed problem specifications, limiting scalability and adaptability, while recent large language model (LLM)-based approaches often suffer from hallucinations and weak grounding (i.e., poor alignment between generated plans and actual environmental objects and constraints) in object-rich settings. We present Scale-Plan, a scalable LLM-assisted framework that generates compact, task-relevant problem representations from natural language instructions. Given a PDDL domain specification, Scale-Plan constructs an action graph capturing domain structure and uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. By filtering irrelevant information prior to planning, Scale-Plan enables efficient decomposition, allocation, and long-horizon plan generation. We evaluate our approach on complex multi-agent tasks and introduce MAT2-THOR, a cleaned benchmark built on AI2-THOR for reliable evaluation of multi-robot planning systems. Scale-Plan outperforms pure LLM and hybrid LLM-PDDL baselines across all metrics, improving scalability and reliability.
Age-Related Differences in the Perception of Eye-Gaze from a Social Robot
There is increasing interest in social robots assisting older adults during daily life tasks. In this context, non-verbal cues such as deictic gaze are important for natural communication in human-robot interaction. However, sensitivity to deictic gaze declines naturally with age, resulting in reduced social perception. Therefore, this work explores the benefits of deictic gaze from social robots assisting older adults during daily life tasks, and how age-related differences may influence their social perception in contrast to younger populations. This may help in the design of adaptive, age-related non-verbal cues in the Human-Robot Interaction context.
comment: This is the pre-print version. Final publication available at https://doi.org/10.1007/978-3-030-90525-5_30
DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy ICRA 2026
We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves 83.8% average success rate, compared to 13.8% for the pre-trained policy and 52.5% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/
comment: 11 pages. Published at ICRA 2026
From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
comment: A version of this paper appears in the official proceedings of RA-L, Volume 11, Issue 4
BEV-Patch-PF: Particle Filtering with BEV-Aerial Feature Matching for Off-Road Geo-Localization
We propose BEV-Patch-PF, a GPS-free sequential geo-localization system that integrates a particle filter with learned bird's-eye-view (BEV) and aerial feature maps. From onboard RGB and depth images, we construct a BEV feature map. For each 3-DoF particle pose hypothesis, we crop the corresponding patch from an aerial feature map computed from a local aerial image queried around the approximate location. BEV-Patch-PF computes a per-particle log-likelihood by matching the BEV feature to the aerial patch feature. On two real-world off-road datasets, our method achieves 9.7x lower absolute trajectory error (ATE) on seen routes and 6.6x lower ATE on unseen routes than a retrieval-based baseline, while maintaining accuracy under dense canopy and shadow. The system runs in real time at 10 Hz on an NVIDIA Tesla T4, enabling practical robot deployment.
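The per-particle log-likelihood described above has to be turned into particle weights before a pose estimate can be read out. The following is a minimal sketch of that generic particle-filter step, not the authors' implementation: the function names, the (x, y, yaw) tuple layout, and the use of a weighted mean as the pose estimate are all assumptions made for illustration. The log-sum-exp shift keeps the normalization stable when log-likelihoods are very negative.

```python
import math

def normalize_log_weights(log_likelihoods):
    """Convert per-particle log-likelihoods into weights that sum to 1.

    Subtracting the maximum before exponentiating (log-sum-exp trick)
    avoids underflow when all log-likelihoods are very negative.
    """
    m = max(log_likelihoods)
    exps = [math.exp(l - m) for l in log_likelihoods]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mean_pose(particles, weights):
    """Weighted mean of 3-DoF particles given as (x, y, yaw) tuples.

    Yaw is averaged on the unit circle so that angles near +pi and -pi
    do not cancel to zero.
    """
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    s = sum(w * math.sin(p[2]) for p, w in zip(particles, weights))
    c = sum(w * math.cos(p[2]) for p, w in zip(particles, weights))
    return x, y, math.atan2(s, c)

def effective_sample_size(weights):
    """Common degeneracy measure; a low value suggests resampling."""
    return 1.0 / sum(w * w for w in weights)
```

In a filter loop, the log-likelihoods would come from the BEV-to-aerial feature matching; here they are simply an input list, and when to resample based on the effective sample size is left to the caller.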
Task-Oriented Robot-Human Handovers on Legged Manipulators
Task-oriented handovers (TOH) are fundamental to effective human-robot collaboration, requiring robots to present objects in a way that supports the human's intended post-handover use. Existing approaches are typically based on object- or task-specific affordances, but their ability to generalize to novel scenarios is limited. To address this gap, we present AFT-Handover, a framework that integrates large language model (LLM)-driven affordance reasoning with efficient texture-based affordance transfer to achieve zero-shot, generalizable TOH. Given a novel object-task pair, the method retrieves a proxy exemplar from a database, establishes part-level correspondences via LLM reasoning, and texturizes affordances for feature-based point cloud transfer. We evaluate AFT-Handover across diverse task-object pairs, showing improved handover success rates and stronger generalization compared to baselines. In a comparative user study, our framework is significantly preferred over the current state-of-the-art, effectively reducing human regrasping before tool use. Finally, we demonstrate TOH on legged manipulators, highlighting the potential of our framework for real-world robot-human handovers.
comment: Accepted to 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI) 2026
LIVE-GS: Online LiDAR-Inertial-Visual State Estimation and Globally Consistent Mapping with 3D Gaussian Splatting
While 3D Gaussian Splatting (3DGS) has enabled photorealistic mapping, its integration into SLAM has largely followed traditional camera-centric pipelines. As a result, these systems inherit well-known weaknesses such as high computational load, failure in texture-poor or illumination-varying environments, and limited operational range, particularly for RGB-D setups. On the other hand, LiDAR emerges as a robust alternative, but its integration with 3DGS introduces new challenges, such as the need for tighter global alignment for photorealistic quality and prolonged optimization times caused by sparse data. To address these challenges, we propose LIVE-GS, an online LiDAR-Inertial-Visual SLAM framework that tightly couples 3D Gaussian Splatting with LiDAR-based surfels to ensure high-precision map consistency through global geometric optimization. In particular, to handle sparse data, our system employs a depth-invariant Gaussian initialization strategy for efficient representation and a bounded sigmoid constraint to prevent uncontrolled Gaussian growth. Experiments on public datasets and our own datasets demonstrate competitive performance in rendering quality and map-building efficiency compared with representative 3DGS SLAM baselines.
$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose $π$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $π$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a scalable solution promising for complex real-world applications.
Task Parameter Extrapolation via Learning Inverse Tasks from Forward Demonstrations
Generalizing skill policies to novel conditions remains a key challenge in robot learning. Imitation learning methods, while data-efficient, are largely confined to the training region and consistently fail on input data outside it, leading to unpredictable policy failures. Alternatively, transfer learning approaches offer methods for trajectory generation that are robust to changes in both environment and task, but they remain data-hungry and lack accuracy in zero-shot generalization. We address these challenges by framing the problem in the context of task inversion learning and proposing a novel joint learning approach to achieve accurate and efficient knowledge transfer. Our method constructs a common representation of the forward and inverse tasks, and leverages auxiliary forward demonstrations from novel configurations to successfully execute the corresponding inverse tasks, without any direct supervision. We show the extrapolation capabilities of our framework via ablation studies and experiments in simulated and real-world environments that require complex manipulation skills with a diverse set of objects and tools, where we outperform diffusion-based alternatives.
comment: Corrected author affiliation
CroSTAta: Cross-State Transition Attention Transformer for Robotic Manipulation
Learning robotic manipulation policies through supervised learning from demonstrations remains challenging when policies encounter execution variations not explicitly covered during training. While incorporating historical context through attention mechanisms can improve robustness, standard approaches process all past states in a sequence without explicitly modeling the temporal structure that demonstrations may include, such as failure and recovery patterns. We propose a Cross-State Transition Attention Transformer that employs a novel State Transition Attention (STA) mechanism to modulate standard attention weights based on learned state evolution patterns, enabling policies to better adapt their behavior based on execution history. Our approach combines this structured attention with temporal masking during training, where visual information is randomly removed from recent timesteps to encourage temporal reasoning from historical context. Evaluation in simulation shows that STA consistently outperforms standard attention approaches and temporal modeling methods such as TCN and LSTM networks, achieving more than 2x improvement over cross-attention on precision-critical tasks. The source code and data can be accessed at https://github.com/iit-DLSLab/croSTAta
comment: Code and data available at https://github.com/iit-DLSLab/croSTAta
EasyInsert: A Data-Efficient and Generalizable Insertion Policy
Robotic insertion is a highly challenging task that requires exceptional precision in cluttered environments. Existing methods often have poor generalization capabilities. They typically function in restricted and structured environments, and frequently fail when the plug and socket are far apart, when the scene is densely cluttered, or when handling novel objects. They also rely on strong assumptions such as access to CAD models or a digital twin in simulation. To address these limitations, we propose EasyInsert. Inspired by human intuition, it formulates insertion as a delta-pose regression problem, which unlocks an efficient, highly scalable data collection pipeline with minimal human labor to train an end-to-end visual policy. During execution, the visual policy predicts the relative pose between plug and socket to drive a multi-phase, coarse-to-fine insertion process. EasyInsert demonstrates strong zero-shot generalization capability for unseen objects in cluttered environments, robustly handling cases with significant initial pose deviations. In real-world experiments, by leveraging just 1 hour of human teleoperation data to bootstrap a large-scale automated data collection process, EasyInsert achieves an over 90% success rate in zero-shot insertion for 13 out of 15 unseen novel objects, including challenging objects like Type-C cables, HDMI cables, and Ethernet cables. Furthermore, requiring only a single manual reset, EasyInsert allows for fast adaptation to novel test objects through automated data collection and fine-tuning, achieving an over 90% success rate across all 15 objects.
Scalable Aerial GNSS Localization for Marine Robots
Accurate localization is crucial for water robotics, yet traditional onboard Global Navigation Satellite System (GNSS) approaches are difficult or ineffective due to signal reflection on the water's surface and the high cost of aquatic GNSS receivers. Existing approaches, such as inertial navigation, Doppler Velocity Logs (DVL), SLAM, and acoustic-based methods, face challenges like error accumulation and high computational complexity. Therefore, a more efficient and scalable solution remains necessary. This paper proposes an alternative approach that leverages an aerial drone equipped with GNSS localization to track and localize a marine robot once it is near the surface of the water. Our results show that this novel adaptation enables accurate single- and multi-robot marine localization.
comment: International Conference on Robotics and Automation 2025 Workshop Robots in the Wild
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
RoboLayout: Differentiable 3D Scene Generation for Embodied Agents
Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.
FreeTacMan: Robot-free Visuo-Tactile Data Collection System for Contact-rich Manipulation
Enabling robots to perform contact-rich manipulation remains a pivotal challenge in robot learning, which is substantially hindered by the data collection gap, including its inefficiency and limited sensor setup. While prior work has explored handheld paradigms, their rod-based mechanical structures remain rigid and unintuitive, providing limited tactile feedback and posing challenges for operators. Motivated by the dexterity and force feedback of human motion, we propose FreeTacMan, a human-centric and robot-free data collection system for accurate and efficient robot manipulation. Concretely, we design a wearable gripper with visuo-tactile sensors for data collection, which can be worn on human fingers for intuitive control. A high-precision optical tracking system is introduced to capture end-effector poses while synchronizing visual and tactile feedback. We leverage FreeTacMan to collect a large-scale multimodal dataset, comprising over 3000k paired visuo-tactile images with end-effector poses and 10k demonstration trajectories across 50 diverse contact-rich manipulation tasks. FreeTacMan achieves multiple improvements in data collection performance over prior works and enables effective policy learning from self-collected datasets. By open-sourcing the hardware and the dataset, we aim to facilitate reproducibility and support research in visuo-tactile manipulation.
Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, acquiring high-quality data through professional motion capture systems is constrained by cost, making it difficult to scale. To address these issues, we propose a closed-loop, iterative framework for automated motion data generation. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.
MetricNet: Recovering Metric Scale in Generative Navigation Policies ICRA'26
Generative navigation policies have made rapid progress in improving end-to-end learned navigation. Despite their promising results, this paradigm has two structural problems. First, the sampled trajectories exist in an abstract, unscaled space without metric grounding. Second, the control strategy discards the full path, instead moving directly towards a single waypoint. This leads to short-sighted and unsafe actions, moving the robot towards obstacles that a complete and correctly scaled path would circumvent. To address these issues, we propose MetricNet, an effective add-on for generative navigation that predicts the metric distance between waypoints, grounding policy outputs in metric coordinates. We evaluate our method in simulation with a new benchmarking framework and show that executing MetricNet-scaled waypoints significantly improves both navigation and exploration performance. Beyond simulation, we further validate our approach in real-world experiments. Finally, we propose MetricNav, which integrates MetricNet into a navigation policy to guide the robot away from obstacles while still moving towards the goal.
comment: Accepted to ICRA'26
Synchronized Online Friction Estimation and Adaptive Grasp Control for Robust Gentle Grasp
We introduce a unified framework for gentle robotic grasping that couples real-time friction estimation with adaptive grasp control. We propose a new particle filter-based method for real-time estimation of the friction coefficient using vision-based tactile sensors. This estimate is integrated into a reactive controller that dynamically modulates grasp force to maintain a stable grip. The two processes operate synchronously in a closed loop: the controller uses the current best estimate to adjust the force, while new tactile feedback from this action continuously refines the estimation. This creates a highly responsive and robust sensorimotor cycle. The reliability and efficiency of the complete framework are validated through extensive robotic experiments.
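To illustrate how a friction estimate can drive grasp force, here is a minimal sketch based only on the Coulomb friction condition (a contact slips when the tangential load exceeds the friction coefficient times the normal force). It is not the controller from the paper: the function name, the safety margin, and the force limits are hypothetical values chosen for illustration, and the tangential load is assumed to be the load carried by a single contact.

```python
def required_grip_force(tangential_load_n, mu_estimate, safety_margin=1.2,
                        f_min=0.5, f_max=20.0):
    """Normal force (N) needed to resist a tangential load without slipping.

    Coulomb friction gives the no-slip condition F_t <= mu * F_n, so the
    smallest admissible normal force is F_t / mu. The result is scaled by
    a safety margin and clamped to hypothetical actuator limits.
    """
    if mu_estimate <= 0.0:
        # No usable friction estimate: fall back to the firmest allowed grip.
        return f_max
    force = safety_margin * tangential_load_n / mu_estimate
    return min(max(force, f_min), f_max)
```

Called once per control cycle with the latest friction estimate, this yields a grip that tightens as the estimated friction drops and relaxes as it rises, which is the qualitative behavior the closed loop above describes.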
MCGS-SLAM: A Multi-Camera SLAM Framework Using Gaussian Splatting for High-Fidelity Mapping ICRA
Recent progress in dense SLAM has primarily targeted monocular setups, often at the expense of robustness and geometric coverage. We present MCGS-SLAM, the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines poses and depths via dense photometric and geometric residuals, while a scale consistency module enforces metric alignment across views using low-rank priors. The system supports RGB input and maintains real-time performance at large scale. Experiments on synthetic and real-world datasets show that MCGS-SLAM consistently yields accurate trajectories and photorealistic reconstructions, usually outperforming monocular baselines. Notably, the wide field of view from multi-camera input enables reconstruction of side-view regions that monocular setups miss, critical for safe autonomous operation. These results highlight the promise of multi-camera Gaussian Splatting SLAM for high-fidelity mapping in robotics and autonomous driving.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Input-to-State Stable Coupled Oscillator Networks for Closed-form Model-based Control in Latent Space NeurIPS 2024
Even though a variety of methods have been proposed in the literature, efficient and effective latent-space control (i.e., control in a learned low-dimensional space) of physical systems remains an open challenge. We argue that a promising avenue is to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system, i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, we demonstrate that CON reaches state-of-the-art (SoA) performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iii) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we show how these properties enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.
comment: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) spotlight, 50 pages
M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation
Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view, exploration, and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives with both close-range object details and global scene context to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 7 to 56 percent higher success rates and reduces collisions by 3 to 31 percent over baselines. Our approach demonstrates robust performance for smooth whole-body coordination, and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/m4diffuser.
comment: Project page: https://sites.google.com/view/m4diffuser, 10 pages, 9 figures
Pretraining in Actor-Critic Reinforcement Learning for Robot Locomotion
The pretraining-finetuning paradigm has facilitated numerous transformative advancements in artificial intelligence research in recent years. However, in the domain of reinforcement learning (RL) for robot locomotion, individual skills are often learned from scratch despite the high likelihood that some generalizable knowledge is shared across all task-specific policies belonging to the same robot embodiment. This work aims to define a paradigm for pretraining neural network models that encapsulate such knowledge and can subsequently serve as a basis for warm-starting the RL process in classic actor-critic algorithms, such as Proximal Policy Optimization (PPO). We begin with a task-agnostic exploration-based data collection algorithm to gather diverse, dynamic transition data, which is then used to train a Proprioceptive Inverse Dynamics Model (PIDM) through supervised learning. The pretrained weights are then loaded into both the actor and critic networks to warm-start the policy optimization of actual tasks. We systematically validated our proposed method with 9 distinct robot locomotion RL environments comprising 3 different robot embodiments, showing significant benefits of this initialization strategy. Our proposed approach on average improves sample efficiency by 36.9% and task performance by 7.3% compared to random initialization. We further present key ablation studies and empirical analyses that shed light on the mechanisms behind the effectiveness of this method.
ClearDepth: Enhanced Stereo Perception of Transparent Objects for Robotic Manipulation
Transparent object depth perception poses a challenge in everyday life and logistics, primarily due to the inability of standard 3D sensors to accurately capture depth on transparent or reflective surfaces. This limitation significantly affects applications that rely on depth maps and point clouds, especially in robotic manipulation. We developed a vision transformer-based algorithm for stereo depth recovery of transparent objects. This approach is complemented by an innovative feature post-fusion module, which enhances the accuracy of depth recovery by exploiting structural features in images. To address the high costs associated with dataset collection for stereo camera-based perception of transparent objects, our method incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation, accelerated by an AI algorithm. Our experimental results demonstrate the model's exceptional Sim2Real generalizability in real-world scenarios, enabling precise depth mapping of transparent objects to assist in robotic manipulation. Project details are available at https://sites.google.com/view/cleardepth/.
comment: 9 pages
NaviTrace: Evaluating Embodied Navigation of Vision-Language Models ICRA 2026
Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and it correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.
comment: 11 pages, 6 figures, with appendix, accepted to ICRA 2026
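Two ingredients of the trace score named in the NaviTrace abstract, Dynamic Time Warping distance and goal endpoint error, can be sketched for 2D traces as follows. This is the textbook DTW recurrence with Euclidean point cost, not the benchmark's scoring code: the function names are hypothetical, and how the two terms are weighted or combined with the semantic penalties is not stated in the abstract and is left out.

```python
import math

def dtw_distance(trace_a, trace_b):
    """Classic O(n*m) Dynamic Time Warping between two 2D point sequences.

    cost[i][j] holds the cheapest cumulative Euclidean cost of aligning
    the first i points of trace_a with the first j points of trace_b.
    """
    n, m = len(trace_a), len(trace_b)
    if n == 0 or m == 0:
        raise ValueError("traces must be non-empty")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(trace_a[i - 1], trace_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in trace_a only
                                 cost[i][j - 1],      # advance in trace_b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

def endpoint_error(predicted, reference):
    """Euclidean distance between the final points of two traces."""
    return math.dist(predicted[-1], reference[-1])
```

Because DTW allows one point to match several points in the other sequence, two traces that follow the same path at different sampling densities can still score a distance of zero, which is why it suits comparing model outputs against expert traces of varying length.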
Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning
Embodied agents need to plan and act reliably in real and complex 3D environments. Classical planning (e.g., PDDL) offers structure and guarantees, but in practice it fails under noisy perception and incorrect predicate grounding. On the other hand, Large Language Model (LLM)-based planners leverage commonsense reasoning, yet frequently propose actions that are infeasible or unsafe. Following recent works that combine the two approaches, we introduce ContextMatters, a framework that fuses LLMs and classical planning to perform hierarchical goal relaxation: the LLM helps ground symbols to the scene and, when the target is unreachable, it proposes functionally equivalent goals that progressively relax constraints, adapting the goal to the context of the agent's environment. Operating on 3D Scene Graphs, this mechanism turns many nominally infeasible tasks into tractable plans and enables context-aware partial achievement when full completion is not achievable. Our experimental results show a +52.45% Success Rate improvement over a state-of-the-art LLMs+PDDL baseline, demonstrating the effectiveness of our approach. Moreover, we validate the execution of ContextMatters in a real-world scenario by deploying it on a TIAGo robot. Code, dataset, and supplementary materials are available to the community at https://lab-rococo-sapienza.github.io/context-matters/.
Graph Neural Model Predictive Control for High-Dimensional Systems
The control of high-dimensional systems, such as soft robots, requires models that faithfully capture complex dynamics while remaining computationally tractable. This work presents a framework that integrates Graph Neural Network (GNN)-based dynamics models with structure-exploiting Model Predictive Control to enable real-time control of high-dimensional systems. By representing the system as a graph with localized interactions, the GNN preserves sparsity, while a tailored condensing algorithm eliminates state variables from the control problem, ensuring efficient computation. The complexity of our condensing algorithm scales linearly with the number of system nodes, and leverages Graphics Processing Unit (GPU) parallelization to achieve real-time performance. The proposed approach is validated in simulation and experimentally on a physical soft robotic trunk. Results show that our method scales to systems with up to 1,000 nodes at 100 Hz in closed-loop, and demonstrates real-time reference tracking on hardware with sub-centimeter accuracy, outperforming baselines by 63.6%. Finally, we show the capability of our method to achieve effective full-body obstacle avoidance.
Unsupervised Discovery of Failure Taxonomies from Deployment Logs
As robotic systems become increasingly integrated into real-world environments, ranging from autonomous vehicles to household assistants, they inevitably encounter diverse and unstructured scenarios that lead to failures. While such failures pose safety and reliability challenges, they also provide rich perceptual data for improving system robustness. However, manually analyzing large-scale failure datasets is impractical and does not scale. In this work, we introduce the problem of unsupervised discovery of failure taxonomies from large volumes of raw failure logs, aiming to obtain semantically coherent and actionable failure modes directly from perceptual trajectories. Our approach first infers structured failure explanations from multimodal inputs using vision-language reasoning, and then performs clustering in the resulting semantic reasoning space, enabling the discovery of recurring failure modes rather than isolated episode-level descriptions. We evaluate our method across robotic manipulation, indoor navigation, and autonomous driving domains, and demonstrate that the discovered taxonomies are consistent, interpretable, and practically useful. In particular, we show that structured failure taxonomies guide targeted data collection for offline policy refinement and enhance runtime failure monitoring systems. Website: https://mllm-failure-clustering.github.io/
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video ICLR 2026
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download at https://github.com/apple/ml-egodex.
comment: ICLR 2026
Autonomous UAV-Quadruped Docking in Complex Terrains via Active Posture Alignment and Constraint-Aware Control
Autonomous docking between Unmanned Aerial Vehicles (UAVs) and ground robots is essential for heterogeneous systems, yet most existing approaches target wheeled platforms whose limited mobility constrains exploration in complex terrains. Quadruped robots offer superior adaptability but undergo frequent posture variations, making it difficult to provide a stable landing surface for UAVs. To address these challenges, we propose an autonomous UAV-quadruped docking framework for GPS-denied environments. On the quadruped side, a Hybrid Internal Model with Horizontal Alignment (HIM-HA), learned via deep reinforcement learning, actively stabilizes the torso to provide a level platform. On the UAV side, a three-phase strategy is adopted, consisting of long-range acquisition with a median-filtered YOLOv8 detector, close-range tracking with a constraint-aware controller that integrates a Nonsingular Fast Terminal Sliding Mode Controller (NFTSMC) and a logarithmic Barrier Function (BF) to guarantee finite-time error convergence under field-of-view (FOV) constraints, and terminal descent guided by a Safety Period (SP) mechanism that jointly verifies tracking accuracy and platform stability. The proposed framework is validated in both simulation and real-world scenarios, successfully achieving docking on outdoor staircases higher than 17 cm and rough slopes steeper than 30 degrees. Supplementary materials and videos are available at: https://uav-quadruped-docking.github.io.
FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis
Due to the deformability of garments, generating a large amount of high-quality data for robotic garment manipulation tasks is highly challenging. In this paper, we present a synthetic garment dataset that can be used for robotic garment folding. We begin by constructing geometric garment templates based on keypoints and applying generative models to generate realistic texture patterns. Leveraging these keypoint annotations, we generate folding demonstrations in simulation and train folding policies via closed-loop imitation learning. To improve robustness, we propose KG-DAgger, which uses a keypoint-based strategy to generate demonstration data for recovering from failures. KG-DAgger significantly improves the model performance, boosting the real-world success rate by 25%. After training with 15K trajectories (about 2M image-action pairs), the model achieves a 75% success rate in the real world. Experiments in both simulation and real-world settings validate the effectiveness of our proposed framework.
comment: Project: https://pku-epic.github.io/FoldNet/
Beyond Collision Cones: Dynamic Obstacle Avoidance for Nonholonomic Robots via Dynamic Parabolic Control Barrier Functions ICRA
Control Barrier Functions (CBFs) are a powerful tool for ensuring the safety of autonomous systems, yet applying them to nonholonomic robots in cluttered, dynamic environments remains an open challenge. State-of-the-art methods often rely on collision-cone or velocity-obstacle constraints which, by only considering the angle of the relative velocity, are inherently conservative and can render the CBF-based quadratic program infeasible, particularly in dense scenarios. To address this issue, we propose a Dynamic Parabolic Control Barrier Function (DPCBF) that defines the safe set using a parabolic boundary. The parabola's vertex and curvature dynamically adapt based on both the distance to an obstacle and the magnitude of the relative velocity, creating a less restrictive safety constraint. We prove that the proposed DPCBF is valid for a kinematic bicycle model subject to input constraints. Extensive comparative simulations demonstrate that our DPCBF-based controller significantly enhances navigation success rates and QP feasibility compared to baseline methods. Our approach successfully navigates through dense environments with up to 100 dynamic obstacles, scenarios where collision cone-based methods fail due to infeasibility.
comment: The first two authors contributed equally to this work. 2026 IEEE International Conference on Robotics and Automation (ICRA). Project page: https://www.taekyung.me/dpcbf
AgenticLab: A Real-World Robot Agent Platform that Can See, Think, and Act
Recent advances in large vision-language models (VLMs) have demonstrated generalizable open-vocabulary perception and reasoning, yet their real-robot manipulation capability remains unclear for long-horizon, closed-loop execution in unstructured, in-the-wild environments. Prior VLM-based manipulation pipelines are difficult to compare across different research groups' setups, and many evaluations rely on simulation, privileged state, or specially designed setups. We present AgenticLab, a model-agnostic robot agent platform and benchmark for open-world manipulation. AgenticLab provides a closed-loop agent pipeline for perception, task decomposition, online verification, and replanning. Using AgenticLab, we benchmark state-of-the-art VLM-based agents on real-robot tasks in unstructured environments. Our benchmark reveals several failure modes that offline vision-language tests (e.g., VQA and static image understanding) fail to capture, including breakdowns in multi-step grounding consistency, object grounding under occlusion and scene changes, and insufficient spatial reasoning for reliable manipulation. We will release the full hardware and software stack to support reproducible evaluation and accelerate research on general-purpose robot agents.
comment: Added appendix
Event-Based Visual Teach-and-Repeat via Fast Fourier-Domain Cross-Correlation
Visual teach-and-repeat (VT&R) navigation enables robots to autonomously traverse previously demonstrated paths using visual feedback. We present a novel event-camera-based VT&R system. Our system formulates event-stream matching as frequency-domain cross-correlation, transforming spatial convolutions into efficient Fourier-space multiplications. By exploiting the binary structure of event frames and applying image compression techniques, we achieve a processing latency of just 2.88 ms, about 3.5 times faster than conventional camera-based baselines that are optimised for runtime efficiency. Experiments using a Prophesee EVK4 HD event camera mounted on an AgileX Scout Mini robot demonstrate successful autonomous navigation across 3000+ meters of indoor and outdoor trajectories in daytime and nighttime conditions. Our system maintains Cross-Track Errors (XTE) below 15 cm, demonstrating the practical viability of event-based perception for real-time VT&R navigation.
comment: 8 Pages, 5 Figures, Under Review
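The event-based VT&R abstract above describes matching as frequency-domain cross-correlation. As a rough illustration of that general idea (not the authors' pipeline, which also involves compression steps not detailed here), the sketch below estimates the circular pixel shift between two binary frames with NumPy's FFT routines. The names `fft_cross_correlation` and `estimate_shift` are hypothetical.

```python
import numpy as np

def fft_cross_correlation(reference, query):
    """Circular cross-correlation of two equally sized frames via the FFT.

    corr[k] = sum_n reference[n] * query[n + k], computed as
    IFFT(FFT(query) * conj(FFT(reference))).
    """
    ref = np.asarray(reference, dtype=float)
    qry = np.asarray(query, dtype=float)
    spectrum = np.fft.rfft2(qry) * np.conj(np.fft.rfft2(ref))
    return np.fft.irfft2(spectrum, s=ref.shape)

def estimate_shift(reference, query):
    """Return the signed (row, col) shift d such that query ~ roll(reference, d)."""
    corr = fft_cross_correlation(reference, query)
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Peaks past the midpoint correspond to negative (wrapped-around) shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, corr.shape))

# Usage: a sparse synthetic binary frame, shifted by a known offset.
rng = np.random.default_rng(0)
frame = (rng.random((32, 32)) < 0.1).astype(float)
shifted = np.roll(frame, shift=(3, -5), axis=(0, 1))
print(estimate_shift(frame, shifted))
```

The multiplication in Fourier space replaces an explicit sliding-window comparison, so the cost grows as O(N log N) in the number of pixels rather than O(N^2).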
MobiDock: Design and Control of A Modular Self Reconfigurable Bimanual Mobile Manipulator via Robotic Docking IROS2026
Multi-robot systems, particularly mobile manipulators, face challenges in control coordination and dynamic stability when working together. To address this issue, this study proposes MobiDock, a modular self-reconfigurable mobile manipulator system that allows two independent robots to physically connect and form a unified mobile bimanual platform. This process helps transform a complex multi-robot control problem into the management of a simpler, single system. The system utilizes an autonomous docking strategy based on computer vision with AprilTag markers and a new threaded screw-lock mechanism. Experimental results show that the docked configuration demonstrates better performance in dynamic stability and operational efficiency compared to two independently cooperating robots. Specifically, the unified system has lower Root Mean Square (RMS) Acceleration and Jerk values, higher angular precision, and completes tasks significantly faster. These findings confirm that physical reconfiguration is a powerful design principle that simplifies cooperative control, improving stability and performance for complex tasks in real-world environments.
comment: IROS2026 submitted
Utility Theory based Cognitive Modeling in the Application of Robotics: A Survey
Cognitive modeling, which explores the essence of cognition, including motivation, emotion, and perception, has been widely applied in artificial intelligence (AI) agent domains such as robotics. From a computational perspective, various cognitive functionalities have been developed through utility theory to provide a detailed, process-based understanding for specifying corresponding computational models of representations, mechanisms, and processes. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable cognitive model can guide agents in choosing reasonable strategies to meet their current needs and in learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to human society. This survey examines existing robotic systems for developmental cognitive models in the context of utility theory. We discuss the evolution of cognitive modeling in robotics from behavior-based robotics (BBR) and cognitive architectures to the properties of value systems in robots, such as the studies on motivations as artificial value systems, and the utility theory based cognitive modeling for generating and updating strategies in robotic interactions. Then, we examine the extent to which existing value systems support the application of robotics from an AI agent cognitive modeling perspective, including single-agent and multi-agent systems, trust among agents, and human-robot interaction. Finally, we survey the existing literature on current value systems in relevant fields and propose several promising research directions, along with some open problems that we deem necessary for further investigation.
Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition ICRA 2026
Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: http://xjh19971.github.io/QAA.
comment: 8 pages, 4 figures, accepted at ICRA 2026
IPPO Learns the Game, Not the Team: A Study on Generalization in Heterogeneous Agent Teams
Multi-Agent Reinforcement Learning (MARL) is commonly deployed in settings where agents are trained via self-play with homogeneous teammates, often using parameter sharing and a single policy architecture. This opens the question: to what extent do self-play PPO agents learn general coordination strategies grounded in the underlying game, compared to overfitting to their training partners' behaviors? This paper investigates the question using the Heterogeneous Multi-Agent Challenge (HeMAC) environment, which features distinct Observer and Drone agents with complementary capabilities. We introduce Rotating Policy Training (RPT), an approach that rotates heterogeneous teammate policies of different learning algorithms during training, to expose the agent to a broader range of partner strategies. When playing alongside a withheld teammate policy (DDQN), we find that RPT achieves similar performance to a standard self-play baseline, IPPO, where all agents were trained sharing a single PPO policy. This result indicates that in this heterogeneous multi-agent setting, the IPPO baseline generalizes to novel teammate algorithms despite not experiencing teammate diversity during training. This shows that a simple IPPO baseline may possess the level of generalization to novel teammates that a diverse training regimen was designed to achieve.
comment: 4 pages, 3 figures, appendix
Influence-Based Reward Modulation for Implicit Communication in Human-Robot Interaction
Communication is essential for successful interaction. In human-robot interaction, implicit communication holds the potential to enhance robots' understanding of human needs, emotions, and intentions. This paper introduces a method to foster implicit communication in HRI without explicitly modelling human intentions or relying on pre-existing knowledge. Leveraging Transfer Entropy, we modulate influence between agents in social interactions in scenarios involving either collaboration or competition. By integrating influence into agents' rewards within a partially observable Markov decision process, we demonstrate that boosting influence enhances collaboration and interaction, while resisting influence promotes social independence and diminishes performance in certain scenarios. Our findings are validated through simulations and real-world experiments with human participants in social navigation and autonomous driving settings.
comment: Preprint. 26 pages, 15 figures. Submitted to IEEE Transactions on Human-Robot Interaction (THRI). Accepted manuscript version
Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding
Contact-rich dexterous manipulation with multi-finger hands remains an open challenge in robotics because task success depends on multi-point contacts that continuously evolve and are highly sensitive to object geometry, frictional transitions, and slip. Recently, tactile-informed manipulation policies have shown promise. However, most use tactile signals as additional observations rather than modeling contact state or how their action outputs interact with low-level controller dynamics. We present Contact-Grounded Policy (CGP), a visuotactile policy that grounds multi-point contacts by predicting coupled trajectories of actual robot state and tactile feedback, and using a learned contact-consistency mapping to convert these predictions into executable target robot states for a compliance controller. CGP consists of two components: (i) a conditional diffusion model that forecasts future robot state and tactile feedback in a compressed latent space, and (ii) a learned contact-consistency mapping that converts the predicted robot state-tactile pair into executable targets for a compliance controller, enabling it to realize the intended contacts. We evaluate CGP using a physical four-finger Allegro V5 hand with Digit360 fingertip tactile sensors, and a simulated five-finger Tesollo DG-5F hand with dense whole-hand tactile arrays. Across a range of dexterous tasks including in-hand manipulation, delicate grasping, and tool use, CGP outperforms visuomotor and visuotactile diffusion-policy baselines.
HumanHalo - Safe and Efficient 3D Navigation Among Humans via Minimally Conservative MPC
Safe and efficient robotic navigation among humans is essential for integrating robots into everyday environments. Most existing approaches focus on simplified 2D crowd navigation and fail to account for the full complexity of human body dynamics beyond root motion. We present HumanMPC, a Model Predictive Control (MPC) framework for 3D Micro Air Vehicle (MAV) navigation among humans that combines theoretical safety guarantees with data-driven models for realistic human motion forecasting. Our approach introduces a novel twist to reachability-based safety formulation that constrains only the initial control input for safety while modeling its effects over the entire planning horizon, enabling safe yet efficient navigation. We validate HumanMPC both in simulated experiments using real human trajectories and in the real world, demonstrating its effectiveness across tasks ranging from goal-directed navigation to visual servoing for human tracking. While we apply our method to MAVs in this work, it is generic and can be adapted to other platforms. Our results show that the method ensures safety without excessive conservatism and outperforms baseline approaches in both efficiency and reliability.
M3CAD: Towards Generic Cooperative Autonomous Driving Benchmark ICRA 2026
We introduce M$^3$CAD, a comprehensive benchmark designed to advance research in generic cooperative autonomous driving. M$^3$CAD comprises 204 sequences with 30,000 frames. Each sequence includes data from multiple vehicles and different types of sensors, e.g., LiDAR point clouds, RGB images, and GPS/IMU, supporting a variety of autonomous driving tasks, including object detection and tracking, mapping, motion forecasting, occupancy prediction, and path planning. This rich multimodal setup enables M$^3$CAD to support both single-vehicle and multi-vehicle cooperative autonomous driving research. To the best of our knowledge, M$^3$CAD is the most complete benchmark specifically designed for cooperative, multi-task autonomous driving research. To test its effectiveness, we use M$^3$CAD to evaluate both state-of-the-art single-vehicle and cooperative driving solutions, setting baseline performance results. Since most existing cooperative perception methods focus on merging features but often ignore network bandwidth requirements, we propose a new multi-level fusion approach which adaptively balances communication efficiency and perception accuracy based on the current network conditions. We release M$^3$CAD, along with the baseline models and evaluation results, to support the development of robust cooperative autonomous driving systems. All resources will be made publicly available on https://github.com/zhumorui/M3CAD
comment: Accepted to ICRA 2026
Multi-Quadruped Cooperative Object Transport: Learning Decentralized Pinch-Lift-Move ICRA 2026
We study decentralized cooperative transport using teams of N quadruped robots with arms that must pinch, lift, and move ungraspable objects through physical contact alone. Unlike prior work that relies on rigid mechanical coupling between robots and objects, we address the more challenging setting where mechanically independent robots must coordinate through contact forces alone without any communication or centralized control. To this end, we employ a hierarchical policy architecture that separates base locomotion from arm control, and propose a constellation reward formulation that unifies position and orientation tracking to enforce rigid contact behavior. The key insight is encouraging robots to behave as if rigidly connected to the object through careful reward design and training curriculum rather than explicit mechanical constraints. Our approach enables coordination through shared policy parameters and implicit synchronization cues, scaling to arbitrary team sizes without retraining. We present extensive simulation experiments demonstrating robust transport across 2-10 robots on diverse object geometries and masses, along with sim2real transfer results on lightweight objects.
comment: Accepted to ICRA 2026. Project page: https://decplm.github.io
Open-World Task and Motion Planning via Vision-Language Model Generated Constraints
Foundation models like Vision-Language Models (VLMs) excel at common sense vision and language tasks such as visual question answering. However, they cannot yet directly solve complex, long-horizon robot manipulation problems requiring precise continuous reasoning. Task and Motion Planning (TAMP) systems can handle long-horizon reasoning through discrete-continuous hybrid search over parameterized skills, but rely on detailed environment models and cannot interpret novel human objectives, such as arbitrary natural language goals. We propose integrating VLMs into TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable open-world reasoning. Specifically, we use VLMs to generate discrete action ordering constraints that constrain TAMP search over action sequences, and continuous constraints in the form of code that augments traditional TAMP manipulation constraints. Experiments show that our approach, OWL-TAMP, outperforms baselines relying solely on TAMP or VLMs across several long-horizon manipulation tasks specified directly in natural language. We additionally demonstrate that OWL-TAMP can be deployed with an off-the-shelf TAMP system to solve challenging manipulation tasks on real-world hardware.
comment: A version of this paper appears in IEEE Robotics and Automation Letters (RA-L) Volume 11, Issue 3
Connectivity Maintenance and Recovery for Multi-Robot Motion Planning
Connectivity is crucial in many multi-robot applications, yet balancing between maintaining it and the fleet's traversability in obstacle-rich environments remains a challenge. Reactive controllers, such as control barrier functions, while providing connectivity guarantees, often struggle to traverse obstacle-rich environments due to deadlocks. We propose a real-time Bézier-based constrained motion planning algorithm, namely, MPC-CLF-CBF, that produces trajectory and control concurrently, under high-order control barrier function and control Lyapunov function conditions. Our motion planner significantly improves the navigation success rate of connected fleets in a cluttered workspace and recovers after inevitable connection loss by bypassing obstacles or from an initially disconnected fleet configuration. In addition, our predictive motion planner, owing to its Bézier curve solution, can easily provide continuous-time derivatives of arbitrary order, making it suitable for agile differentially flat systems, such as quadrotors. We validate the proposed algorithm through simulations and a physical experiment with 8 Crazyflie nano-quadrotors.
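The abstract above relies on the fact that derivatives of a Bézier curve are available in closed form. A minimal sketch of that standard property follows: the derivative of a degree-n Bézier curve with control points P_0..P_n is a degree-(n-1) Bézier curve with control points n(P_{i+1} - P_i). The function names are hypothetical, the curve parameter t is assumed to lie in [0, 1] (any time scaling is left out), and nothing here reflects the paper's planner itself.

```python
import numpy as np
from math import comb

def bezier_point(control_points, t):
    """Evaluate a Bézier curve at t in [0, 1] using the Bernstein basis."""
    p = np.asarray(control_points, dtype=float)
    n = len(p) - 1
    weights = np.array([comb(n, i) * t**i * (1 - t)**(n - i)
                        for i in range(n + 1)])
    return weights @ p

def bezier_derivative_control_points(control_points):
    """Control points of the derivative curve: n * (P[i+1] - P[i])."""
    p = np.asarray(control_points, dtype=float)
    n = len(p) - 1
    return n * (p[1:] - p[:-1])

# Usage: velocity and acceleration of a quadratic curve at its midpoint.
ctrl = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
vel_ctrl = bezier_derivative_control_points(ctrl)
acc_ctrl = bezier_derivative_control_points(vel_ctrl)
print(bezier_point(vel_ctrl, 0.5), bezier_point(acc_ctrl, 0.5))
```

Applying the same differencing repeatedly yields higher-order derivatives, which is what makes the representation convenient for differentially flat systems whose inputs depend on such derivatives.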
Magnetically Driven Elastic Microswimmers: Exploiting Hysteretic Collapse for Autonomous Propulsion and Independent Control
When swimming at low Reynolds numbers, inertial effects are negligible and reciprocal movements cannot induce net motion. Instead, symmetry breaking is necessary to achieve net propulsion. Directed swimming can be supported by magnetic fields, which simultaneously provide a versatile means of remote actuation. Thus, we analyze the motion of a straight microswimmer composed of three magnetizable beads connected by two elastic links. The swimming mechanism is based on oriented external magnetic fields that oscillate in magnitude. Through induced reversible hysteretic collapse of the two segments of the swimmer, the two pairs of beads jump into contact and separate nonreciprocally. Due to higher-order hydrodynamic interactions, net displacement results after each cycle. Different microswimmers can be tuned to different driving amplitudes and frequencies, allowing for simultaneous independent control by just one external magnetic field. The swimmer geometry and magnetic field shape are optimized for maximum swimming speed using an evolutionary optimization strategy. Thanks to the simple working principle, an experimental realization of such a microrobot seems feasible and may open new approaches for microinvasive medical interventions such as targeted drug delivery.
comment: 12 pages, 7 figures, submitted to ACS Nanoscience Au
CuriousBot: Interactive Mobile Exploration via Actionable 3D Relational Object Graph
Mobile exploration is a longstanding challenge in robotics, yet current methods primarily focus on active perception instead of active interaction, limiting the robot's ability to interact with and fully explore its environment. Existing robotic exploration approaches via active interaction are often restricted to tabletop scenes, neglecting the unique challenges posed by mobile exploration, such as large exploration spaces, complex action spaces, and diverse object relations. In this work, we introduce a 3D relational object graph that encodes diverse object relations and enables exploration through active interaction. We develop a system based on this representation and evaluate it across diverse scenes. Our qualitative and quantitative results demonstrate the system's effectiveness and generalization across object instances, relations, and scenes, outperforming methods solely relying on vision-language models (VLMs).
comment: Accepted to IEEE Robotics and Automation Letters (RA-L). Project Page: https://curiousbot.theaiinstitute.com/
Multiagent Systems
IronEngine: Towards General AI Assistant
This paper presents IronEngine, a general AI assistant platform organized around a unified orchestration core that connects a desktop user interface, REST and WebSocket APIs, Python clients, local and cloud model backends, persistent memory, task scheduling, reusable skills, 24-category tool execution, MCP-compatible extensibility, and hardware-facing integration. IronEngine introduces a three-phase pipeline -- Discussion (Planner--Reviewer collaboration), Model Switch (VRAM-aware transition), and Execution (tool-augmented action loop) -- that separates planning quality from execution capability. The system features a hierarchical memory architecture with multi-level consolidation, a vectorized skill repository backed by ChromaDB, an adaptive model management layer supporting 92 model profiles with VRAM-aware context budgeting, and an intelligent tool routing system with 130+ alias normalization and automatic error correction. We present experimental results on file operation benchmarks achieving 100% task completion with a mean total time of 1541 seconds across four heterogeneous tasks, and provide detailed comparisons with representative AI assistant systems including ChatGPT, Claude Desktop, Cursor, Windsurf, and open-source agent frameworks. Without disclosing proprietary prompts or core algorithms, this paper analyzes the platform's architectural decomposition, subsystem design, experimental performance, safety boundaries, and comparative engineering advantages. The resulting study positions IronEngine as a system-oriented foundation for general-purpose personal assistants, automation frameworks, and future human-centered agent platforms.
comment: Technical Report
Less is More: Robust Zero-Communication 3D Pursuit-Evasion via Representational Parsimony
Asymmetric 3D pursuit-evasion in cluttered voxel environments is difficult under communication latency, partial observability, and nonholonomic maneuver limits. While many MARL methods rely on richer inter-agent coupling or centralized signals, these dependencies can become fragility sources when communication is delayed or noisy. Building on an inherited path-guided decentralized pursuit scaffold, we study a robustness-oriented question: can representational parsimony improve communication-free coordination? We instantiate this principle with (i) a parsimonious actor observation interface that removes team-coupled channels (83-D to 50-D), and (ii) Contribution-Gated Credit Assignment (CGCA), a locality-aware credit structure for communication-denied cooperation. In Stage-5 evaluation (4 pursuers vs. 1 evader), our configuration reaches 0.753 +/- 0.091 success and 0.223 +/- 0.066 collision, outperforming the 83-D FULL OBS counterpart (0.721 +/- 0.071, 0.253 +/- 0.089). It further shows graceful degradation under speed/yaw/noise/delay stress tests and resilient zero-shot transfer on urban-canyon maps (about 61% success at density 0.24). These results support a practical paradigm shift: explicitly severing redundant cross-agent channels can suppress compounding error cascades and improve robustness in latency-prone deployment.
comment: 7 pages, 10 figures. This work has been submitted to the IEEE for possible publication
Modeling the Senegalese artisanal fisheries migrations
The North-West African coast is enriched by the Canary Current, which sustains a very productive marine ecosystem. The Senegalese artisanal fishing fleet, the largest in West Africa, benefits from this particularly productive ecosystem. It has survived through the ages with remarkable adaptability, and its great flexibility allows it to react quickly to changes, in particular by changing fishing gear and migrating. However, since the 1980s, increasing fishing effort has led to progressive fish depletion, increasing fishers' migration distances to access new fishing grounds. Since 2007, many fishers have even started to navigate to the Canary archipelago in order to find a more lucrative job in Europe, carrying emigration candidates in their canoes. This phenomenon has intensified since 2022 due to a new drop in fishery yields following the development of fishmeal factories along the coast, which amplified overfishing. Climate change may also impact fish habitat, and consequently the distribution of fishing grounds. The question addressed in this research was how climate change, fishing effort and socio-economic parameters interact and determine the artisanal fishery dynamics. An interdisciplinary approach allowed us to collect data and qualitative information on climate, fishing effort and socio-economic parameters. This served as a basis to build a multi-agent model of the mobility of Senegalese artisanal fishing. We implemented a first version of the model and presented some preliminary simulations with contrasting fishing effort and climate scenarios. The results suggested that, first, climate change should have only a slight impact on artisanal fishing, even in the most extreme climate scenario considered. Second, if fishing effort was maintained at current levels, we found a collapse of the fishery with massive fisher migrations, whatever the climate scenario. Third, with reduced fishing effort, a sustainable fishery equilibrium emerges in which Senegal's artisanal fishery catches ~250,000 tons of fish a year, mostly in Senegal, approaching the catch records of the 2000s. This sustainable equilibrium was maintained under the two climate change scenarios tested. Fishers' migrations provide clues to the state of fish populations and have implications for the sustainable exploitation of fishing resources. Senegalese artisanal fishers' migrations impact the regional distribution of the fishing effort and therefore must be taken into account in regional development and planning policies for this sector, particularly in a context of increasing infrastructure and spatial management measures (e.g. marine protected areas). This work lays the foundations of a computer simulation tool for decision support.
TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size CVPR 2026
Physics-based humanoid control has achieved remarkable progress in enabling realistic and high-performing single-agent behaviors, yet extending these capabilities to cooperative human-object interaction (HOI) remains challenging. We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. Each agent operates using local observations while attending to other teammates through a Transformer-based policy network with teammate tokens, allowing scalable coordination across variable team sizes. To enforce motion realism while addressing the scarcity of cooperative HOI data, we further introduce a masked Adversarial Motion Prior (AMP) strategy that uses single-human reference motions while masking object-interacting body parts during training. The masked regions are then guided through task rewards to produce diverse and physically plausible cooperative behaviors. We evaluate TeamHOI on a challenging cooperative carrying task involving two to eight humanoid agents and varied object geometries. Finally, to promote stable carrying, we design a team-size- and shape-agnostic formation reward. TeamHOI achieves high success rates and demonstrates coherent cooperation across diverse configurations with a single policy.
comment: CVPR 2026. Project page: https://splionar.github.io/TeamHOI/ Code: https://github.com/sail-sg/TeamHOI
LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems
As multi-agent AI systems grow in complexity, the protocols connecting them constrain their capabilities. Current protocols such as A2A and MCP do not expose model-level properties as first-class primitives, ignoring properties fundamental to effective delegation: model identity, reasoning profile, quality calibration, and cost characteristics. We present the LLM Delegate Protocol (LDP), an AI-native communication protocol introducing five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; (5) trust domains enforcing security boundaries at the protocol level. We implement LDP as a plugin for the JamJet agent runtime and evaluate against A2A and random baselines using local Ollama models and LLM-as-judge evaluation. Identity-aware routing achieves ~12x lower latency on easy tasks through delegate specialization, though it does not improve aggregate quality in our small delegate pool; semantic frame payloads reduce token count by 37% (p=0.031) with no observed quality loss; governed sessions eliminate 39% token overhead at 10 rounds; and noisy provenance degrades synthesis quality below the no-provenance baseline, arguing that confidence metadata is harmful without verification. Simulated analyses show architectural advantages in attack detection (96% vs. 6%) and failure recovery (100% vs. 35% completion). This paper contributes a protocol design, reference implementation, and initial evidence that AI-native protocol primitives enable more efficient and governable delegation.
comment: 16 pages, 9 figures, 8 tables, 4 appendices
Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams
Long-horizon task planning for heterogeneous multi-robot systems is essential for deploying collaborative teams in real-world environments; yet, it remains challenging due to the large volume of perceptual information, much of which is irrelevant to task objectives and burdens planning. Traditional symbolic planners rely on manually constructed problem specifications, limiting scalability and adaptability, while recent large language model (LLM)-based approaches often suffer from hallucinations and weak grounding (i.e., poor alignment between generated plans and actual environmental objects and constraints) in object-rich settings. We present Scale-Plan, a scalable LLM-assisted framework that generates compact, task-relevant problem representations from natural language instructions. Given a PDDL domain specification, Scale-Plan constructs an action graph capturing domain structure and uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. By filtering irrelevant information prior to planning, Scale-Plan enables efficient decomposition, allocation, and long-horizon plan generation. We evaluate our approach on complex multi-agent tasks and introduce MAT2-THOR, a cleaned benchmark built on AI2-THOR for reliable evaluation of multi-robot planning systems. Scale-Plan outperforms pure LLM and hybrid LLM-PDDL baselines across all metrics, improving scalability and reliability.
Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead
As LLM agents evolve into collaborative multi-agent systems, their memory requirements grow rapidly in complexity. This position paper frames multi-agent memory as a computer architecture problem. We distinguish shared and distributed memory paradigms, propose a three-layer memory hierarchy (I/O, cache, and memory), and identify two critical protocol gaps: cache sharing across agents and structured memory access control. We argue that the most pressing open challenge is multi-agent memory consistency. Our architectural framing provides a foundation for building reliable, scalable multi-agent systems.
CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate
When people reason about cause and effect, they often consider many competing "what if" scenarios before deciding which explanation fits best. Analogously, advanced language models capable of causal inference can consider multiple interventions and counterfactuals to judge the validity of causal claims. Crucially, this type of reasoning is less like a single calculation and more like an internal dialogue between alternative hypotheses. In this paper, we make this dialogue explicit through a dual-agent debate framework where one model provides a structured causal inference, and the other critically examines this reasoning for logical flaws. When disagreements arise, the agents attempt to persuade each other, challenging each other's logic and revising their conclusions until they converge on a mutually agreed answer. To take advantage of this deliberative process, we specifically use reasoning language models, whose strengths in both causal inference and adversarial debate remain under-explored relative to standard large language models. We evaluate our approach on the CLadder dataset, a benchmark linking natural language questions to formally defined causal graphs across all three rungs of Pearl's ladder of causation. With Qwen3 and DeepSeek-R1 as debater agents, we demonstrate that multi-agent debate improves DeepSeek-R1's overall accuracy in causal inference from 78.03% to 87.45%, with the counterfactual category specifically improving from 67.94% to 80.04% accuracy. Similarly, Qwen3's overall accuracy improves from 84.16% to 89.41%, and counterfactual questions from 71.53% to 80.35%, showing that even strong models can still benefit greatly from debate with weaker agents. Our results highlight the potential of reasoning models as building blocks for multi-agent systems in causal inference, and demonstrate the importance of diverse perspectives in causal problem-solving.
comment: 12 pages, 8 figures. Code available at https://github.com/finnvamosi/CRAwDAD
The Illusion of Collusion
Algorithmic agents are used in a variety of competitive decision-making settings, including pricing contexts that range from online retail to residential home rental. We study the emergence of algorithmic collusion when competing agents employ multi-armed bandit algorithms and competition is modeled as a repeated Prisoner's Dilemma game. Notably, agents in our setting perform online learning with no prior model of game structure and have no direct knowledge of competitor states or actions, thus they cannot learn strategies that depend on these factors. These context-free bandits nonetheless frequently learn seemingly collusive behavior, a phenomenon we term naive collusion. Our results reveal that whether naive collusion emerges depends starkly on the choice of behavior policy employed by bandit learners. The mechanism underpinning the emergence of collusive outcomes is synchronicity in agent action plays, where synchronicity captures how often agents play the same action. We show that in the long-run, naive algorithmic collusion never emerges when both agents use a broad class of persistently random algorithms, including the epsilon-greedy algorithm without epsilon decay, sometimes emerges when both agents use greedy-in-the-limit algorithms which feature randomness during exploration but are asymptotically deterministic, and always emerges when both agents use deterministic bandit learning algorithms like those in the well-known upper confidence bound (UCB) family. We highlight market and algorithmic conditions under which one can and cannot predict a priori whether collusion will occur. Our findings have several policy implications: preventing pricing algorithms from conditioning their actions on competitor prices may not preclude algorithmic collusion, symmetry in algorithms may increase collusion potential, and the emergence of algorithmic collusion is path dependent.
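The setting described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the payoff values, the tie-breaking rule, and the names `PAYOFF`, `EpsilonGreedyBandit`, and `synchronicity` are all assumptions introduced here. Each agent is a context-free epsilon-greedy bandit with no epsilon decay that never observes the other agent's action, and synchronicity is measured as the fraction of rounds in which both agents play the same action.

```python
import random

# Row player's payoff in a Prisoner's Dilemma; action 0 = cooperate, 1 = defect.
# These payoff numbers are illustrative assumptions, not taken from the paper.
PAYOFF = {(0, 0): 3.0, (0, 1): 0.0, (1, 0): 5.0, (1, 1): 1.0}


class EpsilonGreedyBandit:
    """Context-free two-armed bandit with a constant exploration rate."""

    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng
        self.counts = [0, 0]
        self.means = [0.0, 0.0]

    def act(self):
        # Explore uniformly at random with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(2)
        # Otherwise exploit; ties go to action 0 (an arbitrary choice).
        return 0 if self.means[0] >= self.means[1] else 1

    def update(self, action, reward):
        # Incremental sample-mean update for the played arm only.
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]


def synchronicity(rounds=5000, epsilon=0.1, seed=0):
    """Return the fraction of rounds in which both agents chose the same action."""
    rng = random.Random(seed)
    a = EpsilonGreedyBandit(epsilon, rng)
    b = EpsilonGreedyBandit(epsilon, rng)
    same = 0
    for _ in range(rounds):
        xa, xb = a.act(), b.act()
        # Each agent sees only its own reward, never the opponent's action.
        a.update(xa, PAYOFF[(xa, xb)])
        b.update(xb, PAYOFF[(xb, xa)])
        same += xa == xb
    return same / rounds
```

Calling `synchronicity(epsilon=0.1)` and varying `epsilon` or the seed gives a rough feel for how often two such learners act in lockstep; the sketch makes no claim about reproducing the paper's results.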
LatentMem: Customizing Latent Memory for Multi-Agent Systems
Large language model (LLM)-powered multi-agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi-agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi-agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role-aware customization, and (ii) information overload induced by excessively fine-grained memory entries. To address these limitations, we propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals through latent memories to the composer, encouraging it to produce compact and high-utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to 19.36% over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity: agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty: MAS are deployed without understanding whether there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.
comment: Preprint; Work in Progress
Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment
The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the imperative need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. By doing so, our work sets a new standard for evaluation, conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC), and encompass diverse training schemes including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches and different neural network architectures. Our work also proposes novel KPIs that tackle real world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improved control on memory dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.
Jagarin: A Three-Layer Architecture for Hibernating Personal Duty Agents on Mobile
Personal AI agents face a fundamental deployment paradox on mobile: persistent background execution drains battery and violates platform sandboxing policies, yet purely reactive agents miss time-sensitive obligations until the user remembers to ask. We present Jagarin, a three-layer architecture that resolves this paradox through structured hibernation and demand-driven wake. The first layer, DAWN (Duty-Aware Wake Network), is an on-device heuristic engine that computes a composite urgency score from four signals: duty-typed optimal action windows, user behavioral engagement prediction, opportunity cost of inaction, and cross-duty batch resonance. It uses adaptive per-user thresholds to decide when a sleeping agent should nudge or escalate. The second layer, ARIA (Agent Relay Identity Architecture), is a commercial email identity proxy that routes the full commercial inbox -- obligations, promotional offers, loyalty rewards, and platform updates -- to appropriate DAWN handlers by message category, eliminating cold-start and removing manual data entry. The third layer, ACE (Agent-Centric Exchange), is a protocol framework for direct machine-readable communication from institutions to personal agents, replacing human-targeted email as the canonical channel. Together, these three layers form a complete stack from institutional signal to on-device action, without persistent cloud state, continuous background execution, or privacy compromise. A working Flutter prototype is demonstrated on Android, combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation.
comment: 12 pages, 4 figures
Utility Theory based Cognitive Modeling in the Application of Robotics: A Survey
Cognitive modeling, which explores the essence of cognition, including motivation, emotion, and perception, has been widely applied in the artificial intelligence (AI) agent domains, such as robotics. From the computational perspective, various cognitive functionalities have been developed through utility theory to provide a detailed and process-based understanding for specifying corresponding computational models of representations, mechanisms, and processes. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable cognitive model can guide agents in choosing reasonable strategies to achieve their current needs and learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to the human society. This survey examines existing robotic systems for developmental cognitive models in the context of utility theory. We discuss the evolution of cognitive modeling in robotics from behavior-based robotics (BBR) and cognitive architectures to the properties of value systems in robots, such as the studies on motivations as artificial value systems, and the utility theory based cognitive modeling for generating and updating strategies in robotic interactions. Then, we examine the extent to which existing value systems support the application of robotics from an AI agent cognitive modeling perspective, including single-agent and multi-agent systems, trust among agents, and human-robot interaction. Finally, we survey the existing literature of current value systems in relevant fields and propose several promising research directions, along with some open problems that we deem necessary for further investigation.
Computing Evolutionarily Stable Strategies in Multiplayer Games
We present an algorithm for computing all evolutionarily stable strategies in nondegenerate normal-form games with three or more players.
comment: Reverting to original title after fixing Google scholar merge
Stochastic Self-Organization in Multi-Agent Systems ICLR 2026
Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM. However, this potential can only be realized when the collaboration mechanism between agents is optimized. Specifically, optimizing the communication structure between agents is critical for fruitful collaboration. Most existing approaches rely on fixed topologies, pretrained graph generators, optimization over edges, or employ external LLM judges, thereby adding to the complexity. In this work, we introduce a response-conditioned framework that adapts communication on-the-fly. Agents independently generate responses to the user query and assess peer contributions using an approximation of the Shapley value. A directed acyclic graph (DAG) is then constructed to regulate the propagation of the responses among agents, which ensures stable and efficient message transmission from high-contributing agents to others. This graph is dynamically updated based on the agent responses from the previous collaboration round. Since the proposed framework enables the self-organization of agents without additional supervision or training, we refer to it as SelfOrg. The SelfOrg framework goes beyond task- and query-level optimization and takes into account the stochastic nature of agent responses. Experiments with both strong and weak LLM backends demonstrate robust performance, with significant gains in the weak regime where prior methods collapse. We also theoretically show that multiple agents increase the chance of correctness and that the correct responses naturally dominate the information flow.
comment: Accepted to ICLR 2026
Computational Multi-Agents Society Experiments: Social Modeling Framework Based on Generative Agents
This paper introduces CMASE, a framework for Computational Multi-Agent Society Experiments that integrates generative agent-based modeling with virtual ethnographic methods to support researcher embedding, interactive participation, and mechanism-oriented intervention in virtual social environments. By transforming the simulation into a simulated ethnographic field, CMASE shifts the researcher from an external operator to an embedded participant. Specifically, the framework is designed to achieve three core capabilities: (1) enabling real-time human-computer interaction that allows researchers to dynamically embed themselves into the system to characterize complex social intervention processes; (2) reconstructing the generative logic of social phenomena by combining the rigor of computational experiments with the interpretative depth of traditional ethnography; and (3) providing a predictive foundation with causal explanatory power to make forward-looking judgments without sacrificing empirical accuracy. Experimental results show that CMASE can not only simulate complex phenomena, but also generate behavior trajectories consistent with both statistical patterns and mechanistic explanations. These findings demonstrate CMASE's methodological value for intervention modeling, highlighting its potential to advance interdisciplinary integration in the social sciences. The official code is available at: https://github.com/armihia/CMASE .
comment: 20 pages, 3 figures
Personalized Collaborative Learning with Affinity-Based Variance Reduction ICLR 2026
Multi-agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels -- gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of $\max\{n^{-1}, δ\}$, where $n$ is the number of agents and $δ\in[0,1]$ measures their heterogeneity. This affinity-based acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.
comment: Published as a conference paper at ICLR 2026
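The interpolation stated in the AffPCL abstract can be made concrete with a small helper. This is only a sketch of the stated factor $\max\{n^{-1}, δ\}$; the function name `affinity_factor` and the argument checks are assumptions added for illustration, and constants or logarithmic terms from the paper's actual bounds are not modeled.

```python
def affinity_factor(n: int, delta: float) -> float:
    """Return max(1/n, delta), the sample-complexity factor quoted in the abstract.

    n     -- number of collaborating agents (n >= 1)
    delta -- heterogeneity level in [0, 1]; 0 means identical agents
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return max(1.0 / n, delta)
```

With `delta = 0` the factor is `1 / n` (the linear speedup of the homogeneous case), and with `delta = 1` it is `1` (no gain over independent learning), matching the two endpoints the abstract describes.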
Algorithmic Collusion at Test Time: A Meta-game Design and Evaluation AAMAS 2026
The threat of algorithmic collusion, and whether it merits regulatory intervention, remains debated, as existing evaluations of its emergence often rely on long learning horizons, assumptions about counterparty rationality in adopting collusive strategies, and symmetry in hyperparameters and economic settings among players. To study collusion risk, we introduce a meta-game design for analyzing algorithmic behavior under test-time constraints. We model agents as possessing pretrained policies with distinct strategic characteristics (e.g., competitive, naively cooperative, or robustly collusive), and formulate the problem as selecting a meta-strategy that combines a pretrained, initial policy with an in-game adaptation rule. We seek to examine whether collusion can emerge under rational choices and how agents co-adapt toward cooperation or competition. To this end, we sample normal-form empirical games over meta-strategy profiles, compute relevant game statistics (e.g., payoffs against individuals and regret against an equilibrium mixture of opponents), and construct empirical best-response graphs to uncover strategic relationships. We evaluate reinforcement-learning, UCB, and LLM-based strategies in repeated pricing games under symmetric and asymmetric cost settings, and present findings on the feasibility of algorithmic collusion and the effectiveness of pricing strategies in practical "test-time" environments. The source code is available at: https://github.com/chailab-rutgers/CollusionMetagame.
comment: AAMAS 2026. 34 pages
Systems and Control (EESS)
Carbon-aware Market Participation for Building Energy Management Systems
Tackling climate change requires the rapid and deep decarbonization of electric power systems. While energy management systems (EMSs) play a central role in this transition, conventional EMSs focus mainly on economic efficiency and often overlook the environmental impact of operational decisions. To address this gap, this paper proposes a unified, real-time building-level carbon-aware EMS (CAEMS) capable of simultaneously co-optimizing grid imports, energy storage, and flexible demand within a single integrated framework. We formulate a mixed-integer linear program (MILP) model that directly integrates time-varying marginal carbon intensity signals into the EMS objective for coordinated participation in both day-ahead (DA) and real-time (RT) markets. To relax the unrealistic assumption of perfect foresight, we incorporate a model predictive control (MPC) extension driven by a Transformer-based forecaster that jointly predicts electricity prices and carbon intensity. The proposed CAEMS is validated using real-world data from the PJM electricity market. Simulation results demonstrate that modest carbon prices can achieve a significant 22.5% reduction in emissions with only a 1.7% increase in cost.
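To illustrate what folding a carbon intensity signal into an EMS objective can look like, the sketch below evaluates a carbon-aware cost for a fixed schedule of grid imports. It is an assumption-laden simplification, not the paper's MILP: storage, flexible demand, integer variables, and the DA/RT market split are omitted, and the name `carbon_aware_cost`, its parameters, and the units are hypothetical.

```python
from typing import Sequence


def carbon_aware_cost(
    grid_import_kwh: Sequence[float],
    price_per_kwh: Sequence[float],
    carbon_kg_per_kwh: Sequence[float],
    carbon_price_per_kg: float,
) -> float:
    """Energy cost plus priced emissions for a given import schedule.

    All three sequences are indexed by time step and must have equal length.
    """
    if not (len(grid_import_kwh) == len(price_per_kwh) == len(carbon_kg_per_kwh)):
        raise ValueError("all time series must have the same length")
    # Conventional economic term: sum over time of price times imported energy.
    energy_cost = sum(p * g for p, g in zip(price_per_kwh, grid_import_kwh))
    # Emissions term: time-varying marginal carbon intensity times imported energy.
    emissions_kg = sum(c * g for c, g in zip(carbon_kg_per_kwh, grid_import_kwh))
    return energy_cost + carbon_price_per_kg * emissions_kg
```

Setting `carbon_price_per_kg` to zero recovers a purely economic objective, so a single parameter controls the cost-versus-emissions trade-off in this simplified form.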
Reachability-based Temporal Logic Verification for Reliable LLM-guided Human-Autonomy Teaming
We propose a reachability-based framework for reliable LLM-guided human-autonomy teaming (HAT) using signal temporal logic (STL). In the proposed framework, an LLM is leveraged as a translator that converts natural language commands given by a human operator into corresponding STL specifications, or vice versa. An STL feasibility filter (SFF) is proposed to check the feasibility of the generated STL. The SFF first decomposes the complex and nested LLM translation into a set of simpler subformulas for parallelization and informative feedback generation. The reachability analysis method is then applied to verify whether each subformula is feasible for a target dynamical system: if it is feasible, mission planning proceeds; otherwise, it is rejected. The proposed SFF can identify infeasible subformulas, rather than simply providing a Boolean verification result for the whole STL formula, thereby helping the LLM generate feedback that asks the human to modify the command. Consequently, the proposed framework can allow more reliable HAT by enabling safe and informative communication between the human operator and the autonomous agent. Our experiments demonstrate that the proposed framework can successfully filter out infeasible subformulas and generate informative feedback based on such information.
Rethinking Strict Dissipativity for Economic MPC
Stability of economic model predictive control can be proven under the assumption that a strict dissipativity condition holds. This assumption has a clear interpretation in terms of the so-called rotated stage cost, which must have its minimum at the optimal steady state. However, contrary to dissipativity, for strict dissipativity the storage function cannot be immediately related to the value function of an optimal control problem formulated with the economic stage cost. We propose the novel concept of two-storage strict dissipativity, which requires two storage functions to satisfy dissipativity and be separated by a positive definite function. This new condition can be immediately related to optimal control by means of value functions and might be easier to verify than strict dissipativity. Furthermore, we prove that two-storage strict dissipativity is sufficient and necessary for asymptotic stability, it is related to strict dissipativity, and also to alternative approaches relying on the so-called cost-to-travel. Finally, we discuss commonly used and new terminal cost designs that guarantee asymptotic stability in the finite-horizon case.
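For readers unfamiliar with the terminology, the standard textbook form of strict dissipativity and the rotated stage cost is recalled below. The discrete-time setting $x^+ = f(x,u)$ and the symbols $\ell$, $\lambda$, $\alpha$, $(x_s, u_s)$ are notational assumptions chosen here; the paper's two-storage condition is not reproduced.

```latex
% Strict dissipativity w.r.t. the optimal steady state (x_s, u_s):
% there exist a storage function \lambda and \alpha \in \mathcal{K}_\infty with
\[
  \lambda\bigl(f(x,u)\bigr) - \lambda(x)
  \;\le\; \ell(x,u) - \ell(x_s,u_s) - \alpha\bigl(\lVert x - x_s\rVert\bigr)
  \quad \text{for all admissible } (x,u).
\]
% Rotated stage cost: strict dissipativity is equivalent to this cost being
% bounded below by \alpha, so it vanishes and is minimal at (x_s, u_s).
\[
  \bar{\ell}(x,u) \;=\; \ell(x,u) - \ell(x_s,u_s) + \lambda(x) - \lambda\bigl(f(x,u)\bigr)
  \;\ge\; \alpha\bigl(\lVert x - x_s\rVert\bigr).
\]
```

Dropping the $\alpha$ term gives plain dissipativity, which is the weaker property the abstract contrasts against.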
Input Dexterity and Output Negotiation in Feedback-Linearizable Nonlinear Systems
We introduce a task-relative taxonomy of actuator inputs for nonlinear systems within the input-output feedback-linearization framework. Given a flat output specifying the task, inputs are classified as essential, redundant, or dexterity: essential inputs are required for exact linearization, redundant inputs can be removed without effect, and dexterity inputs can be deactivated while preserving exact linearization of a reduced task. We show that a subset is dexterity if and only if, under a suitable dynamic prolongation, it can appear as additional output channels (flat-input complement) on a common validity set. Whenever a family of systems obtained by (de)activating dexterity inputs admits a common prolongation, the family can be interpreted as a single prolonged system endowed with different output selections. This enables a unified linearizing controller that negotiates between full and reduced tasks without transients on shared outputs under compatibility and dwell-time conditions. Simulations on a fully actuated aerial platform illustrate graceful task downgrades from six-dimensional pose tracking as lateral-force channels are deactivated.
Behavioral Generative Agents for Power Dispatch and Auction
This paper presents positive initial evidence that generative agents can relax the rigidity of traditional mathematical models for human decision-making in power dispatch and auction settings. We design two proof-of-concept energy experiments with generative agents powered by a large language model (LLM). First, we construct a home battery management testbed with stochastic electricity prices and blackout interventions, and benchmark LLM decisions against dynamic programming. By incorporating an in-context learning (ICL) module, we show that behavioral patterns discovered by a stronger reasoning model can be transferred to a smaller LLM via example-based prompting, leading agents to prioritize post-blackout energy reserves over short-term profit. Second, we study LLM agents in simultaneous ascending auctions (SAA) for power network access, comparing their behavior with an optimization benchmark, the straightforward bidding strategy. By designing ICL prompts with rule-based, myopic, and strategic objectives, we find that structured prompting combined with ICL enables LLM agents to both reproduce economically rational strategies and exhibit systematic behavioral deviations. Overall, these results suggest that LLM-powered agents provide a flexible and expressive testbed for modeling human decision-making in power system applications.
Multi-Mode Pinching-Antenna Systems: Mode Selection or Mode Combining?
This letter investigates multi-mode pinching antenna systems (PASS), where signals of multiple orthogonal modes can be transmitted within a dielectric waveguide and radiated by pinching antennas (PAs). This enables mode-domain multiplexing for efficient multi-user communications using a single waveguide. In particular, two operating protocols are proposed, namely mode selection and mode combining. Mode selection enforces each PA to predominantly radiate signal power of one single mode, while mode combining allows each PA to flexibly radiate power of multiple modes. Based on the two protocols, a sum rate maximization problem is formulated for multi-mode PASS-enabled multi-user downlink communications, where the transmit beamforming, PA positions, and PA propagation constants are jointly optimized. To address this rapidly oscillating and highly nonconvex problem, a particle swarm optimization (PSO) based Karush-Kuhn-Tucker (KKT)-parameterized beamforming (PSO-KPBF) algorithm is proposed. KKT-conditioned solutions are exploited to guide the swarm search, thus reducing the search space and achieving fast convergence. Numerical results demonstrate that: 1) Even using a simple uniform mode-combining design, the multi-mode PASS significantly outperforms conventional single-mode PASS and hybrid beamforming systems; and 2) Mode combining achieves high spectral efficiency, while mode selection approximates its performance with a lower hardware complexity. Code is released at https://github.com/xiaoxiaxusummer/multi_mode_pinching_antenna
comment: Submitted to IEEE. Code is available at https://github.com/xiaoxiaxusummer/multi_mode_pinching_antenna
Integrating Lagrangian Neural Networks into the Dyna Framework for Reinforcement Learning
Model-based reinforcement learning (MBRL) is sample-efficient but depends on the accuracy of the learned dynamics, which are often modeled using black-box methods that do not adhere to physical laws. Those methods tend to produce inaccurate predictions when presented with data that differ from the original training set. In this work, we employ Lagrangian neural networks (LNNs), which enforce an underlying Lagrangian structure to train the model within a Dyna-based MBRL framework. Furthermore, we train the LNN using stochastic gradient-based and state-estimation-based optimizers to learn the network's weights. The state-estimation-based method converges faster than the stochastic gradient-based method during neural network training. Simulation results are provided to illustrate the effectiveness of the proposed LNN-based Dyna framework for MBRL.
comment: 5 pages, 3 figures
Adaptive Entropy-Driven Sensor Selection in a Camera-LiDAR Particle Filter for Single-Vessel Tracking
Robust single-vessel tracking from fixed coastal platforms is hindered by modality-specific degradations: cameras suffer from illumination and visual clutter, while LiDAR performance drops with range and intermittent returns. We present a heterogeneous multi-sensor fusion particle-filter tracker that incorporates an information-gain (entropy-reduction) adaptive sensing policy to select the most informative configuration at each fusion time bin. The approach is validated in a real maritime deployment at the CMMI Smart Marina Testbed (Ayia Napa Marina, Cyprus), using a shore-mounted 3D LiDAR and an elevated fixed camera to track a rigid inflatable boat with onboard GNSS ground truth. We compare LiDAR-only, camera-only, all-sensors, and adaptive configurations. Results show LiDAR dominates near-field accuracy, the camera sustains longer-range coverage when LiDAR becomes unavailable, and the adaptive policy achieves a favorable accuracy-continuity trade-off by switching modalities based on information gain. By avoiding continuous multi-stream processing, the adaptive configuration provides a practical baseline for resilient and resource-aware maritime surveillance.
comment: 8 pages, 5 figures, submitted to FUSION 2026 conference proceedings
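The entropy-reduction idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' policy: it assumes the per-particle measurement likelihoods of each candidate configuration are already available, uses the Shannon entropy of the normalized particle weights as a stand-in for state uncertainty, and the names `select_configuration`, `lidar`, and `camera` are hypothetical.

```python
import numpy as np

def entropy(weights):
    """Shannon entropy of a normalized weight vector (zero weights contribute nothing)."""
    w = weights[weights > 0]
    return float(-np.sum(w * np.log(w)))

def posterior_weights(prior_w, likelihood):
    """Bayes update of particle weights with a per-particle likelihood."""
    w = prior_w * likelihood
    return w / w.sum()

def select_configuration(prior_w, likelihoods):
    """Pick the configuration whose update reduces weight entropy the most.

    likelihoods maps a configuration name (hypothetical labels) to an array
    holding the measurement likelihood of each particle.
    """
    h_prior = entropy(prior_w)
    gains = {
        name: h_prior - entropy(posterior_weights(prior_w, lik))
        for name, lik in likelihoods.items()
    }
    best = max(gains, key=gains.get)
    return best, gains

# Four equally weighted particles; the first likelihood is sharply peaked,
# the second is flat and therefore carries no information.
prior = np.full(4, 0.25)
best, gains = select_configuration(
    prior,
    {"lidar": np.array([1.0, 0.0, 0.0, 0.0]),
     "camera": np.array([1.0, 1.0, 1.0, 1.0])},
)
```

A flat likelihood leaves the weights unchanged and yields zero gain, so an uninformative modality is never preferred over an informative one under this criterion.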
IronEngine: Towards General AI Assistant
This paper presents IronEngine, a general AI assistant platform organized around a unified orchestration core that connects a desktop user interface, REST and WebSocket APIs, Python clients, local and cloud model backends, persistent memory, task scheduling, reusable skills, 24-category tool execution, MCP-compatible extensibility, and hardware-facing integration. IronEngine introduces a three-phase pipeline -- Discussion (Planner--Reviewer collaboration), Model Switch (VRAM-aware transition), and Execution (tool-augmented action loop) -- that separates planning quality from execution capability. The system features a hierarchical memory architecture with multi-level consolidation, a vectorized skill repository backed by ChromaDB, an adaptive model management layer supporting 92 model profiles with VRAM-aware context budgeting, and an intelligent tool routing system with 130+ alias normalization and automatic error correction. We present experimental results on file operation benchmarks achieving 100% task completion with a mean total time of 1541 seconds across four heterogeneous tasks, and provide detailed comparisons with representative AI assistant systems including ChatGPT, Claude Desktop, Cursor, Windsurf, and open-source agent frameworks. Without disclosing proprietary prompts or core algorithms, this paper analyzes the platform's architectural decomposition, subsystem design, experimental performance, safety boundaries, and comparative engineering advantages. The resulting study positions IronEngine as a system-oriented foundation for general-purpose personal assistants, automation frameworks, and future human-centered agent platforms.
comment: Technical Report
Eigenvalue Patterns and Participation Analysis of Symmetric Renewable Energy Power Systems
State-space analysis is widely employed for examining power system dynamics but faces challenges in large-scale power systems integrated with numerous inverter-based resources (IBRs), where the significant increase of system states complicates modal analysis. Notably, renewable energy power systems often consist of multiple homogeneous generation units. This uniformity, termed symmetry in this paper, can facilitate the system stability analysis. Eigenvalue patterns and participation factors in three types of symmetric renewable energy power systems are investigated, including ideally-, quasi-, and group-symmetric systems. An ideally-symmetric (quasi-symmetric) system comprises a group of identical (similar) subsystems connected to an external grid. A system containing multiple such groups is termed group-symmetric. In these symmetric systems, two types of modes are defined to characterize different interactions: inner-group modes, which describe the interactions among subsystems within a single group, and group-grid modes, which describe the interactions between the groups and the external grid. A new concept termed group participation factor is also proposed to extend the use of conventional participation factors for repeated and close modes. In addition, the invariance properties of the inner-group modes and group-grid modes are discussed. The findings provide insights for stability analysis and targeted optimization in power systems. Theoretical advances are validated through numerical results and electromagnetic transient (EMT) simulations on example power systems of varied types and scales.
comment: 17 pages, 15 figures
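As background for the abstract above, conventional participation factors can be computed from the left and right eigenvectors of the state matrix. The sketch below shows only that conventional definition for a system with distinct eigenvalues; the group participation factor proposed in the paper is not reproduced, and the 2x2 matrix `A` is an arbitrary illustrative example.

```python
import numpy as np

def participation_factors(A):
    """Return the eigenvalues of A and the participation matrix P.

    P[k, i] = right[k, i] * left[i, k] is the participation of state k in
    mode i, with left eigenvectors normalized so that left @ right = I.
    """
    eigvals, right = np.linalg.eig(A)   # columns are right eigenvectors
    left = np.linalg.inv(right)         # rows are the matching left eigenvectors
    P = right * left.T                  # element-wise product
    return eigvals, P

# Illustrative state matrix with two distinct real modes.
A = np.array([[-1.0, 0.5],
              [0.2, -3.0]])
eigvals, P = participation_factors(A)
```

With this normalization every row and every column of `P` sums to one. For repeated or very close eigenvalues the individual eigenvectors are not uniquely defined, which is the situation the abstract's group participation factor is meant to handle.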
The coordination between TSO and DSO in the context of energy transition - A review
The energy transition is under way in many countries, aiming to reduce dependence on fossil fuels and CO2 emissions. Besides its positive impacts on the environment, this transition brings technical challenges to system operators, such as the intricacies of energy system integration, reducing uncertainty, and incentivizing customers with advanced transaction models. The coordination between the Transmission system operator (TSO) and the Distribution system operator (DSO) is one of the most important aspects of overcoming these obstacles. This coordination enhances the utilization of flexibility from Distributed energy resources (DERs) by incentivizing the market parties with better willingness-to-pay schemes. This paper provides an overview of the coordination schemes (CS), their classification, an assessment of the current situation, and the challenges associated with applying these schemes in a practical context. The main purpose is to investigate the most effective way for TSO/DSOs to use the flexibility resource to maintain the balance of the entire system while ensuring no congestion occurs in the network. A broad range of possible coordination schemes, along with the exploitation of flexibility services, is presented, and the pros and cons are analyzed. Additionally, the study presents a general scenario that describes the interaction between the operators and the third party in providing service to the balancing market, considering cases with and without coordination.
comment: Published in: 2024 59th International Universities Power Engineering Conference (UPEC)
Adaptive Tracking Control of Euler-Lagrange Systems with Time-Varying State and Input Constraints
This paper presents an adaptive control framework for Euler-Lagrange (E-L) systems that enforces user-defined time-varying state and input constraints in the presence of parametric uncertainties and bounded disturbances. The proposed design integrates a time-varying barrier Lyapunov Function (TVBLF) with a saturated control law to guarantee constraint satisfaction without resorting to real-time optimization. A key contribution is the development of an offline, verifiable feasibility condition that certifies the existence of a feasible control policy for any prescribed pair of time-varying state and input envelopes. Additionally, we prove boundedness of all closed-loop signals. Real-time experiments conducted on a 2-DoF helicopter model validate the efficacy and practical viability of the proposed method.
PolyFormer: learning efficient reformulations for scalable optimization under complex physical constraints
Real-world optimization problems are often constrained by complex physical laws that limit computational scalability. These constraints are inherently tied to complex regions, and thus learning models that incorporate physical and geometric knowledge, i.e., physics-informed machine learning (PIML), offer a promising pathway for efficient solution. Here, we introduce PolyFormer, which opens a new direction for PIML in prescriptive optimization tasks, where physical and geometric knowledge is not merely used to regularize learning models, but to simplify the problems themselves. PolyFormer captures geometric structures behind constraints and transforms them into efficient polytopic reformulations, thereby decoupling problem complexity from solution difficulty and enabling off-the-shelf optimization solvers to efficiently produce feasible solutions with acceptable optimality loss. Through evaluations across three important problems (large-scale resource aggregation, network-constrained optimization, and optimization under uncertainty), PolyFormer achieves computational speedups up to 6,400-fold and memory reductions up to 99.87%, while maintaining solution quality competitive with or superior to state-of-the-art methods. These results demonstrate that PolyFormer provides an efficient and reliable solution for scalable constrained optimization, expanding the scope of PIML to prescriptive tasks in scientific discovery and engineering applications.
comment: Code availability: All the data and code are made openly available at https://github.com/wenyl16/PolyFormer
Coupling Europe's Capacity Markets
European Member States are increasingly introducing national capacity mechanisms (CMs) to manage growing adequacy risks. However, isolated national CMs are inefficient in highly interconnected electricity systems, such as the European system. While progress has been made in facilitating cross-border participation by generation capacity in CMs, existing arrangements are prone to under- or over-investment and do not properly value the contribution of interconnection capacity to Member States' adequacy targets. In this paper, we propose a novel conceptual design for a coupled European capacity market that utilises the logic of flow-based market coupling. In a comparative analysis of different market design scenarios in an illustrative multi-zone case study, using a bespoke long-run equilibrium problem, we show that the proposed flow-based coupling of capacity markets reduces system costs by harnessing available capacity in neighbouring market zones while ensuring deliverability with respect to network constraints in all scarcity situations.
Augmented Model Predictive Control: A Balance between Satellite Agility and Computation Complexity
Agile earth observation satellites employ multiple actuators to enable flexible and responsive imaging capabilities. While significant advancements in actuator technology have enhanced satellites' torque and momentum, relatively little attention has been given to control strategies specifically tailored to improve satellite agility. This paper provides a comparative analysis of different Model Predictive Control (MPC) formulations and introduces an augmented-MPC method that effectively balances agility requirements with hardware implementation constraints. The proposed method achieves the high-performance characteristics of nonlinear MPC while preserving the computational simplicity of linear MPC. Numerical simulations and physical experiments are conducted to validate the effectiveness and feasibility of the proposed approach.
comment: European Control Conference 2026
Trajectory Tracking Control Design for Autonomous Helicopters with Guaranteed Error Bounds
This paper presents a systematic framework for computing formally guaranteed trajectory tracking error bounds for autonomous helicopters based on Robust Positive Invariant (RPI) sets. The approach focuses on establishing a closed-loop translational error dynamics which is cast into polytopic linear parameter-varying form with bounded additive and state-dependent disturbances. Ellipsoidal RPI sets are computed, yielding explicit position error bounds suitable as certified buffer zones in upper-level trajectory planning. Three controller architectures are compared with respect to the conservatism of their error bounds and tracking performance. Simulation results on a nonlinear helicopter model demonstrate that all architectures respect the derived bounds, while highlighting trade-offs between dynamical fidelity and conservatism in invariant set computation.
comment: Submitted to the 2026 International Conference on Unmanned Aircraft Systems (ICUAS)
Distributed Coordination Algorithms with Efficient Communication for Open Multi-Agent Systems with Dynamic Communication Links and Processing Delays
In this paper we focus on the distributed quantized average consensus problem in open multi-agent systems consisting of dynamic directed communication links among active nodes. We propose three communication-efficient distributed algorithms designed for different scenarios. Our first algorithm solves the quantized averaging problem over the currently active node set under finite network openness (i.e., when the active set eventually stabilizes). Our second algorithm extends the aforementioned approach for the case where nodes suffer from arbitrary bounded processing delays. Our third algorithm operates over indefinitely open multi-agent networks with dynamic communication links (i.e., with continuous node arrivals and departures), computing the average that incorporates both active and historically active nodes. We analyze our algorithms' operation, establish their correctness, and present novel necessary and sufficient topological conditions ensuring their finite-time convergence. Numerical simulations on distributed sensor fusion for environmental monitoring demonstrate fast finite-time convergence and robustness across varying network sizes, departure/arrival rates, and processing delays. Finally, it is shown that our proposed algorithms compare favorably to algorithms in the existing literature.
Aero-Promptness: Drag-Aware Aerodynamic Manipulability for Propeller-driven Vehicles
This work introduces the Drag-Aware Aerodynamic Manipulability (DAAM), a geometric framework for control allocation in redundant multirotors. By equipping the propeller spin-rate space with a Riemannian metric based on the remaining symmetric acceleration capacity of each motor, the formulation explicitly accounts for motor torque limits and aerodynamic drag. Mapping this metric through the nonlinear thrust law to the generalized force space yields a state-dependent manipulability volume. The log-determinant of this volume acts as a natural barrier function, strictly penalizing drag-induced saturation and low-spin thrust loss. Optimizing this volume along the allocation fibers provides a redundancy resolution strategy inherently invariant to arbitrary coordinate scaling in the generalized-force space. Analytically, we prove that the resulting optimal allocations locally form smooth embedded manifolds, and we geometrically characterize the global jump discontinuities that inevitably arise from physical actuator limits and spin-rate sign transitions.
Model-Free DRL Control for Power Inverters: From Policy Learning to Real-Time Implementation via Knowledge Distillation
In response to the trade-off between control performance and computational burden hindering the deployment of Deep Reinforcement Learning (DRL) in power inverters, this paper presents a novel model-free control framework leveraging policy distillation. To handle the convergence instability and steady-state errors inherent in model-free agents, an error energy-guided hybrid reward mechanism is established to theoretically constrain the exploration space. More specifically, an adaptive importance weighting mechanism is integrated into the distillation architecture to amplify the significance of fluctuation regions, ensuring high-quality transfer of transient control logic by mitigating the observational bias dominated by steady-state data. This approach efficiently compresses the heavy DRL policy into a lightweight neural network, retaining the desired control performance while overcoming the computational bottleneck during deployment. The proposed method is validated through a hardware-based kilowatt-level experimental platform. Experimental comparison results with traditional methods demonstrate that the proposed technique reduces inference time to the microsecond level and achieves superior transient response speed and parameter robustness.
comment: 10 pages, 6 figures, 8 tables, IEEE journal submission. This work proposes a model-free deep reinforcement learning control framework for voltage source inverters, integrating Lyapunov-based reward design and adaptive weighted policy distillation for lightweight real-time implementation, validated by simulation and kilowatt-level hardware experiments
Robust control synthesis for uncertain linear systems with input saturation using mixed IQCs
This paper develops a robust control synthesis method for uncertain linear systems with input saturation in the framework of integral quadratic constraints (IQCs). The system is reformulated as a linear fractional representation (LFR) that captures both dead-zone nonlinearity and time-varying uncertainties. By combining mixed IQC-based dissipation inequalities with quadratic Lyapunov functions, sufficient conditions for robust stabilization are established. Compared with conventional approaches based on a single static sector condition for the dead-zone nonlinearity, the proposed method yields improved $\mathcal{L}_2$-gain performance through the use of scaled mixed IQCs. For systems subject to time-varying structured uncertainties, a new scaled bounded real lemma is further developed based on the IQC characterization. The resulting $\mathcal{H}_\infty$ synthesis conditions are expressed as linear matrix inequalities (LMIs), which are numerically tractable in all decision variables, including the scaling factors in the IQC multipliers. The proposed method is validated using a second-order uncertain system in linear fractional form, and its superiority over an anti-windup design is further illustrated by a cart-pendulum example.
Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations
Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
The FABRIC Strategy for Verifying Neural Feedback Systems
Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neural feedback systems, i.e., dynamical systems controlled by neural networks, and a number of directions have been proposed and studied. In contrast, far less attention has been given to backward reachability analysis for these systems, in part because of the limited scalability of known techniques. In this work, we begin to address this gap by introducing new algorithms for computing both over- and underapproximations of backward reachable sets for nonlinear neural feedback systems. We also describe and implement an integration of these backward reachability techniques with existing ones for forward analysis. We call the resulting algorithm Forward and Backward Reachability Integration for Certification (FaBRIC). We evaluate our algorithms on a representative set of benchmarks and show that they significantly outperform the prior state of the art.
Formation-Aware Adaptive Conformalized Perception for Safe Leader-Follower Multi-Robot Systems
This paper considers the perception safety problem in distributed vision-based leader-follower formations, where each robot uses onboard perception to estimate relative states, track desired setpoints, and keep the leader within its camera field of view (FOV). Safety is challenging due to heteroscedastic perception errors and the coupling between formation maneuvers and visibility constraints. We propose a distributed, formation-aware adaptive conformal prediction method based on Risk-Aware Mondrian CP to produce formation-conditioned uncertainty quantiles. The resulting bounds tighten in high-risk configurations (near FOV limits) and relax in safer regions. We integrate these bounds into a Formation-Aware Conformal CBF-QP with a smooth margin to enforce visibility while maintaining feasibility and tracking performance. Gazebo simulations show improved formation success rates and tracking accuracy over non-adaptive (global) CP baselines that ignore formation-dependent visibility risk, while preserving finite-sample probabilistic safety guarantees. The experimental videos are available on the project website: https://nail-uh.github.io/iros2026.github.io/.
comment: 8 pages, 8 figures
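The formation-conditioned quantiles in the abstract above build on Mondrian (group-conditional) conformal prediction. The sketch below shows only the generic split-conformal version of that idea, with one quantile per group; the paper's risk-aware variant and its grouping rule are not reproduced, and the group labels `near_fov_limit` and `centered` are hypothetical.

```python
import math
from collections import defaultdict

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th
    smallest calibration score, or +inf when n is too small for that rank."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]

def mondrian_quantiles(scores, groups, alpha):
    """Compute one conformal quantile per group from calibration data."""
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {g: conformal_quantile(v, alpha) for g, v in by_group.items()}

# Calibration errors tagged with a hypothetical formation-dependent group.
q = mondrian_quantiles(
    scores=[0.4, 0.1, 0.3, 0.2, 0.9],
    groups=["centered", "centered", "centered", "centered", "near_fov_limit"],
    alpha=0.25,
)
```

Calibrating each group separately gives coverage conditional on the group rather than only on average. A group with too few calibration points receives an infinite bound, which is the conservative default of the finite-sample rule.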
Optimizing Reinforcement Learning Training over Digital Twin Enabled Multi-fidelity Networks
In this paper, we investigate a novel digital network twin (DNT) assisted deep learning (DL) model training framework. In particular, we consider a physical network where a base station (BS) uses several antennas to serve multiple mobile users, and a DNT that is a virtual representation of the physical network. The BS must adjust its antenna tilt angles to optimize the data rates of all users. Due to user mobility, the BS may not be able to accurately track network dynamics such as wireless channels and user mobilities. Hence, a reinforcement learning (RL) approach is used to dynamically adjust the antenna tilt angles. To train the RL, we can use data collected from the physical network and the DNT. The data collected from the physical network is more accurate but incurs more communication overhead compared to the data collected from the DNT. Therefore, it is necessary to determine the ratio of data collected from the physical network and the DNT to improve the training of the RL model. We formulate this problem as an optimization problem whose goal is to jointly optimize the tilt angle adjustment policy and the data collection strategy, aiming to maximize the data rates of all users while constraining the time delay introduced by collecting data from the physical network. To solve this problem, we propose a hierarchical RL framework that integrates robust adversarial loss and proximal policy optimization (PPO). Simulation results show that our proposed method reduces the physical network data collection delay by up to 28.01% and 1x compared to a hierarchical RL that uses vanilla PPO as the first level RL, and the baseline that uses robust-RL at the first level and selects the data collection ratio randomly.
Feedback Does Not Increase the Capacity of Approximately Memoryless Surjective POST Channels
We study a class of finite-state channels, known as POST channels, in which the previous channel output serves as the current state. A POST channel is deemed approximately memoryless when the state-dependent transition matrices are sufficiently close to one another. For this family of channels, under a surjectivity condition on the associated memoryless reference channel, we show that the feedback capacity coincides with the non-feedback capacity. Consequently, for almost all approximately memoryless POST channels whose input alphabet size is no smaller than the output alphabet size, feedback provides no capacity gain. This result extends Shannon's classical theorem on discrete memoryless channels and demonstrates that the phenomenon holds well beyond the strictly memoryless case.
SEP-NMPC: Safety Enhanced Passivity-Based Nonlinear Model Predictive Control for a UAV Slung Payload System
Model Predictive Control (MPC) is widely adopted for agile multirotor vehicles, yet achieving both stability and obstacle-free flight is particularly challenging when a payload is suspended beneath the airframe. This paper introduces a Safety Enhanced Passivity-Based Nonlinear MPC (SEP-NMPC) that provides formal guarantees of stability and safety for a quadrotor transporting a slung payload through cluttered environments. Stability is enforced by embedding a strict passivity inequality, which is derived from a shaped energy storage function with adaptive damping, directly into the NMPC. This formulation dissipates excess energy and ensures asymptotic convergence despite payload swings. Safety is guaranteed through high-order control barrier functions (HOCBFs) that render user-defined clearance sets forward-invariant, obliging both the quadrotor and the swinging payload to maintain separation while interacting with static and dynamic obstacles. The optimization remains quadratic-program compatible and is solved online at each sampling time without gain scheduling or heuristic switching. Extensive simulations and real-world experiments confirm stable payload transport, collision-free trajectories, and real-time feasibility across all tested scenarios. The SEP-NMPC framework therefore unifies passivity-based closed-loop stability with HOCBF-based safety guarantees for UAV slung-payload transportation.
comment: Accepted at ICRA 2026
Predictive Control with Indirect Adaptive Laws for Payload Transportation by Quadrupedal Robots
This paper formally develops a novel hierarchical planning and control framework for robust payload transportation by quadrupedal robots, integrating a model predictive control (MPC) algorithm with a gradient-descent-based adaptive updating law. At the framework's high level, an indirect adaptive law estimates the unknown parameters of the reduced-order (template) locomotion model under varying payloads. These estimated parameters feed into an MPC algorithm for real-time trajectory planning, incorporating a convex stability criterion within the MPC constraints to ensure the stability of the template model's estimation error. The optimal reduced-order trajectories generated by the high-level adaptive MPC (AMPC) are then passed to a low-level nonlinear whole-body controller (WBC) for tracking. Extensive numerical investigations validate the framework's capabilities, showcasing the robot's proficiency in transporting unmodeled, unknown static payloads up to 109% in experiments on flat terrains and 91% on rough experimental terrains. The robot also successfully manages dynamic payloads with 73% of its mass on rough terrains. Performance comparisons with a normal MPC and an L1 MPC indicate a significant improvement. Furthermore, comprehensive hardware experiments conducted in indoor and outdoor environments confirm the method's efficacy on rough terrains despite uncertainties such as payload variations, push disturbances, and obstacles.
comment: 8 pages, 6 figures. Published in IEEE Robotics and Automation Letters
Neural Network Tuning of FSMPC for Drives
This preprint presents a neural network tuner for the finite-state model predictive control (FSMPC) of an induction motor. The tuner sets the controller parameters in the speed loop and in the stator current loop. The results are assessed on a five-phase machine in an experimental setup. Training data for the neural network are obtained from step-test experiments.
Minimax Linear Regulator Problems for Positive Systems
Explicit solutions to optimal control problems are rarely obtainable. Of particular interest are the explicit solutions derived for minimax problems, providing a framework to address adversarial conditions and uncertainty. This work considers a multi-disturbance minimax Linear Regulator (LR) framework for positive linear time-invariant systems in continuous time, which, analogous to the Linear-Quadratic Regulator (LQR) problem, can be utilized for the stabilization of positive systems. The problem is studied for nonnegative and state-bounded disturbances. Dynamic programming theory is leveraged to derive explicit solutions to the minimax LR problem for both finite and infinite time horizons. In addition, a fixed-point method is proposed that computes the solution for the infinite horizon case, and the minimum L1-induced gain of the system is studied. We motivate the prospective scalability properties of our framework with a large-scale water management network.
comment: 30 pages, 6 figures. Accepted for publication in IEEE Transactions on Automatic Control
Integrating a Causal Foundation Model into a Prescriptive Maintenance Framework for Optimising Production-Line OEE
The transition to prescriptive maintenance (PsM) in manufacturing is critically constrained by a dependence on predictive models. Such purely predictive models tend to capture statistical associations in the data without identifying the underlying causal drivers of failure, which can lead to costly misdiagnoses and ineffective measures. This fundamental limitation results in a key challenge: while we can predict that a failure may occur, we lack a systematic method to understand why a failure occurs. This paper proposes a model based on causal machine learning to bridge this gap. Our objective is to move beyond diagnosis to active prescription by simulating and evaluating potential fixes to optimise KPIs such as Overall Equipment Effectiveness (OEE). For this purpose, a pre-trained causal foundation model is used as a "what-if" simulator to estimate the effects of potential fixes. By estimating the causal effect of each intervention on system-level KPIs, specific actions can be recommended for the production line. This can help identify plausible root causes and quantify their operational impact. The model is evaluated using semi-synthetic manufacturing data and compared with non-causal and causal baseline machine learning models. This paper provides a technical basis for a human-centred approach, allowing engineers to test potential solutions in a causal environment to make more effective operational decisions and reduce costly downtimes.
comment: 9 pages, 3 images, 1 table, conference paper
The Phantom of Davis-Wielandt Shell: A Unified Framework for Graphical Stability Analysis of MIMO LTI Systems
This paper presents a unified framework based on Davis-Wielandt (DW) shell for graphical stability analysis of multi-input and multi-output linear time-invariant feedback systems. Connections between DW shells and various graphical representations, as well as gain and phase measures, are established through an intuitive geometric perspective. Within this framework, we map the relationships and relative conservatism among various separation conditions. A rotated scaled relative graph ($θ$-SRG) concept is proposed as a mixed gain-phase representation, from which a closed-loop stability criterion is derived and shown to be the least conservative among the existing 2-D graphical conditions for bi-component feedback loops. We also propose a reliable and generalizable algorithm for visualizing the $θ$-SRGs and include a system example to demonstrate the reduced conservatism of the proposed condition.
comment: 16 pages, 12 figures. This version corrects some typos that may lead to confusion
Input-to-State Stable Coupled Oscillator Networks for Closed-form Model-based Control in Latent Space NeurIPS 2024
Even though a variety of methods have been proposed in the literature, efficient and effective latent-space control (i.e., control in a learned low-dimensional space) of physical systems remains an open challenge. We argue that a promising avenue is to leverage powerful and well-understood closed-form strategies from control theory literature in combination with learned dynamics, such as potential-energy shaping. We identify three fundamental shortcomings in existing latent-space models that have so far prevented this powerful combination: (i) they lack the mathematical structure of a physical system, (ii) they do not inherently conserve the stability properties of the real systems, (iii) these methods do not have an invertible mapping between input and latent-space forcing. This work proposes a novel Coupled Oscillator Network (CON) model that simultaneously tackles all these issues. More specifically, (i) we show analytically that CON is a Lagrangian system - i.e., it possesses well-defined potential and kinetic energy terms. Then, (ii) we provide formal proof of global Input-to-State stability using Lyapunov arguments. Moving to the experimental side, we demonstrate that CON reaches SoA performance when learning complex nonlinear dynamics of mechanical systems directly from images. An additional methodological innovation contributing to achieving this third goal is an approximated closed-form solution for efficient integration of network dynamics, which eases efficient training. We tackle (iii) by approximating the forcing-to-input mapping with a decoder that is trained to reconstruct the input based on the encoded latent space force. Finally, we show how these properties enable latent-space control. We use an integral-saturated PID with potential force compensation and demonstrate high-quality performance on a soft robot using raw pixels as the only feedback information.
comment: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) spotlight, 50 pages
A Deep Learning-Based Method for Power System Resilience Evaluation
Power system resilience is vital to modern society, as outages caused by extreme weather can severely disrupt communities. Existing statistical and simulation-based methods for resilience quantification are either retrospective or rely on simplified physical models, limiting their applicability. This paper proposes a deep learning-based framework that integrates historical outage and weather data to predict event-level resilience, measured using the resilience trapezoid method. The trained model is then applied to a benchmark weather dataset to estimate regional resilience, with optional socioeconomic and demographic factors incorporated as weighting terms when policymakers wish to emphasize the needs of specific population groups. The effectiveness of the framework is first validated on simulated outage records, showing strong agreement between predicted and simulated resilience values. It is then applied to real historical outage data to assess the resilience of actual power systems. Beyond evaluation, the results can guide targeted investments in distributed energy resources to improve resilience in vulnerable regions.
comment: Submitted to Applied Energy Oct 31, 2025
An LLM-Assisted Multi-Agent Control Framework for Roll-to-Roll Manufacturing Systems
Roll-to-roll (R2R) manufacturing requires precise tension and velocity control to ensure product quality, yet controller commissioning and adaptation remain time-intensive processes dependent on expert knowledge. This paper presents an LLM-assisted multi-agent framework that automates control system design and adaptation for R2R systems while maintaining safety. The framework operates through five phases: system identification from operational data, automated controller selection and tuning, sim-to-real adaptation with safety verification, continuous monitoring with diagnostic capabilities, and periodic model refinement. Experimental validation on an R2R system demonstrates successful tension regulation and velocity tracking under significant model uncertainty, with the framework achieving performance convergence through iterative adaptation. The approach reduces manual tuning effort while providing transparent diagnostic information for maintenance planning, offering a practical pathway for integrating AI-assisted automation in manufacturing control systems.
Beyond Collision Cones: Dynamic Obstacle Avoidance for Nonholonomic Robots via Dynamic Parabolic Control Barrier Functions ICRA
Control Barrier Functions (CBFs) are a powerful tool for ensuring the safety of autonomous systems, yet applying them to nonholonomic robots in cluttered, dynamic environments remains an open challenge. State-of-the-art methods often rely on collision-cone or velocity-obstacle constraints which, by only considering the angle of the relative velocity, are inherently conservative and can render the CBF-based quadratic program infeasible, particularly in dense scenarios. To address this issue, we propose a Dynamic Parabolic Control Barrier Function (DPCBF) that defines the safe set using a parabolic boundary. The parabola's vertex and curvature dynamically adapt based on both the distance to an obstacle and the magnitude of the relative velocity, creating a less restrictive safety constraint. We prove that the proposed DPCBF is valid for a kinematic bicycle model subject to input constraints. Extensive comparative simulations demonstrate that our DPCBF-based controller significantly enhances navigation success rates and QP feasibility compared to baseline methods. Our approach successfully navigates through dense environments with up to 100 dynamic obstacles, scenarios where collision cone-based methods fail due to infeasibility.
comment: The first two authors contributed equally to this work. 2026 IEEE International Conference on Robotics and Automation (ICRA). Project page: https://www.taekyung.me/dpcbf
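For readers unfamiliar with CBF-based quadratic programs, the following is a minimal sketch of a generic safety filter. It is not the DPCBF from the abstract above: it assumes a 2-D single integrator rather than a kinematic bicycle, a circular safe-set boundary rather than a parabolic one, and no input constraints. The function name `cbf_filter` and all numeric values are hypothetical. With a single affine constraint, the QP has a closed-form solution, which is what the code uses.

```python
# Generic CBF safety filter for a 2-D single integrator x_dot = u.
# NOT the paper's DPCBF or bicycle model; names and values are hypothetical.

def cbf_filter(x, u_nom, obstacle, radius, gamma=1.0):
    """Return the input closest to u_nom (Euclidean norm) that satisfies
    grad_h(x) . u + gamma * h(x) >= 0, with h(x) = ||x - obstacle||^2 - radius^2.
    Assumes x does not coincide with the obstacle centre."""
    dx = [x[0] - obstacle[0], x[1] - obstacle[1]]
    h = dx[0] ** 2 + dx[1] ** 2 - radius ** 2
    a = [2.0 * dx[0], 2.0 * dx[1]]           # gradient of h at x
    slack = a[0] * u_nom[0] + a[1] * u_nom[1] + gamma * h
    if slack >= 0.0:
        return list(u_nom)                    # nominal input already satisfies the constraint
    norm_sq = a[0] ** 2 + a[1] ** 2
    scale = -slack / norm_sq                  # closed-form solution of the one-constraint QP
    return [u_nom[0] + scale * a[0], u_nom[1] + scale * a[1]]


# Robot left of the obstacle, nominal input drives straight at it.
u_safe = cbf_filter(x=[-2.0, 0.0], u_nom=[1.0, 0.0], obstacle=[0.0, 0.0], radius=1.0)
print(u_safe)
```

In this sketch the filter only slows the approach toward the obstacle. With many obstacles the constraints stack up and the QP can become infeasible, which is the failure mode the abstract attributes to collision-cone constraints.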
Utility Theory based Cognitive Modeling in the Application of Robotics: A Survey
Cognitive modeling, which explores the essence of cognition, including motivation, emotion, and perception, has been widely applied in artificial intelligence (AI) agent domains such as robotics. From the computational perspective, various cognitive functionalities have been developed through utility theory to provide a detailed and process-based understanding for specifying corresponding computational models of representations, mechanisms, and processes. Especially for decision-making and learning in multi-agent/robot systems (MAS/MRS), a suitable cognitive model can guide agents in choosing reasonable strategies to achieve their current needs and learning to cooperate and organize their behaviors, optimizing the system's utility, building stable and reliable relationships, and guaranteeing each group member's sustainable development, similar to human society. This survey examines existing robotic systems for developmental cognitive models in the context of utility theory. We discuss the evolution of cognitive modeling in robotics from behavior-based robotics (BBR) and cognitive architectures to the properties of value systems in robots, such as the studies on motivations as artificial value systems, and the utility theory based cognitive modeling for generating and updating strategies in robotic interactions. Then, we examine the extent to which existing value systems support the application of robotics from an AI agent cognitive modeling perspective, including single-agent and multi-agent systems, trust among agents, and human-robot interaction. Finally, we survey the existing literature on current value systems in relevant fields and propose several promising research directions, along with some open problems that we deem necessary for further investigation.
Event-Driven Safe and Resilient Control of Automated and Human-Driven Vehicles under EU-FDI Attacks
This paper studies the safe and resilient control of Connected and Automated Vehicles (CAVs) operating in mixed traffic environments where they must interact with Human-Driven Vehicles (HDVs) under uncertain dynamics and exponentially unbounded false data injection (EU-FDI) attacks. These attacks pose serious threats to safety-critical applications. While resilient control strategies can mitigate adversarial effects, they often overlook collision avoidance requirements. Conversely, safety-critical approaches tend to assume nominal operating conditions and lack resilience to adversarial inputs. To address these challenges, we propose an event-driven safe and resilient (EDSR) control framework that integrates event-driven Control Barrier Functions (CBFs) and Control Lyapunov Functions (CLFs) with adaptive attack-resilient control. The framework further incorporates data-driven estimation of HDV behaviors to ensure safety and resilience against EU-FDI attacks. Specifically, we focus on the lane-changing maneuver of CAVs in the presence of unpredictable HDVs and EU-FDI attacks on acceleration inputs. The event-driven approach reduces computational load while maintaining real-time safety guarantees. Simulation results, including comparisons with conventional safety-critical control methods that lack resilience, validate the effectiveness and robustness of the proposed EDSR framework in achieving collision-free maneuvers, stable velocity regulation, and resilient operation under adversarial conditions.
Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy
The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap -- a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack -- and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model's own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration -- empirical validation begins Australian Grand Prix, 8 March 2026.
comment: 17 pages. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry begins Australian Grand Prix, 8 March 2026. Paper 1 of 3. ResearchGate preprint: DOI 10.13140/RG.2.2.16034.08644
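The belief-state inference described in the abstract above rests on the standard HMM forward (filtering) recursion. The following is a minimal sketch of that recursion on a toy three-state model. The abstract's model has 30 states and five telemetry signals; the states, matrices, and observation symbols below are hypothetical and chosen only to keep the example small.

```python
# Toy forward (filtering) recursion for a discrete HMM.
# All states, matrices, and observations below are hypothetical.

def forward_step(belief, transition, emission, obs):
    """One filtering update: predict with the transition model, then
    reweight by the likelihood of the observed symbol and normalise."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalised = [predicted[j] * emission[j][obs] for j in range(n)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hidden states: 0 = low, 1 = medium, 2 = high charge (illustrative only).
TRANSITION = [[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]]
# EMISSION[state][symbol]: probability of each observed symbol given the state.
EMISSION = [[0.7, 0.2, 0.1],
            [0.2, 0.6, 0.2],
            [0.1, 0.2, 0.7]]

belief = [1 / 3, 1 / 3, 1 / 3]        # uniform prior over the hidden state
for obs in [2, 2, 2]:                 # three consecutive "high" readings
    belief = forward_step(belief, TRANSITION, EMISSION, obs)
print(belief)
```

The resulting probability vector is the kind of belief state that a downstream policy, such as the DQN in the abstract, could take as input in place of a single point estimate.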
Verifying Nonlinear Neural Feedback Systems using Polyhedral Enclosures
As dynamical systems equipped with neural network controllers (neural feedback systems) become increasingly prevalent, it is critical to develop methods to ensure their safe operation. Verifying safety requires extending control theoretic analysis methods to these systems. Although existing techniques can efficiently handle linear neural feedback systems, relatively few scalable methods address the nonlinear case. We propose a novel algorithm for forward reachability analysis of nonlinear neural feedback systems. The approach leverages the structure of the nonlinear transition functions of the systems to compute tight polyhedral enclosures (i.e., abstractions). These enclosures, combined with the neural controller, are then encoded as a mixed-integer linear program (MILP). Optimizing this MILP yields a sound over-approximation of the forward-reachable set. We evaluate our algorithm on representative benchmarks and demonstrate an order of magnitude improvement over the current state of the art.
CONQURE: A Co-Execution Environment for Quantum and Classical Resources
Cutting-edge classical computing today combines CPU-based computing with a strong reliance on accelerators. In particular, high-performance computing (HPC) and machine learning (ML) rely heavily on acceleration via GPUs for numerical kernels. In the future, acceleration via quantum devices may complement GPUs for kernels where algorithms provide quantum advantage, i.e., significant speedups over classical algorithms. Computing with quantum kernels mapped onto quantum processing units (QPUs) requires seamless integration into HPC and ML. However, quantum offloading onto HPC/cloud lacks open-source software infrastructure. For classical algorithms, parallelization standards such as OpenMP, MPI, or CUDA exist. In contrast, a lack of quantum abstractions currently limits the adoption of quantum acceleration in practical applications, creating a gap between quantum algorithm development and practical HPC integration. Such integration needs to extend to efficient quantum offloading of kernels, which further requires scheduling of quantum resources, control of QPU kernel execution, tracking of QPU results, providing results to classical calling contexts, and coordination with HPC scheduling. This work proposes CONQURE, a co-execution environment for quantum and classical resources. CONQURE is a fully open-source cloud queue framework that presents a novel modular scheduling framework allowing users to offload OpenMP quantum kernels to QPUs as quantum circuits, to relay results back to calling contexts in classical computing, and to schedule quantum resources via our CONQURE API. We show our API has a low overhead averaging 12.7 ms in our tests, and we demonstrate functionality on an ion-trap device. Our OpenMP extension enables the parallelization of VQE runs with a 3.1X reduction in runtime.
Provable Acceleration of Distributed Optimization with Local Updates
In conventional distributed optimization, each agent performs a single local update between two communication rounds with its neighbors to synchronize solutions. Inspired by the success of using multiple local updates in federated learning, incorporating local updates into distributed optimization has recently attracted increasing attention. However, unlike federated learning, where multiple local updates can accelerate learning by improving gradient estimation under mini-batch settings, it remains unclear whether similar benefits hold in distributed optimization when gradients are exact. Moreover, existing theoretical results typically require reducing the step size when multiple local updates are employed, which can entirely offset any potential benefit of these additional local updates and obscure their true impact on convergence. In this paper, we focus on the classic DIGing algorithm and leverage the tight performance bounds provided by Performance Estimation Problems (PEP) to show that incorporating local updates can indeed accelerate distributed optimization. To the best of our knowledge, this is the first rigorous demonstration of such acceleration for a broad class of objective functions. Our analysis further reveals that, under an appropriate step size, performing only two local updates is sufficient to achieve the maximal possible improvement, and that additional local updates provide no further gains. Because more updates increase computational cost, these findings offer practical guidance for efficient implementation. Extensive experiments on both synthetic and real-world datasets corroborate the theoretical findings.
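As background for the abstract above, the following is a minimal sketch of the baseline DIGing iteration (gradient tracking with one local update per communication round). It does not reproduce the paper's multi-local-update variant or its PEP analysis. The scalar quadratic objectives, the four-agent ring, the mixing weights, and the step size are assumptions chosen for illustration.

```python
# Baseline DIGing (gradient tracking) on scalar quadratics
# f_i(x) = 0.5 * a_i * (x - b_i)^2. One local update per communication round.
# Problem data, mixing matrix, and step size are hypothetical.

def diging(a, b, W, alpha, iters):
    n = len(a)

    def grad(i, x):
        return a[i] * (x - b[i])

    x = [0.0] * n
    g = [grad(i, x[i]) for i in range(n)]
    y = g[:]                                   # tracker initialised to the local gradients
    for _ in range(iters):
        # Consensus step on the iterates, then move along the tracked gradient.
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - alpha * y[i] for i in range(n)]
        g_new = [grad(i, x_new[i]) for i in range(n)]
        # Consensus step on the tracker, corrected by the local gradient change.
        y = [sum(W[i][j] * y[j] for j in range(n)) + g_new[i] - g[i] for i in range(n)]
        x, g = x_new, g_new
    return x


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, -1.0, 2.0, 0.5]
# Doubly stochastic mixing matrix for a ring of four agents.
W = [[1 / 3 if j in (i, (i - 1) % 4, (i + 1) % 4) else 0.0 for j in range(4)]
     for i in range(4)]

x_final = diging(a, b, W, alpha=0.02, iters=3000)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)   # minimiser of the summed objective
print(x_final, x_star)
```

Each agent's iterate should approach the common minimiser `x_star` even though no agent sees the other agents' objectives; only iterates and trackers are exchanged with neighbours.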
A Novel Adaptive Formation Control Strategy for Teams of Unmanned Vehicles Under Complete Dynamic Uncertainty
Modern unmanned systems, including aerial, terrestrial, and underwater vehicles, are increasingly utilized in dynamic and unpredictable environments, where the presence of modeling uncertainties necessitates the development of robust and adaptive control strategies. In this work, we address the formation control problem for a team of unmanned systems with completely uncertain dynamics under a virtual leader-following framework. We propose a novel cooperative adaptive formation control algorithm, designed using artificial neural networks to achieve accurate formation tracking. The effectiveness of the proposed control strategy is established through rigorous theoretical analysis, which guarantees uniform ultimate boundedness of the overall system and exponential convergence of the tracking errors to a small neighborhood of zero. Numerical simulations further validate the effectiveness of the proposed formation control algorithm, demonstrating that the followers accurately track the desired formation trajectory relative to the leader, even in the presence of complete system uncertainties. This work suggests potential application in coordinating multiple unmanned airships for tasks such as persistent aerial surveillance, atmospheric data collection, and wide-area communication support, where adaptability to time-varying and uncertain dynamics is essential.
comment: 8 pages, 6 figures, Conference
Personalized Collaborative Learning with Affinity-Based Variance Reduction ICLR 2026
Multi-agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels -- gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of $\max\{n^{-1}, δ\}$, where $n$ is the number of agents and $δ\in[0,1]$ measures their heterogeneity. This affinity-based acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.
comment: Published as a conference paper at ICLR 2026
Prognostics for Autonomous Deep-Space Habitat Health Management under Multiple Unknown Failure Modes
Deep-space habitats (DSHs) are safety-critical systems that must operate autonomously for long periods, often beyond the reach of ground-based maintenance or expert intervention. Monitoring health and anticipating failures are essential for safe operations. Prognostics based on remaining useful life (RUL) prediction support this goal by estimating how long a subsystem can operate before failure. Critical DSH subsystems, including environmental control and life support, power generation, and thermal control, are monitored by many sensors and can degrade through multiple failure modes. In practice, these failure modes are often unknown, and the sensors providing useful information may vary across modes, making accurate RUL prediction challenging when failure data are unlabeled. We propose an unsupervised prognostics framework for RUL prediction that jointly identifies latent failure modes and selects informative sensors using unlabeled run-to-failure data. The framework has two phases: offline sensor selection and failure mode identification, and online diagnosis and RUL prediction. In the offline phase, failure times are modeled using a mixture of Gaussian regressions, and an Expectation-Maximization algorithm simultaneously clusters degradation trajectories and selects mode-specific sensors. In the online phase, low-dimensional features from selected sensors diagnose the active failure mode and predict RUL through a weighted functional regression model. The framework is evaluated on a simulated dataset capturing key telemetry challenges in DSH systems and on the NASA C-MAPSS benchmark. Results show improved prediction accuracy and clearer identification of informative sensors and failure modes than existing methods.
comment: Manuscript under review
Distributed Model Predictive Control for Dynamic Cooperation of Multi-Agent Systems
We propose a distributed model predictive control (MPC) framework for coordinating heterogeneous, nonlinear multi-agent systems under individual and coupling constraints. The cooperative task is encoded as a shared objective function minimized collectively by the agents. Each agent optimizes an artificial reference as an intermediate step towards the cooperative objective, along with a control input to track it. We establish recursive feasibility, asymptotic stability, and transient performance bounds under suitable assumptions. The solution to the cooperative task is not predetermined but emerges from the optimized interactions of the agents. We demonstrate the framework on numerical examples inspired by satellite constellation control, collision-free narrow-passage traversal, and coordinated quadrotor flight.
IMAS$^2$: Joint Agent Selection and Information-Theoretic Coordinated Perception In Dec-POMDPs
We study the problem of jointly selecting sensing agents and synthesizing decentralized active perception policies for the chosen subset of agents within a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) framework. Our approach employs a two-layer optimization structure. In the inner layer, we introduce information-theoretic metrics, defined by the mutual information between the unknown trajectories or some hidden property in the environment and the collective partial observations in the multi-agent system, as a unified objective for active perception problems. We employ various optimization methods to obtain optimal sensor policies that maximize mutual information for distinct active perception tasks. In the outer layer, we prove that under certain conditions, the information-theoretic objectives are monotone and submodular with respect to the subset of observations collected from multiple agents. We then exploit this property to design an IMAS$^2$ (Information-theoretic Multi-Agent Selection and Sensing) algorithm for joint sensing agent selection and sensing policy synthesis. However, since the policy search space is infinite, we adapt the classical Nemhauser-Wolsey argument to prove that the proposed IMAS$^2$ algorithm can provide a tight $(1 - 1/e)$-guarantee on the performance. Finally, we demonstrate the effectiveness of our approach on a multi-agent cooperative perception task in a grid-world environment.
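The $(1 - 1/e)$ guarantee mentioned above is the classical bound for greedy maximization of a monotone submodular function under a cardinality constraint. The following is a minimal sketch of that greedy selection loop. It is not the IMAS$^2$ algorithm: the objective here is a set-coverage function standing in for mutual information, there is no policy synthesis, and the agent names and covered cells are hypothetical.

```python
# Greedy selection for a monotone submodular objective under a cardinality limit.
# Coverage is used as a stand-in objective; all data are hypothetical.

def greedy_select(candidates, k, value):
    """Pick up to k candidates, each time adding the one with the largest
    marginal gain in value(selected set)."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for c in candidates:
            if c in chosen:
                continue
            gain = value(chosen + [c]) - value(chosen)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:          # no remaining candidate improves the objective
            break
        chosen.append(best)
    return chosen


# Cells of a grid that each sensing agent can observe (illustrative only).
COVERAGE = {
    "agent_1": {1, 2, 3},
    "agent_2": {3, 4},
    "agent_3": {4, 5, 6, 7},
    "agent_4": {1, 7},
}


def covered_cells(agents):
    """Number of distinct cells observed by the given agents."""
    cells = set()
    for agent in agents:
        cells |= COVERAGE[agent]
    return len(cells)


selected = greedy_select(list(COVERAGE), k=2, value=covered_cells)
print(selected, covered_cells(selected))
```

Coverage is monotone and submodular, so the greedy choice is within a factor $(1 - 1/e)$ of the best two-agent subset. In this small instance it happens to cover all seven cells.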
Robotics
Perceptive Variable-Timing Footstep Planning for Humanoid Locomotion on Disconnected Footholds
Many real-world walking scenarios contain obstacles and unsafe ground patches (e.g., slippery or cluttered areas), leaving a disconnected set of admissible footholds that can be modeled as stepping-stone-like regions. We propose an onboard, perceptive mixed-integer model predictive control framework that jointly plans foot placement and step duration using step-to-step Divergent Component of Motion (DCM) dynamics. Ego-centric depth images are fused into a probabilistic local heightmap, from which we extract a union of convex steppable regions. Region membership is enforced with binary variables in a mixed-integer quadratic program (MIQP). To keep the optimization tractable while certifying safety, we embed capturability bounds in the DCM space: a lateral one-step condition (preventing leg crossing) and a sagittal infinite-step bound that limits unstable growth. We further re-plan within the step by back-propagating the measured instantaneous DCM to update the initial DCM, improving robustness to model mismatch and external disturbances. We evaluate the approach in simulation on Digit on randomized stepping-stone fields, including external pushes. The planner generates terrain-aware, dynamically consistent footstep sequences with adaptive timing and millisecond-level solve times.
comment: 8 pages, 5 figures, 1 table, 3 algorithms. Supplemental video at: https://youtu.be/5EeuBnSb66s
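The step-to-step DCM dynamics referenced in the abstract above can be illustrated with the textbook linear inverted pendulum relation: with the stance foot fixed at p for a step of duration T, the DCM evolves as xi(T) = p + (xi(0) - p) * exp(omega * T), where omega = sqrt(g / z). The following is a minimal one-dimensional sketch under that assumption. The paper's exact formulation, constraints, and parameters are not given in the abstract, so the function name and default values are hypothetical.

```python
import math

# One-dimensional DCM propagation over a single step under the linear
# inverted pendulum model. Function name and default values are hypothetical.

def dcm_after_step(xi0, p, T, z_com=0.9, g=9.81):
    """DCM at the end of a step of duration T with the stance foot at p."""
    omega = math.sqrt(g / z_com)              # natural frequency of the pendulum
    return p + (xi0 - p) * math.exp(omega * T)


# The offset between DCM and foot grows exponentially with step duration,
# which is why foot placement and timing are coupled decision variables.
short_step = dcm_after_step(xi0=0.10, p=0.0, T=0.3)
long_step = dcm_after_step(xi0=0.10, p=0.0, T=0.5)
print(short_step, long_step)
```

Because the end-of-step DCM depends on the product of the foot offset and exp(omega * T), choosing both p and T jointly gives the planner more room to keep the DCM within capturability bounds than adjusting foot placement alone.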
Underwater Embodied Intelligence for Autonomous Robots: A Constraint-Coupled Perspective on Planning, Control, and Deployment
Autonomous underwater robots are increasingly deployed for environmental monitoring, infrastructure inspection, subsea resource exploration, and long-horizon exploration. Yet, despite rapid advances in learning-based planning and control, reliable autonomy in real ocean environments remains fundamentally constrained by tightly coupled physical limits. Hydrodynamic uncertainty, partial observability, bandwidth-limited communication, and energy scarcity are not independent challenges; they interact within the closed perception-planning-control loop and often amplify one another over time. This Review develops a constraint-coupled perspective on underwater embodied intelligence, arguing that planning and control must be understood within tightly coupled sensing, communication, coordination, and resource constraints in real ocean environments. We synthesize recent progress in reinforcement learning, belief-aware planning, hybrid control, multi-robot coordination, and foundation-model integration through this embodied perspective. Across representative application domains, we show how environmental monitoring, inspection, exploration, and cooperative missions expose distinct stress profiles of cross-layer coupling. To unify these observations, we introduce a cross-layer failure taxonomy spanning epistemic, dynamic, and coordination breakdowns, and analyze how errors cascade across autonomy layers under uncertainty. Building on this structure, we outline research directions toward physics-grounded world models, certifiable learning-enabled control, communication-aware coordination, and deployment-aware system design. By internalizing constraint coupling rather than treating it as an external disturbance, underwater embodied intelligence may evolve from performance-driven adaptation toward resilient, scalable, and verifiable autonomy under real ocean conditions.
comment: This article is currently under review
Relating Reinforcement Learning to Dynamic Programming-Based Planning
This paper bridges some of the gap between optimal planning and reinforcement learning (RL), both of which share roots in dynamic programming applied to sequential decision making or optimal control. Whereas planning typically favors deterministic models, goal termination, and cost minimization, RL tends to favor stochastic models, infinite-horizon discounting, and reward maximization in addition to learning-related parameters such as the learning rate and greediness factor. A derandomized version of RL is developed, analyzed, and implemented to yield performance comparisons with value iteration and Dijkstra's algorithm using simple planning models. Next, mathematical analysis shows: 1) conditions under which cost minimization and reward maximization are equivalent, 2) conditions for equivalence of single-shot goal termination and infinite-horizon episodic learning, and 3) conditions under which discounting causes goal achievement to fail. The paper then advocates for defining and optimizing true cost, rather than inserting arbitrary parameters to guide operations. Performance studies are then extended to the stochastic case, using planning-oriented criteria and comparing value iteration to RL with learning rates and greediness factors.
comment: 43 pages, 8 figures
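To make the comparison in the abstract above concrete, the following is a minimal sketch, not the paper's implementation, of undiscounted value iteration and Dijkstra's algorithm computing the same cost-to-go on a small deterministic model with goal termination. The graph `EDGES`, the goal `GOAL`, and both function names are hypothetical, and non-negative edge costs are assumed.

```python
import heapq

# Hypothetical deterministic planning model: EDGES[u] = [(v, cost), ...]
EDGES = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],  # goal state: terminal, zero cost-to-go
}
GOAL = "D"


def value_iteration(edges, goal, sweeps=100):
    """Undiscounted cost-to-go via repeated Bellman backups."""
    cost = {s: (0.0 if s == goal else float("inf")) for s in edges}
    for _ in range(sweeps):
        changed = False
        for s, outs in edges.items():
            if s == goal or not outs:
                continue
            best = min(c + cost[v] for v, c in outs)
            if best < cost[s]:
                cost[s] = best
                changed = True
        if not changed:  # fixed point reached
            break
    return cost


def dijkstra_to_goal(edges, goal):
    """Cost-to-go via Dijkstra's algorithm on the reversed graph."""
    reverse = {s: [] for s in edges}
    for u, outs in edges.items():
        for v, c in outs:
            reverse[v].append((u, c))
    cost = {s: float("inf") for s in edges}
    cost[goal] = 0.0
    queue = [(0.0, goal)]
    while queue:
        d, s = heapq.heappop(queue)
        if d > cost[s]:  # stale queue entry
            continue
        for u, c in reverse[s]:
            if d + c < cost[u]:
                cost[u] = d + c
                heapq.heappush(queue, (d + c, u))
    return cost


vi_cost = value_iteration(EDGES, GOAL)
dj_cost = dijkstra_to_goal(EDGES, GOAL)
```

On this toy model both routines return the same table; value iteration sweeps every state repeatedly, while Dijkstra's algorithm settles each state once.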
Physics-infused Learning for Aerial Manipulator in Winds and Near-Wall Environments
Aerial manipulation (AM) expands UAV capabilities beyond passive observation to contact-based operations at high altitudes and in otherwise inaccessible environments. Although recent advances show promise, most AM systems are developed in controlled settings that overlook key aerodynamic effects. Simplified thrust models are often insufficient to capture the nonlinear wind disturbances and proximity-induced flow variations present in real-world environments near infrastructure, while high-fidelity CFD methods remain impractical for real-time use. Learning-based models are computationally efficient at inference, but often struggle to generalize to unseen conditions. This paper combines both approaches by integrating a physics-based blade-element model with a learning-based residual force estimator, along with a rotor-speed allocation strategy for disturbance compensation, resulting in a unified control framework. The blade-element model computes per-rotor aerodynamic forces under wind and provides a refined feedforward disturbance estimate. A learning-based estimator then predicts the residual forces not captured by the model, enabling compensation for unmodeled aerodynamic effects. An online adaptation mechanism further updates the residual-force prediction and rotor-speed allocation jointly to reduce the mismatch between desired and realized thrust. We evaluate this framework in both free-flight and wall-contact tracking tasks in a simulated near-wall wind environment. Results demonstrate improved disturbance estimation and trajectory-tracking accuracy over conventional approaches, enabling robust wall-contact execution under challenging aerodynamic conditions.
Reasoning Knowledge-Gap in Drone Planning via LLM-based Active Elicitation
Human-AI joint planning in Unmanned Aerial Vehicles (UAVs) typically relies on control handover when facing environmental uncertainties, which is often inefficient and cognitively demanding for non-expert operators. To address this, we propose a novel framework that shifts the collaboration paradigm from control takeover to active information elicitation. We introduce the Minimal Information Neuro-Symbolic Tree (MINT), a reasoning mechanism that explicitly structures knowledge gaps regarding obstacles and goals into a queryable format. By leveraging large language models, our system formulates optimal binary queries to resolve specific ambiguities with minimal human interaction. We demonstrate the efficacy of this approach through a comprehensive workflow integrating a vision-language model for perception, voice interfaces, and a low-level UAV control module in both high-fidelity NVIDIA Isaac simulations and real-world deployments. Experimental results show that our method achieves a significant improvement in the success rate for complex search-and-rescue tasks while significantly reducing the frequency of human interaction compared to exhaustive querying baselines.
Uncertainty Mitigation and Intent Inference: A Dual-Mode Human-Machine Joint Planning System
Effective human-robot collaboration in open-world environments requires joint planning under uncertain conditions. However, existing approaches often treat humans as passive supervisors, preventing autonomous agents from becoming human-like teammates that can actively model teammate behaviors, reason about knowledge gaps, query, and elicit responses through communication to resolve uncertainties. To address these limitations, we propose a unified human-robot joint planning system designed to tackle dual sources of uncertainty: task-relevant knowledge gaps and latent human intent. Our system operates in two complementary modes. First, an uncertainty-mitigation joint planning module enables two-way conversations to resolve semantic ambiguity and object uncertainty. It utilizes an LLM-assisted active elicitation mechanism and a hypothesis-augmented A^* search, subsequently computing an optimal querying policy via dynamic programming to minimize interaction and verification costs. Second, a real-time intent-aware collaboration module maintains a probabilistic belief over the human's latent task intent via spatial and directional cues, enabling dynamic, coordination-aware task selection for agents without explicit communication. We validate the proposed system in both Gazebo simulations and real-world UAV deployments integrated with a Vision-Language Model (VLM)-based 3D semantic perception pipeline. Experimental results demonstrate that the system significantly cuts the interaction cost by 51.9% in uncertainty-mitigation planning and reduces the task execution time by 25.4% in intent-aware cooperation compared to the baselines.
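The intent-aware mode described above maintains a probabilistic belief over the human's latent task intent from spatial and directional cues. A minimal sketch of one such Bayesian update follows. The likelihood `exp(kappa * cos(angle))`, the function name, the goal names, and the value of `kappa` are all assumptions chosen for illustration and are not taken from the paper.

```python
import math


def update_intent_belief(belief, agent_pos, heading, goals, kappa=4.0):
    """One Bayesian update of P(goal) from a directional cue.

    Assumed likelihood: exp(kappa * cos(angle)), where angle is the
    difference between the human's heading and the bearing to the goal.
    Headings that point toward a goal make that goal more probable.
    """
    posterior = {}
    for name, (gx, gy) in goals.items():
        dx, dy = gx - agent_pos[0], gy - agent_pos[1]
        norm = math.hypot(dx, dy) or 1.0  # avoid division by zero at the goal
        cos_angle = (dx * math.cos(heading) + dy * math.sin(heading)) / norm
        posterior[name] = belief[name] * math.exp(kappa * cos_angle)
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}


# Hypothetical scene: two candidate tasks, human at the origin facing +x.
goals = {"shelf": (5.0, 0.0), "door": (0.0, 5.0)}
belief = {"shelf": 0.5, "door": 0.5}
belief = update_intent_belief(belief, (0.0, 0.0), heading=0.0, goals=goals)
```

Repeating the update as new headings arrive concentrates the belief on the goal the human keeps moving toward, which a task selector could then use to avoid duplicating the human's work.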
Preference-Conditioned Reinforcement Learning for Space-Time Efficient Online 3D Bin Packing
Robotic bin packing is widely deployed in warehouse automation, with current systems achieving robust performance through heuristic and learning-based strategies. These systems must balance compact placement with rapid execution, where selecting alternative items or reorienting them can improve space utilization but introduce additional time. We propose a selection-based formulation that explicitly reasons over this trade-off: at each step, the robot evaluates multiple candidate actions, weighing expected packing benefit against estimated operational time. This enables time-aware strategies that selectively accept increased operational time when it yields meaningful spatial improvements. Our method, STEP (Space-Time Efficient Packing), uses a preference-conditioned, Transformer-based reinforcement learning policy, and allows generalization across candidate set sizes and integration with standard placement modules. It achieves a 44% reduction in operational time without compromising packing density. Additional material is available at https://step-packing.github.io.
comment: 8 pages, 5 figures. Accepted to IEEE International Conference on Robotics and Automation 2026. Project Website: https://step-packing.github.io
MWM: Mobile World Models for Action-Conditioned Consistent Prediction
World models enable planning in imagined future predicted space, offering a promising framework for embodied navigation. However, existing navigation world models often lack action-conditioned consistency, so visually plausible predictions can still drift under multi-step rollout and degrade planning. Moreover, efficient deployment requires few-step diffusion inference, but existing distillation methods do not explicitly preserve rollout consistency, creating a training-inference mismatch. To address these challenges, we propose MWM, a mobile world model for planning-based image-goal navigation. Specifically, we introduce a two-stage training framework that combines structure pretraining with Action-Conditioned Consistency (ACC) post-training to improve action-conditioned rollout consistency. We further introduce Inference-Consistent State Distillation (ICSD) for few-step diffusion distillation with improved rollout consistency. Our experiments on benchmark and real-world tasks demonstrate consistent gains in visual fidelity, trajectory accuracy, planning success, and inference efficiency. Code: https://github.com/AIGeeksGroup/MWM. Website: https://aigeeksgroup.github.io/MWM.
Toward Global Intent Inference for Human Motion by Inverse Reinforcement Learning
This paper investigates whether a single, unified cost function can explain and predict human reaching movements, in contrast with existing approaches that rely on subject- or posture-specific optimization criteria. Using the Minimal Observation Inverse Reinforcement Learning (MO-IRL) algorithm, together with a seven-dimensional set of candidate cost terms, we efficiently estimate time-varying cost weights for a standard planar reaching task. MO-IRL provides orders-of-magnitude faster convergence than bilevel formulations, while using only a fraction of the available data, enabling the practical exploration of time-varying cost structures. Three levels of generality are evaluated: Subject-Dependent Posture-Dependent, Subject-Dependent Posture-Independent, and Subject-Independent Posture-Independent. Across all cases, time-varying weights substantially improve trajectory reconstruction, yielding an average 27% reduction in RMSE compared to the baseline. The inferred costs consistently highlight a dominant role for joint-acceleration regulation, complemented by smaller contributions from torque-change smoothness. Overall, a single subject- and posture-agnostic time-varying cost function is shown to predict human reaching trajectories with high accuracy, supporting the existence of a unified optimality principle governing this class of movements.
comment: 8 pages, 6 figures
Inverse Resistive Force Theory (I-RFT): Learning granular properties through robot-terrain physical interactions
For robots to navigate safely and efficiently on soft, granular terrains, it is crucial to gather information about the terrain's mechanical properties, which directly affect locomotion performance. Recent research has developed robotic legs that can accurately sense ground reaction forces during locomotion. However, existing tests of granular property estimation often rely on specific foot trajectories, such as vertical penetration or horizontal shear, limiting their applicability during natural locomotion. To address this limitation, we introduce a physics-informed machine learning framework, Inverse Resistive Force Theory (I-RFT), which integrates the Granular Resistive Force Theory model with Gaussian Processes to infer terrain properties from proprioceptively measured contact forces under arbitrary gait trajectories. By embedding the granular force model within the learning process, I-RFT preserves physical consistency while enabling generalization across diverse motion primitives. Experimental results demonstrate that I-RFT accurately estimates terrain properties across multiple gait trajectories and toe shapes. Moreover, we show that the quantified uncertainty over the terrain resistance stress map could enable robots to optimize foot design and gait trajectories for efficient information gathering. This approach establishes a new foundation for data-efficient characterization of complex granular environments and opens new avenues for locomotion strategies that actively adapt gait for autonomous terrain exploration.
A Robust Antenna Provides Tactile Feedback in a Multi-legged Robot
Multi-legged elongate robots hold promise for maneuvering through complex environments. Prior work has demonstrated that reliable locomotion can be achieved using open-loop body undulation and foot placement on rugose terrain. However, robust navigation through confined spaces remains challenging when body-environment contact is extensive and terrain rheology varies rapidly. To address this challenge, we develop a pair of tactile antennae for multi-legged robots that enable real-time sensing of surrounding geometry, modeling the morphology and function of biological centipede antennae. Each antenna features gradient compliance, with a stiff base and soft tip, allowing repeated deformation and elastic recovery. Robophysical experiments reveal a relationship between continuous antenna curvature and contact force, leading to a simplified mapping from antenna deformation to inferred discrete collision states. We incorporate this mapping into a controller that selects among a set of locomotor maneuvers based on the inferred collision state. Experiments in obstacle-rich and confined environments demonstrate that tactile feedback enables reliable steering and allows the robot to recover from near-stuck conditions without requiring global environmental information or real-time vision. These results highlight how mechanically tuned tactile appendages can simplify sensing and enhance autonomy in elongate multi-legged robots operating in constrained spaces.
Residual Control for Fast Recovery from Dynamics Shifts
Robotic systems operating in real-world environments inevitably encounter unobserved dynamics shifts during continuous execution, including changes in actuation, mass distribution, or contact conditions. When such shifts occur mid-episode, even locally stabilizing learned policies can experience substantial transient performance degradation. While input-to-state stability guarantees bounded state deviation, it does not ensure rapid restoration of task-level performance. We address inference-time recovery under frozen policy parameters by casting adaptation as constrained disturbance shaping around a nominal stabilizing controller. We propose a stability-aligned residual control architecture in which a reinforcement learning policy trained under nominal dynamics remains fixed at deployment, and adaptation occurs exclusively through a bounded additive residual channel. A Stability Alignment Gate (SAG) regulates corrective authority through magnitude constraints, directional coherence with the nominal action, performance-conditioned activation, and adaptive gain modulation. These mechanisms preserve the nominal closed-loop structure while enabling rapid compensation for unobserved dynamics shifts without retraining or privileged disturbance information. Across mid-episode perturbations including actuator degradation, mass variation, and contact changes, the proposed method consistently reduces recovery time relative to frozen and online-adaptation baselines while maintaining near-nominal steady-state performance. Recovery time is reduced by 87% on the Go1 quadruped, 48% on the Cassie biped, 30% on the H1 humanoid, and 20% on the Scout wheeled platform on average across evaluated conditions relative to a frozen SAC policy.
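The gating idea in the abstract above (a bounded additive residual around a frozen nominal action) can be sketched as follows. This is only an illustration under stated assumptions, not the paper's Stability Alignment Gate: the function name, the projection used for directional coherence, the thresholds, and the omission of adaptive gain modulation are all choices made here.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def gated_action(nominal, residual, perf_error, max_ratio=0.3, threshold=0.1):
    """Return the nominal action plus a gated residual correction.

    Assumed gate, applied in order:
      1. performance-conditioned activation: no correction while the
         task error stays below `threshold`;
      2. directional coherence: drop any residual component that
         opposes the nominal action;
      3. magnitude constraint: clip the residual norm to
         `max_ratio` times the nominal action norm.
    """
    if perf_error < threshold:
        return list(nominal)
    n_norm = _norm(nominal)
    r = list(residual)
    if n_norm > 0.0:
        along = _dot(r, nominal) / (n_norm ** 2)
        if along < 0.0:  # residual fights the nominal action
            r = [ri - along * ni for ri, ni in zip(r, nominal)]
    limit = max_ratio * n_norm
    r_norm = _norm(r)
    if r_norm > limit and r_norm > 0.0:
        r = [ri * limit / r_norm for ri in r]
    return [ni + ri for ni, ri in zip(nominal, r)]


# Residual partly opposes the nominal action and is too large.
corrected = gated_action([1.0, 0.0], [-1.0, 1.0], perf_error=1.0)
# Performance is nominal, so the gate stays closed.
untouched = gated_action([1.0, 0.0], [-1.0, 1.0], perf_error=0.0)
```

Because the residual is bounded relative to the nominal action and never opposes it, the frozen policy's behavior dominates whenever the gate is closed or the correction is small.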
Directing the Robot: Scaffolding Creative Human-AI-Robot Interaction
Robots are moving beyond industrial settings into creative, educational, and public environments where interaction is open-ended and improvisational. Yet much of human-AI-robot interaction remains framed around performance and efficiency, positioning humans as supervisors rather than collaborators. We propose a re-framing of AI interaction with robots as scaffolding: infrastructure that enables humans to shape robotic behaviour over time while remaining meaningfully in control. Through scenarios from creative practice, learning-by-teaching, and embodied interaction, we illustrate how humans can act as executive directors, defining intent and steering revisions, while AI mediates between human expression and robotic execution. We outline design and evaluation implications that foreground creativity, agency, and flow. Finally, we discuss open challenges in social, scalable, and mission-critical contexts. We invite the community to rethink interaction with robots and AI not as autonomy, but as sustained support for human creativity.
comment: 4 pages, 1 figure
AeroPlace-Flow: Language-Grounded Object Placement for Aerial Manipulators via Visual Foresight and Object Flow
Precise object placement remains underexplored in aerial manipulation, where most systems rely on predefined target coordinates and focus primarily on grasping and control. Specifying exact placement poses, however, is cumbersome in real-world settings, where users naturally communicate goals through language. In this work, we present AeroPlace-Flow, a training-free framework for language-grounded aerial object placement that unifies visual foresight with explicit 3D geometric reasoning and object flow. Given RGB-D observations of the object and the placement scene, along with a natural language instruction, AeroPlace-Flow first synthesizes a task-complete goal image using image editing models. The imagined configuration is then grounded into metric 3D space through depth alignment and object-centric reasoning, enabling the inference of a collision-aware object flow that transports the grasped object to a language and contact-consistent placement configuration. The resulting motion is executed via standard trajectory tracking for an aerial manipulator. AeroPlace-Flow produces executable placement targets without requiring predefined poses or task-specific training. We validate our approach through extensive simulation and real-world experiments, demonstrating reliable language-conditioned placement across diverse aerial scenarios with an average success rate of 75% on hardware.
C$^2$-Explorer: Contiguity-Driven Task Allocation with Connectivity-Aware Task Representation for Decentralized Multi-UAV Exploration
Efficient multi-UAV exploration under limited communication is severely bottlenecked by inadequate task representation and allocation. Previous task representations either impose heavy communication requirements for coordination or lack the flexibility to handle complex environments, often leading to inefficient traversal. Furthermore, short-horizon allocation strategies neglect spatiotemporal contiguity, causing non-contiguous assignments and frequent cross-region detours. To address this, we propose C$^2$-Explorer, a decentralized framework that constructs a connectivity graph to decompose disconnected unknown components into independent task units. We then introduce a contiguity-driven allocation formulation with a graph-based neighborhood penalty to discourage non-adjacent assignments, promoting more contiguous task sequences over time. Extensive simulation experiments show that C$^2$-Explorer consistently outperforms state-of-the-art (SOTA) baselines, reducing average exploration time by 43.1\% and path length by 33.3\%. Real-world flights further demonstrate the system's feasibility. The code will be released at https://github.com/Robotics-STAR-Lab/C2-Explorer
RoboPCA: Pose-centered Affordance Learning from Human Demonstrations for Robot Manipulation ICRA 2026
Understanding spatial affordances -- comprising the contact regions of object interaction and the corresponding contact poses -- is essential for robots to effectively manipulate objects and accomplish diverse tasks. However, existing spatial affordance prediction methods mainly focus on locating the contact regions while delegating the pose to independent pose estimation approaches, which can lead to task failures due to inconsistencies between predicted contact regions and candidate poses. In this work, we propose RoboPCA, a pose-centered affordance prediction framework that jointly predicts task-appropriate contact regions and poses conditioned on instructions. To enable scalable data collection for pose-centered affordance learning, we devise Human2Afford, a data curation pipeline that automatically recovers scene-level 3D information and infers pose-centered affordance annotations from human demonstrations. With Human2Afford, scene depth and the interaction object's mask are extracted to provide 3D context and object localization, while pose-centered affordance annotations are obtained by tracking object points within the contact region and analyzing hand-object interaction patterns to establish a mapping from the 3D hand mesh to the robot end-effector orientation. By integrating geometry-appearance cues through an RGB-D encoder and incorporating mask-enhanced features to emphasize task-relevant object regions into the diffusion-based framework, RoboPCA outperforms baseline methods on image datasets, simulation, and real robots, and exhibits strong generalization across tasks and categories.
comment: Accepted to ICRA 2026
UniUncer: Unified Dynamic Static Uncertainty for End to End Driving ICRA 2026
End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only $\sim$0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7\%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8\% and yields notable stage-two gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
comment: ICRA 2026
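Item (1) of the UniUncer abstract above replaces deterministic heads with Laplace regressors that output a location and a scale per vertex. Such a head is typically trained with the standard Laplace negative log-likelihood, log(2b) + |y - mu| / b. The sketch below shows that loss; the function names and the per-polyline averaging are assumptions and do not come from the paper.

```python
import math


def laplace_nll(target, loc, scale):
    """Negative log-likelihood of `target` under Laplace(loc, scale)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return math.log(2.0 * scale) + abs(target - loc) / scale


def polyline_nll(targets, locs, scales):
    """Mean NLL over all coordinates of a vectorized polyline.

    Each argument is a list of (x, y) pairs, one pair per vertex.
    """
    terms = [
        laplace_nll(t, m, b)
        for tv, mv, bv in zip(targets, locs, scales)
        for t, m, b in zip(tv, mv, bv)
    ]
    return sum(terms) / len(terms)


# Hypothetical two-vertex lane boundary with a confident, slightly-off prediction.
loss = polyline_nll(
    targets=[(0.0, 0.0), (1.0, 0.0)],
    locs=[(0.1, 0.0), (1.0, 0.2)],
    scales=[(0.5, 0.5), (0.5, 0.5)],
)
```

The log(2b) term penalizes large predicted scales, while |y - mu| / b penalizes errors more heavily when the predicted scale is small, so the scale output behaves as a learned per-vertex uncertainty.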
Low-Cost Teleoperation Extension for Mobile Manipulators
Teleoperation of mobile bimanual manipulators requires simultaneous control of high-dimensional systems, often necessitating expensive specialized equipment. We present an open-source teleoperation framework that enables intuitive whole body control using readily available commodity hardware. Our system combines smartphone-based head tracking for camera control, leader arms for bilateral manipulation, and foot pedals for hands-free base navigation. Using a standard smartphone with IMU and display, we eliminate the need for costly VR helmets while maintaining immersive visual feedback. The modular architecture integrates seamlessly with the XLeRobot framework, but can be easily adapted to other types of mobile manipulators. We validate our approach through user studies that demonstrate improved task performance and reduced cognitive load compared to keyboard-based control.
DAISS: Phase-Aware Imitation Learning for Dual-Arm Robotic Ultrasound-Guided Interventions
Imitation learning has shown strong potential for automating complex robotic manipulation. In medical robotics, ultrasound-guided needle insertion demands precise bimanual coordination, as clinicians must simultaneously manipulate an ultrasound probe to maintain an optimal acoustic view while steering an interventional needle. Automating this asymmetric workflow -- and reliably transferring expert strategies to robots -- remains highly challenging. In this paper, we present the Dual-Arm Interventional Surgical System (DAISS), a teleoperated platform that collects high-fidelity dual-arm demonstrations and learns a phase-aware imitation policy for ultrasound-guided interventions. To avoid constraining the operator's natural behavior, DAISS uses a flexible NDI-based leader interface for teleoperating two coordinated follower arms. To support robust execution under real-time ultrasound feedback, we develop a lightweight, data-efficient imitation policy. Specifically, the policy incorporates a phase-aware architecture and a dynamic mask loss tailored to asymmetric bimanual control. Conditioned on a planned trajectory, the network fuses real-time ultrasound with external visual observations to generate smooth, coordinated dual-arm motions. Experimental results show that DAISS can learn personalized expert strategies from limited demonstrations. Overall, these findings highlight the promise of phase-aware imitation-learning-driven dual-arm robots for improving precision and reducing cognitive workload in image-guided interventions.
comment: 8 pages, 8 figures
Multi-Agent Off-World Exploration for Sparse Evidence Discovery via Gaussian Belief Mapping and Dual-Domain Coverage
Off-world multi-robot exploration is challenged by sparse targets, limited sensing, hazardous terrain, and restricted communication. Many scientifically valuable clues are visually ambiguous and often require close-range observations, making efficient and safe informative path planning essential. Existing methods often rely on predefined areas of interest (AOIs), which may be incomplete or biased, and typically handle terrain risk only through soft penalties, which are insufficient for avoiding non-recoverable regions. To address these issues, we propose a multi-agent informative path planning framework for sparse evidence discovery based on Gaussian belief mapping and dual-domain coverage. The method maintains Gaussian-process-based interest and risk beliefs and combines them with trajectory-intent representations to support coordinated sequential decision-making among multiple agents. It further prioritizes search inside the AOI while preserving limited exploration outside it, thereby improving robustness to AOI bias. In addition, the risk-aware design helps agents balance information gain and operational safety in hazardous environments. Experimental results in simulated lunar environments show that the proposed method consistently outperforms sampling-based and greedy baselines under different budgets and communication ranges. In particular, it achieves lower final uncertainty in risk-aware settings and remains robust under limited communication, demonstrating its effectiveness for cooperative off-world robotic exploration.
AtomicVLA: Unlocking the Potential of Atomic Skill Learning in Robots CVPR2026
Recent advances in Visual-Language-Action (VLA) models have shown promising potential for robotic manipulation tasks. However, real-world robotic tasks often involve long-horizon, multi-step problem-solving and require generalization for continual skill acquisition, extending beyond single actions or skills. These challenges present significant barriers for existing VLA models, which use monolithic action decoders trained on aggregated data, resulting in poor scalability. To address these challenges, we propose AtomicVLA, a unified planning-and-execution framework that jointly generates task-level plans, atomic skill abstractions, and fine-grained actions. AtomicVLA constructs a scalable atomic skill library through a Skill-Guided Mixture-of-Experts (SG-MoE), where each expert specializes in mastering generic yet precise atomic skills. Furthermore, we introduce a flexible routing encoder that automatically assigns dedicated atomic experts to new skills, enabling continual learning. We validate our approach through extensive experiments. In simulation, AtomicVLA outperforms $π_{0}$ by 2.4\% on LIBERO, 10\% on LIBERO-LONG, and outperforms $π_{0}$ and $π_{0.5}$ by 0.22 and 0.25 in average task length on CALVIN. Additionally, our AtomicVLA consistently surpasses baselines by 18.3\% and 21\% in real-world long-horizon tasks and continual learning. These results highlight the effectiveness of atomic skill abstraction and dynamic expert composition for long-horizon and lifelong robotic tasks. The project page is available at https://zhanglk9.github.io/atomicvla-web/.
comment: Accepted by CVPR2026
TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation
Pretrained Vision-Language-Action (VLA) policies have achieved strong single-step manipulation, but their inference remains largely memoryless, which is brittle in non-Markovian long-horizon settings with occlusion, state aliasing, and subtle post-action changes. Prior approaches inject history either by stacking frames, which scales visual tokens and latency while adding near-duplicate pixels, or by learning additional temporal interfaces that require (re-)training and may break the original single-frame inference graph. We present TempoFit, a training-free temporal retrofit that upgrades frozen VLAs through state-level memory. Our key insight is that prefix attention K/V already form a model-native, content-addressable runtime state; reusing them across timesteps introduces history without new tokens or trainable modules. TempoFit stores layer-wise FIFO prefix K/V at selected intermediate layers, performs parameter-free K-to-K retrieval with Frame-Gap Temporal Bias (FGTB), a fixed recency bias inspired by positional biases in NLP, to keep decisions present-dominant, and injects the retrieved context via pre-attention residual loading with norm-preserving rescaling to avoid distribution shift under frozen weights. On LIBERO-LONG, TempoFit improves strong pretrained backbones by up to +4.0% average success rate while maintaining near-real-time latency, and it transfers consistently to CALVIN and real-robot long-horizon tasks.
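The retrieval step described in the TempoFit abstract above (a FIFO store of past keys and values, key-to-key similarity, and a fixed recency bias) can be illustrated with a toy, single-vector version. This is a sketch under stated assumptions, not the paper's method: the class name, the linear frame-gap penalty, and the use of one key and value vector per frame are simplifications, and the residual injection with norm-preserving rescaling is omitted.

```python
import math
from collections import deque


class TemporalKVMemory:
    """Toy FIFO memory of (key, value) vectors, one pair per past frame."""

    def __init__(self, capacity=4, recency_bias=0.5):
        self.frames = deque(maxlen=capacity)  # oldest frame is dropped first
        self.recency_bias = recency_bias

    def push(self, key, value):
        self.frames.append((list(key), list(value)))

    def retrieve(self, query_key):
        """Softmax over scaled key-key similarity minus a frame-gap penalty."""
        if not self.frames:
            return None
        scale = math.sqrt(len(query_key))
        newest = len(self.frames) - 1
        logits = []
        for idx, (key, _) in enumerate(self.frames):
            sim = sum(q * k for q, k in zip(query_key, key)) / scale
            gap = newest - idx + 1  # 1 for the most recently stored frame
            logits.append(sim - self.recency_bias * gap)
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        dim = len(self.frames[0][1])
        return [
            sum(w * value[d] for w, (_, value) in zip(weights, self.frames))
            for d in range(dim)
        ]


memory = TemporalKVMemory(capacity=2, recency_bias=1.0)
for step in range(3):
    memory.push(key=[1.0, 0.0], value=[float(step)])
context = memory.retrieve([1.0, 0.0])
```

With identical keys, the retrieved context is pulled toward the newer frame purely by the frame-gap penalty, which is the present-dominant behavior a recency bias is meant to provide. No parameters are learned anywhere in the sketch.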
PanoDP: Learning Collision-Free Navigation with Panoramic Depth and Differentiable Physics
Autonomous collision-free navigation in cluttered environments requires safe decision-making under partial observability with both static structure and dynamic obstacles. We present PanoDP, a communication-free learning framework that combines four-view panoramic depth perception with differentiable-physics-based training signals. PanoDP encodes panoramic depth using a lightweight CNN and optimizes policies with dense differentiable collision and motion-feasibility terms, improving training stability beyond sparse terminal collisions. We evaluate PanoDP on a controlled ring-to-center benchmark with systematic sweeps over agent count, obstacle density/layout, and dynamic behaviors, and further test out-of-distribution generalization in an external simulator (e.g., AirSim). Across settings, PanoDP increases collision-free and completion rates over single-view and non-physics-guided baselines under matched training budgets, and ablations (view masking, rotation augmentation) confirm the policy leverages 360-degree information. Code will be open-sourced upon acceptance.
Exoskeleton Control through Learning to Reduce Biological Joint Moments in Simulations
Data-driven joint-moment predictors offer a scalable alternative to laboratory-based inverse-dynamics pipelines for biomechanics estimation and exoskeleton control. Meanwhile, physics-based reinforcement learning (RL) enables simulation-trained controllers to learn dynamics-aware assistance strategies without extensive human experimentation. However, quantitative verification of simulation-trained exoskeleton torque predictors, and their impact on human joint power injection, remains limited. This paper presents (1) an RL framework to learn exoskeleton assistance policies that reduce biological joint moments, and (2) a validation pipeline that verifies the trained control networks using an open-source gait dataset through inference and comparison with biological joint moments. Simulation-trained multilayer perceptron (MLP) controllers are developed for level-ground and ramp walking, mapping short-horizon histories of bilateral hip and knee kinematics to normalized assistance torques. Results show that predicted assistance preserves task-intensity trends across speeds and inclines. Agreement is particularly strong at the hip, with cross-correlation coefficients reaching 0.94 at 1.8 m/s and 0.98 during 5° decline walking, demonstrating near-matched temporal structure. Discrepancies increase at higher speeds and steeper inclines, especially at the knee, and are more pronounced in joint power comparisons. Delay tuning biases assistance toward greater positive power injection; modest timing shifts increase positive power and improve agreement in specific gait intervals. Together, these results establish a quantitative validation framework for simulation-trained exoskeleton controllers, demonstrate strong sim-to-data consistency at the torque level, and highlight both the promise and the remaining challenges for sim-to-real transfer.
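The abstract above reports cross-correlation coefficients between predicted assistance and biological joint moments, and notes that modest timing shifts change the agreement. One plausible way to compute such a score is a Pearson correlation maximized over a small range of circular lags, sketched below. The exact metric used in the paper is not specified here, so the function names, the circular shift (which assumes profiles sampled over one periodic gait cycle), and the synthetic sine profiles are all assumptions.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def best_lag_correlation(pred, ref, max_lag):
    """Search circular shifts of `pred` for the best match with `ref`.

    Returns (lag, coefficient); a positive lag means `pred` is delayed
    relative to `ref` by that many samples.
    """
    n = len(pred)
    best = (0, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        shifted = [pred[(i + lag) % n] for i in range(n)]
        coeff = correlation(shifted, ref)
        if coeff > best[1]:
            best = (lag, coeff)
    return best


# Synthetic gait-cycle profiles: the prediction lags the reference by 5 samples.
N = 100
reference = [math.sin(2.0 * math.pi * i / N) for i in range(N)]
predicted = [reference[(i - 5) % N] for i in range(N)]
lag, coeff = best_lag_correlation(predicted, reference, max_lag=10)
```

Reporting the lag alongside the coefficient separates disagreement in waveform shape from disagreement in timing.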
GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid Locomotion
The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme. This explicit regularization forces the high-dimensional latent space to strictly align with the physical terrain geometry, ensuring robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco achieves robust zero-shot transfer to the Unitree G1 humanoid and successfully negotiates challenging terrains.
comment: 8 pages, 6 figures, conference
SMAT: Staged Multi-Agent Training for Co-Adaptive Exoskeleton Control
Effective exoskeleton assistance requires co-adaptation: as the device alters joint dynamics, the user reorganizes neuromuscular coordination, creating a non-stationary learning problem. Most learning-based approaches do not explicitly account for the sequential nature of human motor adaptation, leading to training instability and poorly timed assistance. We propose Staged Multi-Agent Training (SMAT), a four-stage curriculum designed to mirror how users naturally acclimate to a wearable device. In SMAT, a musculoskeletal human actor and a bilateral hip exoskeleton actor are trained progressively: the human first learns unassisted gait, then adapts to the added device mass; the exoskeleton subsequently learns a positive assistance pattern against a stabilized human policy, and finally both agents co-adapt with full torque capacity and bidirectional feedback. We implement SMAT in the MyoAssist simulation environment using a 26-muscle lower-limb model and an attached hip exoskeleton. Our musculoskeletal simulations demonstrate that the learned exoskeleton control policy produces an average 10.1% reduction in hip muscle activation relative to the no-assist condition. We validated the learned controller in an offline setting using open-source gait data, then deployed it to a physical hip exoskeleton for treadmill experiments with five subjects. The resulting policy delivers consistent assistance and predominantly positive mechanical power without the need for any explicitly imposed timing shift (mean positive power: 13.6 W at 6 Nm RMS torque to 23.8 W at 9.3 Nm RMS torque, with minimal negative power) consistently across all subjects without subject-specific retraining.
Model-Based and Neural-Aided Approaches for Dog Dead Reckoning
Modern canine applications span medical and service roles, while robotic legged dogs serve as autonomous platforms for high-risk industrial inspection, disaster response, and search and rescue operations. For both, accurate positioning remains a significant challenge due to the cumulative drift inherent in inertial sensing. To bridge this gap, we propose three algorithms for accurate positioning using only inertial sensors, collectively referred to as dog dead reckoning (DDR). To evaluate our approaches, we designed DogMotion, a wearable unit for canine data recording. Using DogMotion, we recorded a dataset of 13 minutes. Additionally, we utilized a robotic legged dog dataset with a duration of 116 minutes. Across the two distinct datasets we demonstrate that our neural-aided methods consistently outperform model-based approaches, achieving an absolute distance error of less than 10%. Consequently, we provide a lightweight and low-cost positioning solution for both biological and legged robotic dogs. To support reproducibility, our codebase and associated datasets have been made publicly available.
FeasibleCap: Real-Time Embodiment Constraint Guidance for In-the-Wild Robot Demonstration Collection
Gripper-in-hand data collection decouples demonstration acquisition from robot hardware, but whether a trajectory is executable on the target robot remains unknown until a separate replay-and-validate stage. Failed demonstrations therefore inflate the effective cost per usable trajectory through repeated collection, diagnosis, and validation. Existing collection-time feedback systems mitigate this issue but rely on head-worn AR/VR displays, robot-in-the-loop hardware, or learned dynamics models; real-time executability feedback has not yet been integrated into the gripper-in-hand data collection paradigm. We present FeasibleCap, a gripper-in-hand data collection system that brings real-time executability guidance into robot-free capture. At each frame, FeasibleCap checks reachability, joint-rate limits, and collisions against a target robot model and closes the loop through on-device visual overlays and haptic cues, allowing demonstrators to correct motions during collection without learned models, headsets, or robot hardware. On pick-and-place and tossing tasks, FeasibleCap improves replay success and reduces the fraction of infeasible frames, with the largest gains on tossing. Simulation experiments further indicate that enforcing executability constraints during collection does not sacrifice cross-embodiment transfer across robot platforms. Hardware designs and software are available at https://github.com/aod321/FeasibleCap.
Approximate Imitation Learning for Event-based Quadrotor Flight in Cluttered Environments
Event cameras offer high temporal resolution and low latency, making them ideal sensors for high-speed robotic applications where conventional cameras suffer from image degradations such as motion blur. In addition, their low power consumption can enhance endurance, which is critical for resource-constrained platforms. Motivated by these properties, we present a novel approach that enables a quadrotor to fly through cluttered environments at high speed by perceiving the environment with a single event camera. Our proposed method employs an end-to-end neural network trained to map event data directly to control commands, eliminating the reliance on standard cameras. To enable efficient training in simulation, where rendering synthetic event data is computationally expensive, we propose Approximate Imitation Learning, a novel imitation learning framework. Our approach leverages a large-scale offline dataset to learn a task-specific representation space. Subsequently, the policy is trained through online interactions that rely solely on lightweight, simulated state information, eliminating the need to render events during training. This enables the efficient training of event-based control policies for fast quadrotor flight, highlighting the potential of our framework for other modalities where data simulation is costly or impractical. Our approach outperforms standard imitation learning baselines in simulation and demonstrates robust performance in real-world flight tests, achieving speeds up to 9.8 m/s in cluttered environments.
ReconDrive: Fast Feed-Forward 4D Gaussian Splatting for Autonomous Driving Scene Reconstruction
High-fidelity visual reconstruction and novel-view synthesis are essential for realistic closed-loop evaluation in autonomous driving. While 4D Gaussian Splatting (4DGS) offers a promising balance of accuracy and efficiency, existing per-scene optimization methods require costly iterative refinement, rendering them unscalable for extensive urban environments. Conversely, current feed-forward approaches often suffer from degraded photometric quality. To address these limitations, we propose ReconDrive, a feed-forward framework that leverages and extends the 3D foundation model VGGT for rapid, high-fidelity 4DGS generation. Our architecture introduces two core adaptations to tailor the foundation model to dynamic driving scenes: (1) Hybrid Gaussian Prediction Heads, which decouple the regression of spatial coordinates and appearance attributes to overcome the photometric deficiencies inherent in generalized foundation features; and (2) a Static-Dynamic 4D Composition strategy that explicitly captures temporal motion via velocity modeling to represent complex dynamic environments. Benchmarked on nuScenes, ReconDrive significantly outperforms existing feed-forward baselines in reconstruction, novel-view synthesis, and 3D perception. It achieves performance competitive with per-scene optimization while being orders of magnitude faster, providing a scalable and practical solution for realistic driving simulation.
ACCURATE: Arbitrary-shaped Continuum Reconstruction Under Robust Adaptive Two-view Estimation
Accurate reconstruction of arbitrary-shaped long slender continuum bodies, such as guidewires, catheters and other soft continuum manipulators, is essential for accurate mechanical simulation. However, existing image-based reconstruction approaches often suffer from limited accuracy because they underutilize camera geometry, or lack generality because they rely on rigid geometric assumptions that may fail for continuum robots with complex and highly deformable shapes. To address these limitations, we propose ACCURATE, a 3D reconstruction framework integrating an image segmentation neural network with a geometry-constrained topology traversal and dynamic programming algorithm that enforces global biplanar geometric consistency, minimizes the cumulative point-to-epipolar-line distance, and remains robust to occlusions and to epipolar ambiguities caused by noise and discretization. Our method achieves high reconstruction accuracy on both simulated and real phantom datasets acquired using a clinical X-ray C-arm system, with mean absolute errors below 1.0 mm.
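The point-to-epipolar-line distance that the abstract says is minimized cumulatively can be illustrated with the standard two-view formula. This is a minimal sketch of that single cost term only, not the authors' traversal or dynamic programming algorithm; the function name `epipolar_distance` and the fundamental matrix `F_RECTIFIED` are hypothetical and chosen for an idealized rectified image pair.

```python
import math


def epipolar_distance(F, x1, x2):
    """Distance from point x2 (view 2) to the epipolar line of x1 (view 1).

    F is a 3x3 fundamental matrix given as nested lists; x1 and x2 are
    (u, v) pixel coordinates. The epipolar line in view 2 is l = F @ [u, v, 1],
    with coefficients (a, b, c) of the line a*x + b*y + c = 0.
    """
    u, v = x1
    a, b, c = (row[0] * u + row[1] * v + row[2] for row in F)
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * x2[0] + b * x2[1] + c) / norm


# Assumed example: fundamental matrix of a rectified pair (pure horizontal
# translation), for which epipolar lines are image rows (y = v).
F_RECTIFIED = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
]

# A candidate correspondence three rows off the epipolar line.
d = epipolar_distance(F_RECTIFIED, (10.0, 20.0), (35.0, 23.0))
```

Summing this distance over candidate point pairs along the body gives one plausible form of the cumulative cost that a matching procedure could minimize.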
ICLR: In-Context Imitation Learning with Visual Reasoning ICLR
In-context imitation learning enables robots to adapt to new tasks from a small number of demonstrations without additional training. However, existing approaches typically condition only on state-action trajectories and lack explicit representations of task intent. This limitation hinders performance in complex and ambiguous task settings where the same actions may be consistent with different objectives. To address this, we present In-Context Imitation Learning with Visual Reasoning (ICLR), a novel framework that augments demonstration prompts with structured visual reasoning traces representing anticipated future robot trajectories in image space. ICLR also jointly learns to generate reasoning traces and low-level actions within a unified autoregressive transformer, enabling the model to mimic not only action prediction but also the reasoning process that leads to those actions. We extensively evaluate ICLR in both simulation and real-world manipulation tasks and demonstrate consistent improvements in success rates and generalization to unseen tasks and novel object configurations compared to other in-context imitation learning methods. These results suggest that incorporating embodied visual reasoning represents a promising direction for enhancing the robustness and generalization of robotic in-context learning systems.
comment: Project website: https://toannguyen1904.github.io/ICLR
InterReal: A Unified Physics-Based Imitation Framework for Learning Human-Object Interaction Skills
Interaction is one of the core abilities of humanoid robots. However, most existing frameworks focus on non-interactive whole-body control, which limits their practical applicability. In this work, we develop InterReal, a unified physics-based imitation learning framework for Real-world human-object Interaction (HOI) control. InterReal enables humanoid robots to track HOI reference motions, facilitating the learning of fine-grained interactive skills and their deployment in real-world settings. Within this framework, we first introduce a HOI motion data augmentation scheme with hand-object contact constraints, and utilize the augmented motions to improve policy stability under object perturbations. Second, we propose an automatic reward learner to address the challenge of large-scale reward shaping. A meta-policy guided by critical tracking error metrics explores and allocates reward signals to the low-level reinforcement learning objective, which enables more effective learning of interactive policies. Experiments on HOI tasks of box-picking and box-pushing demonstrate that InterReal achieves the best tracking accuracy and the highest task success rate compared to recent baselines. Furthermore, we validate the framework on the real-world robot Unitree G1, which demonstrates its practical effectiveness and robustness beyond simulation.
Inverse-dynamics observer design for a linear single-track vehicle model with distributed tire dynamics
Accurate estimation of the vehicle's sideslip angle and tire forces is essential for enhancing safety and handling performances in unknown driving scenarios. To this end, the present paper proposes an innovative observer that combines a linear single-track model with a distributed representation of the tires and information collected from standard sensors. In particular, by adopting a comprehensive representation of the tires in terms of hyperbolic partial differential equations (PDEs), the proposed estimation strategy exploits dynamical inversion to reconstruct the lumped and distributed vehicle states solely from yaw rate and lateral acceleration measurements. Simulation results demonstrate the effectiveness of the observer in estimating the sideslip angle and tire forces even in the presence of noise and model uncertainties.
comment: 6 pages, 5 figures. Accepted at ECC 2026
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification ICLR 2026
Verifiers--functions assigning rewards to agent behavior--have been key to AI progress in math, code, and games. However, extending gains to domains without clear-cut success criteria remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) offer a promising solution, given their world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior--a phenomenon we term agreement bias. This bias is pervasive, resilient to test-time scaling, and can harm applications relying on MLLM judgments/rewards (e.g., self-improvement, steering, online supervision). We discuss several considerations for evaluating and designing MLLM verifiers, and introduce SGV, a lightweight method that better leverages their capabilities by modulating (un)conditional generation. First, an MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp. In self-improvement and online supervision, they boost task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena--surpassing the previous state of the art by 20pp. As a byproduct, we release an update of VisualWebArena featuring strong agent baselines, more human-aligned oracles, container parallelism with high fidelity and proper resets, >10x speedups, and VWA-Lite, a 1/3 subset with comparable evaluation fidelity.
comment: ICLR 2026. Code, models, and data publicly available at https://mshalimay.github.io/agreement-bias-sgv/
MEM: Multi-Scale Embodied Memory for Vision Language Action Models
Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
comment: Website: https://pi.website/research/memory
Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly
Whole-brain biological neural networks naturally support the learning and control of whole-body movements. However, the use of brain connectomes as neural network controllers in embodied reinforcement learning remains unexplored. We investigate using the exact neural architecture of an adult fruit fly's brain for the control of its body movement. We develop Fly-connectomic Graph Model (FlyGM), whose static structure is identical to the complete connectome of an adult Drosophila for whole-body locomotion control. To perform dynamical control, FlyGM represents the static connectome as a directed message-passing graph to impose a biologically grounded information flow from sensory inputs to motor outputs. Integrated with a biomechanical fruit fly model, our method achieves stable control across diverse locomotion tasks without task-specific architectural tuning. To verify the structural advantages of the connectome-based model, we compare it against a degree-preserving rewired graph, a random graph, and multilayer perceptrons, showing that FlyGM yields higher sample efficiency and superior performance. This work demonstrates that static brain connectomes can be transformed to instantiate effective neural policy for embodied learning of movement control.
Ego-Vision World Model for Humanoid Contact Planning
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Code and dataset are available at our website: https://ego-vcp.github.io/
IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models
Motion planning involves determining a sequence of robot configurations to reach a desired pose, subject to movement and safety constraints. Traditional motion planning finds collision-free paths, but this is overly restrictive in clutter, where it may not be possible for a robot to accomplish a task without contact. In addition, contacts range from relatively benign (e.g. brushing a soft pillow) to more dangerous (e.g. toppling a glass vase), making it difficult to characterize which may be acceptable. In this paper, we propose IMPACT, a novel motion planning framework that uses Vision-Language Models (VLMs) to infer environment semantics, identifying which parts of the environment can best tolerate contact based on object properties and locations. Our approach generates an anisotropic cost map that encodes directional push safety. We pair this map with a contact-aware A* planner to find stable contact-rich paths. We perform experiments using 20 simulation and 10 real-world scenes and assess using task success rate, object displacements, and feedback from human evaluators. Our results over 3200 simulation and 200 real-world trials suggest that IMPACT enables efficient contact-rich motion planning in cluttered settings while outperforming alternative methods and ablations. Our project website is available at https://impact-planning.github.io/.
GeoAware-VLA: Implicit Geometry Aware Vision-Language-Action Model
Vision-Language-Action (VLA) models often fail to generalize to unseen camera viewpoints, a limitation stemming from their difficulty in inferring robust 3D geometry from 2D images. We introduce GeoAware-VLA, a simple yet effective approach that enhances viewpoint invariance by integrating strong geometric priors into the vision backbone. Instead of training a visual encoder or relying on explicit 3D data, we leverage a frozen, pretrained geometric vision model as a feature extractor. A lightweight, trainable projection layer then adapts these geometrically-rich features for the policy decoder, relieving it of the burden of learning 3D consistency from scratch. Through extensive evaluations on the LIBERO and CALVIN benchmarks, we show that GeoAware-VLA preserves and even improves in-distribution performance while achieving substantial gains in zero-shot generalization to unseen camera poses, improving unseen-view success rates by an average of 35 percentage points on LIBERO and over 11 percentage points on CALVIN compared to their respective baselines. Crucially, these gains transfer to the physical world, where our model shows significant improvement on a real robotic platform. Our approach proves effective across both continuous and discrete action spaces, highlighting that robust geometric grounding is a key ingredient for building more generalizable robotic agents.
comment: Under Review, Project Page https://alisharey.github.io/GeoAware-VLA/
OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language
Aerial imagery provides essential global context for autonomous navigation, enabling route planning at scales inaccessible to onboard sensing. We address the problem of generating global costmaps for long-range planning directly from satellite imagery when entities and mission-specific traversal rules are expressed in natural language at test time. This setting is challenging since mission requirements vary, terrain entities may be unknown at deployment, and user prompts often encode compositional traversal logic. Existing approaches relying on fixed ontologies and static cost mappings cannot accommodate such flexibility. While foundation models excel at language interpretation and open-vocabulary perception, no single model can simultaneously parse nuanced mission directives, locate arbitrary entities in large-scale imagery, and synthesize them into an executable cost function for planners. We therefore propose OVerSeeC, a zero-shot modular framework that decomposes the problem into Interpret-Locate-Synthesize: (i) an LLM extracts entities and ranked preferences, (ii) an open-vocabulary segmentation pipeline identifies these entities from high-resolution imagery, and (iii) the LLM uses the user's natural language preferences and masks to synthesize executable costmap code. Empirically, OVerSeeC handles novel entities, respects ranked and compositional preferences, and produces routes consistent with human-drawn trajectories across diverse regions, demonstrating robustness to distribution shifts. This shows that modular composition of foundation models enables open-vocabulary, preference-aligned costmap generation for scalable, mission-adaptive global planning.
comment: Website : https://amrl.cs.utexas.edu/overseec/
Stable Multi-Drone GNSS Tracking System for Marine Robots
Stable and accurate tracking is essential for marine robotics, yet Global Navigation Satellite System (GNSS) signals vanish immediately below the sea surface. Traditional alternatives suffer from error accumulation, high computational demands, or infrastructure dependence. In this work, we present a multi-drone GNSS-based tracking system for surface and near-surface marine robots. Our approach combines efficient visual detection, lightweight multi-object tracking, GNSS-based triangulation, and a confidence-weighted Extended Kalman Filter (EKF) to provide stable GNSS estimation in real time. We further introduce a cross-drone tracking ID alignment algorithm that enforces global consistency across views, enabling robust multi-robot tracking with cooperative aerial coverage. We validate our system in diversified complex settings to show the accuracy and robustness of the proposed algorithm.
DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models
Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
comment: 8 pages, 6 tables, 3 figures. Under review
Holistic Optimization of Modular Robots
Modular robots have the potential to revolutionize automation, as one can optimize their composition for any given task. However, finding optimal compositions is non-trivial. In addition, different compositions require different base positions and trajectories to fully use the potential of modular robots. We address this problem holistically for the first time by jointly optimizing the composition, base placement, and trajectory to minimize the cycle time of a given task. Our approach is evaluated on over 300 industrial benchmarks requiring point-to-point movements. Overall, we reduce cycle time by up to 25 % and find feasible solutions in twice as many benchmarks compared to optimizing the module composition alone. In the first real-world validation of modular robots optimized for point-to-point movement, we find that the optimized robot is successfully deployed in nine out of ten cases in less than an hour.
comment: 14 Pages, 6 figures, 8 tables. Please find and reference the open-access published version at https://ieeexplore.ieee.org/document/11227125
Smart placement, faster robots-a comparison of algorithms for robot base-pose optimization
Robotic automation is a key technology that increases the efficiency and flexibility of manufacturing processes. However, one of the challenges in deploying robots in novel environments is finding the optimal base pose for the robot, which affects its reachability and deployment cost. Yet, existing research on automatically optimizing the base pose of robots has not been compared. We address this problem by optimizing the base pose of industrial robots with Bayesian optimization (BO), exhaustive search (ES), genetic algorithms (GAs), and stochastic gradient descent (SGD), and we find that all algorithms can reduce the cycle time for various evaluated tasks in synthetic and real-world environments. Stochastic gradient descent shows superior performance with regard to the success rate, solving more than 90% of our real-world tasks, while genetic algorithms show the lowest final costs. All benchmarks and implemented methods are available as baselines against which novel approaches can be compared.
comment: 10 pages, 3 Figures, 1 Table. Find visualizations and source code at https://cobra.cps.cit.tum.de/tools/rbo. Supplementary Tables can be found at https://www.frontiersin.org/journals/manufacturing-technology/articles/10.3389/fmtec.2025.1642524/full
ReViP: Mitigating False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance
Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with vision-language features, resulting in state-dominant bias and false completions despite visible execution failures. We systematically analyze this failure mode, attributing it to modality imbalance, where policies overly rely on internal state progression and underuse visual evidence. To address this, we introduce the first False-Completion Benchmark Suite, featuring eight tasks with three controlled perturbations (Object Drop, Distractor Swap, Relayout) to comprehensively evaluate false completion. Moreover, we propose ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. The key insight is to introduce auxiliary progress-aware visual cues to adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, progress-aware visual cues are extracted by an external Task-Stage Observer, which performs task-relevant reasoning on real-time observations to drive task-stage feature-wise linear modulation, enhancing environmental awareness and mitigating state-driven errors. Extensive experiments show that ReViP effectively mitigates false completion and improves success rates over strong VLA baselines, achieving a 26% gain over the $π_0$ model on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
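Feature-wise linear modulation, which the abstract names as the mechanism driven by the Task-Stage Observer, scales and shifts each feature channel with conditioning-dependent parameters. The sketch below shows only that generic operation; how ReViP produces the parameters is not described in the abstract, so `gamma` and `beta` here are hypothetical hand-picked values and `film` is an illustrative name.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation: y_i = gamma_i * x_i + beta_i."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have equal length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical per-channel parameters; in a real model these would be
# predicted from a conditioning signal such as a task-stage embedding.
features = [1.0, 2.0, 3.0]
gamma = [2.0, 0.5, 1.0]
beta = [0.0, 1.0, -1.0]
modulated = film(features, gamma, beta)
```

With `gamma` set to ones and `beta` to zeros the operation is the identity, so the conditioning signal can leave features untouched when no adjustment is needed.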
They See Me Rolling: High-Speed Event Vision-Based Tactile Roller Sensor for Large Surface Inspection
Inspecting large-scale industrial surfaces like aircraft fuselages for quality control requires capturing their precise 3D surface geometry at high resolution. Vision-based tactile sensors (VBTSs) offer high local resolution but require slow 'press-and-lift' measurements stitched for large areas. Approaches with sliding or roller/belt VBTS designs provide measurement continuity. However, each faces significant challenges: sliding designs struggle with friction and wear, and both approaches are speed-limited by conventional camera frame rates and motion blur, making large-area scanning time-consuming. Thus, a rapid, continuous, high-resolution method is needed. We introduce a novel tactile sensor integrating a neuromorphic camera in a rolling mechanism to achieve this. Leveraging its high temporal resolution and robustness to motion blur, our system uses a modified event-based multi-view stereo approach for 3D reconstruction. We demonstrate state-of-the-art scanning speeds up to 0.5 m/s, achieving Mean Absolute Error below 100 microns -- 11 times faster than prior continuous tactile sensing methods. A multi-reference Bayesian fusion strategy enhances accuracy (reducing MAE by 25.2% compared to EMVS) and mitigates curvature errors. We also validate high-speed feature recognition via Braille reading 2.6 times faster than previous approaches.
comment: Accepted to IEEE T-RO - Project Page: https://akramekhairi.github.io/TheySeeMeRolling/
A Robust Placeability Metric for Model-Free Unified Pick-and-Place Reasoning
Reliable manipulation of previously unseen objects remains a fundamental challenge for autonomous robotic systems operating in unstructured environments. In particular, robust pick-and-place planning directly from noisy, partial real-world observations, where object surfaces are inherently incomplete due to occlusions (e.g., bottom faces on a tabletop), is difficult. As a result, many existing methods rely on strong object priors (e.g., CAD models) or assume placement on continuous, flat support surfaces such as planar tabletops, without explicitly accounting for edge proximity or inclined supports. In this work, we introduce a robust probabilistic placeability metric that evaluates 6D object placement poses from partial observations by jointly scoring object stability, graspability, and clearance from raw point cloud geometry. Using this metric, we generate diverse multi-orientation placement candidates and condition grasp scoring on these placements, enabling model-free unified pick-and-place reasoning. Simulation and real-robot experiments on unseen objects and challenging support geometries confirm that our metric yields accurate stability predictions and consistently improves end-to-end pick-and-place success by producing stable, collision-free grasp-place pairs directly from partial point clouds.
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but suffers performance degradation as observation horizons increase, limiting long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that compress long-horizon observations into a fixed-size representation while filtering irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and enables scalable horizon extension with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across both CNN and Transformer backbones, SeedPolicy achieves 36.8% relative improvement in clean settings and 169% relative improvement in randomized challenging settings over the DP. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves competitive performance with one to two orders of magnitude fewer parameters, demonstrating strong efficiency and scalability. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: https://github.com/Youqiang-Gui/SeedPolicy.
comment: 16 pages, 13 figures
ORN-CBF: Learning Observation-conditioned Residual Neural Control Barrier Functions via Hypernetworks
Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.
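To make the idea of a CBF safety filter concrete, here is a minimal sketch for a one-dimensional single integrator x' = u with the hand-written barrier h(x) = x_max - x. The toy barrier, the gain `alpha`, and all function names are assumptions for illustration. The paper learns an observation-conditioned neural CBF, which this sketch does not attempt to reproduce.

```python
def barrier(x: float, x_max: float) -> float:
    """Hand-written barrier h(x) = x_max - x; the safe set is h(x) >= 0."""
    return x_max - x


def safe_input(x: float, u_nominal: float, x_max: float, alpha: float = 1.0) -> float:
    """Closest input to u_nominal that satisfies the CBF condition.

    For x' = u we have dh/dt = -u, so dh/dt >= -alpha * h(x) becomes
    u <= alpha * h(x). With a single scalar constraint, the usual
    quadratic program reduces to a clip.
    """
    return min(u_nominal, alpha * barrier(x, x_max))


def simulate(x0: float, u_nominal: float, x_max: float,
             dt: float = 0.01, steps: int = 1000, alpha: float = 1.0) -> float:
    """Forward-Euler rollout of the filtered system; returns the final state."""
    x = x0
    for _ in range(steps):
        x += dt * safe_input(x, u_nominal, x_max, alpha)
    return x


# A nominal controller that pushes hard toward the unsafe region x > 1.
final_state = simulate(x0=0.0, u_nominal=5.0, x_max=1.0)
```

In this toy setting the filtered state approaches the boundary without crossing it, provided `alpha * dt <= 1` for the Euler discretization.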
Multiagent Systems
Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification ICLR 2026
Verifiers--functions assigning rewards to agent behavior--have been key to AI progress in math, code, and games. However, extending gains to domains without clear-cut success criteria remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) offer a promising solution, given their world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior--a phenomenon we term agreement bias. This bias is pervasive, resilient to test-time scaling, and can harm applications relying on MLLM judgments/rewards (e.g., self-improvement, steering, online supervision). We discuss several considerations for evaluating and designing MLLM verifiers, and introduce SGV, a lightweight method that better leverages their capabilities by modulating (un)conditional generation. First, an MLLM is elicited to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp. In self-improvement and online supervision, they boost task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena--surpassing the previous state of the art by 20pp. As a byproduct, we release an update of VisualWebArena featuring strong agent baselines, more human-aligned oracles, container parallelism with high fidelity and proper resets, >10x speedups, and VWA-Lite, a 1/3 subset with comparable evaluation fidelity.
comment: ICLR 2026. Code, models, and data publicly available at https://mshalimay.github.io/agreement-bias-sgv/
NAAMSE: Framework for Evolutionary Security Evaluation of Agents ICLR 2026
AI agents are increasingly deployed in production, yet their security evaluations remain bottlenecked by manual red-teaming or static benchmarks that fail to model adaptive, multi-turn adversaries. We propose NAAMSE, an evolutionary framework that reframes agent security evaluation as a feedback-driven optimization problem. Our system employs a single autonomous agent that orchestrates a lifecycle of genetic prompt mutation, hierarchical corpus exploration, and asymmetric behavioral scoring. By using model responses as a fitness signal, the framework iteratively compounds effective attack strategies while simultaneously ensuring "benign-use correctness", preventing the degenerate security of blanket refusal. Our experiments across a diverse suite of state-of-the-art large language models demonstrate that evolutionary mutation systematically amplifies vulnerabilities missed by one-shot methods, with controlled ablations revealing that the synergy between exploration and targeted mutation uncovers high-severity failure modes. We show that this adaptive approach provides a more realistic and scalable assessment of agent robustness in the face of evolving threats. The code for NAAMSE is open source and available at https://github.com/HASHIRU-AI/NAAMSE.
comment: Published at ICLR 2026 Workshop on Agents in the Wild
Systems and Control (EESS)
Online Tracking with Predictions for Nonlinear Systems with Koopman Linear Embedding
We study the problem of online tracking in unknown nonlinear dynamical systems, where only short-horizon predictions of future target states are available. This setting arises in practical scenarios where full future information and exact system dynamics are unavailable. We focus on a class of nonlinear systems that admit a Koopman linear embedding, enabling the dynamics to evolve linearly in a lifted space. Exploiting this structure, we analyze a model-free predictive tracking algorithm based on Willems' fundamental lemma, which imposes dynamic constraints using only past data within a receding-horizon control framework. We show that, for Koopman-linearizable systems, the cumulative cost and dynamic regret of the nonlinear tracking problem coincide with those of the lifted linear counterpart. Moreover, we prove that the dynamic regret of our algorithm decays exponentially with the prediction horizon, as validated by numerical experiments.
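Data-driven methods based on Willems' fundamental lemma work with Hankel matrices built from recorded trajectories: for a controllable linear time-invariant system driven by a sufficiently exciting input, every trajectory of a given length lies in the column span of such a matrix. The sketch below shows only the data-matrix construction for a scalar signal. The function name `hankel_matrix` is illustrative, and the paper's algorithm and Koopman lifting are not reproduced.

```python
from typing import List, Sequence


def hankel_matrix(signal: Sequence[float], depth: int) -> List[List[float]]:
    """Build a Hankel matrix with `depth` rows from a scalar signal.

    Column j holds the window signal[j : j + depth], so each column is
    one recorded trajectory segment of length `depth`.
    """
    n_cols = len(signal) - depth + 1
    if depth < 1 or n_cols < 1:
        raise ValueError("depth must be between 1 and len(signal)")
    return [[signal[i + j] for j in range(n_cols)] for i in range(depth)]


recorded = [1, 2, 3, 4, 5]
H = hankel_matrix(recorded, depth=3)
# H == [[1, 2, 3],
#       [2, 3, 4],
#       [3, 4, 5]]
```

In a receding-horizon controller, the dynamics constraint is then imposed by requiring the stacked past and future trajectory to be a linear combination of the columns of such matrices.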
Underwater Embodied Intelligence for Autonomous Robots: A Constraint-Coupled Perspective on Planning, Control, and Deployment
Autonomous underwater robots are increasingly deployed for environmental monitoring, infrastructure inspection, subsea resource exploration, and long-horizon exploration. Yet, despite rapid advances in learning-based planning and control, reliable autonomy in real ocean environments remains fundamentally constrained by tightly coupled physical limits. Hydrodynamic uncertainty, partial observability, bandwidth-limited communication, and energy scarcity are not independent challenges; they interact within the closed perception-planning-control loop and often amplify one another over time. This Review develops a constraint-coupled perspective on underwater embodied intelligence, arguing that planning and control must be understood within tightly coupled sensing, communication, coordination, and resource constraints in real ocean environments. We synthesize recent progress in reinforcement learning, belief-aware planning, hybrid control, multi-robot coordination, and foundation-model integration through this embodied perspective. Across representative application domains, we show how environmental monitoring, inspection, exploration, and cooperative missions expose distinct stress profiles of cross-layer coupling. To unify these observations, we introduce a cross-layer failure taxonomy spanning epistemic, dynamic, and coordination breakdowns, and analyze how errors cascade across autonomy layers under uncertainty. Building on this structure, we outline research directions toward physics-grounded world models, certifiable learning-enabled control, communication-aware coordination, and deployment-aware system design. By internalizing constraint coupling rather than treating it as an external disturbance, underwater embodied intelligence may evolve from performance-driven adaptation toward resilient, scalable, and verifiable autonomy under real ocean conditions.
comment: This article is currently under review
A Novel Phase-Noise Module for the QUCS Circuit Simulator. Part II: Noise Analysis
The paper documents the implementation of a novel phase-noise analysis module within the open-source QUCS circuit simulator environment. The underlying algorithm is based on a rigorous, unified time-domain methodology of (coupled) oscillator noise-response, recently proposed by the authors. The theoretical approach used to develop this model is entirely unconstrained by any empirical and/or phenomenological modelling techniques, such as LTI and LTV theory, and this differentiates it from all prior proposals on this topic. The paper introduces important, and previously unpublished, extensions to this framework, in the form of novel unified closed-form expressions for both the amplitude and phase-amplitude correlation response of a general coupled oscillating circuit perturbed by noise. The research discussed herein has many important scientific and industrial applications w.r.t. predicting, synthesizing and optimizing the performance of noise-perturbed free-running and coupled autonomous circuits operating under large-signal steady-state conditions. These timing circuits are ubiquitous in all modern communication and remote-sensing systems and the developed simulation tools will prove to have great impact in various areas of industrial circuit design. This paper is the second part of a two-part series, with the first part discussing the implementation of the underlying steady-state analysis module. The open-source simulator, discussed and developed herein, applies advanced state-of-the-art stochastic modelling techniques in order to produce noise simulation tools with capabilities and scope which, in many areas, exceed what is found in the commercial EDAs currently on the market.
comment: 16 pages, 6 figures
Leveraging Quantum Annealing for Large-Scale Household Energy Scheduling with Hydrogen Storage
Hydrogen integration into microgrids facilitates the absorption of intermittencies from renewable energy resources. However, significant challenges remain due to complex optimization problems, particularly in large-scale applications involving multiple fuel cells (FCs) and electrolyzers (ELs) with numerous binary decision variables. This paper presents a hierarchical quantum annealing (QA) model predictive control-based power allocation framework aimed at accelerating these optimization problems. First, in a day-ahead stage, the framework determines the startup and shutdown of the FCs and ELs. The short-term stage then refines the output power of the FCs and the hydrogen generation rate of the ELs. The feasibility is evaluated through a case study consisting of multiple households in Australia. Our findings demonstrate that while the traditional optimization approach performs satisfactorily in scenarios with a small number of households, the QA approach becomes more appropriate and effectively solves the problem within an acceptable range as the number of connected households increases.
A Curved Monopole Antenna for HF Radar with Enhanced Gain and Bandwidth
This paper presents the design and simulation of a new curved monopole antenna optimized for skywave HF radar applications, with a systematic investigation of the effects of curvature and fixed-section length on antenna performance. The proposed design achieves improved impedance matching, broader bandwidth, and enhanced realized gain compared to a conventional quarter-wavelength monopole at 15 MHz. Parametric analysis shows that fully bending the monopole degrades performance, whereas introducing a straight section and carefully optimizing the curvature enables an 18.5% gain increase and a 400 kHz bandwidth expansion. The single-element design is further extended to a 12-element linear array with 0.45λ spacing (where λ is the wavelength), demonstrating stable embedded-element behavior and improved low-to-moderate elevation gain for skywave over-the-horizon radar operation. At θ = 30°, the proposed array achieves 14.04 dBi compared to 13.11 dBi for the reference array, corresponding to a 24% gain enhancement, which is significant in high-power HF radar systems. These results confirm that the proposed curved monopole antenna provides a compact, broadband, and scalable solution for next-generation HF radar arrays.
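The 24% figure quoted for the array follows from converting the dBi difference to a linear power ratio. A short check of that arithmetic, using only the two gain values stated in the abstract (the variable and function names are illustrative):

```python
def db_to_linear_ratio(delta_db: float) -> float:
    """Convert a gain difference in dB to a linear power ratio."""
    return 10.0 ** (delta_db / 10.0)


proposed_dbi = 14.04   # proposed array at theta = 30 degrees (from the abstract)
reference_dbi = 13.11  # reference array at theta = 30 degrees (from the abstract)

delta_db = proposed_dbi - reference_dbi          # about 0.93 dB
ratio = db_to_linear_ratio(delta_db)             # about 1.24
percent_increase = (ratio - 1.0) * 100.0         # about 24 percent
```

A difference of roughly 0.93 dB therefore corresponds to about 1.24 times the linear gain, consistent with the stated enhancement.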
Temperature-Aware Scheduling of LLM Inference in Large-Scale Geo-Distributed Edge Data Centers with Distributed Optimization
The environmental impact of Large Language Models (LLMs) on data centers hosting these models is becoming a significant concern. While many efforts have focused on reducing the substantial training overhead of LLMs, carbon and water consumption during the inference phase can often surpass the costs associated with their training. The cooling systems of data centers are crucial in this context, but they are frequently modeled with a location-independent efficiency term. However, their energy efficiency is highly influenced by ambient temperature, which can vary significantly across different geographical locations. Leveraging this temperature diversity can help reduce total cooling energy costs and improve the performance of edge data centers. To address these critical sustainability issues related to LLMs, this study proposes a temperature-aware approach that co-optimizes LLM energy costs, carbon emissions, time-to-first token, and water consumption. The approach employs a distributed optimization algorithm based on an alternating direction method of multipliers, aimed at enhancing the sustainability of LLM hosting across geo-distributed edge data centers in Australia. Our method demonstrates reductions in cooling energy consumption and improves overall cost efficiency for geo-distributed cloud environments.
Robust Cooperative Output Regulation of Discrete-Time Heterogeneous Multi-Agent Systems
This article considers robust cooperative output regulation of discrete-time uncertain heterogeneous (in dimension) multi-agent systems (MASs). We show that the solvability of this problem with an internal model-based distributed control law reduces to the existence of a structured control gain that makes the nominal closed-loop system matrix of the MAS Schur. Accordingly, this article focuses on global and agent-wise local sufficient conditions for the existence and design of such a structured control gain. Based on a structured Lyapunov inequality, we present a convexification that yields a linear matrix inequality (LMI), whose feasibility is a global sufficient condition for the existence and design. Considering the individual nominal dynamics of each agent, the existence is also ensured if each agent solves a structure-free control problem. Its convexification yields LMIs that allow each agent to separately design its structure-free control gain. Lastly, we study the relationships between the sets of control gains emerging from both global and local perspectives.
comment: Under review
Tunable Input-to-State Safety with Input Constraints
Tunable input-to-state safety (TISSf) generalizes the input-to-state safety (ISSf) framework by incorporating a tuning function that regulates safety conservatism while preserving robustness against perturbations. Despite its flexibility, the TISSf tuning function is often designed without explicitly incorporating actuator limits, which can lead to incompatibility with input constraints. To address this gap, this paper proposes a framework that integrates general compact input constraints into tuning function synthesis. Leveraging a geometric perspective, we characterize the TISSf condition as a state-dependent half-space constraint and derive a verifiable certificate for input compatibility using support functions. This characterization transforms the compatibility requirement into a design constraint on the tuning function, yielding a prescriptive lower bound that defines an admissible family of tunings under input constraints. These results are specialized to norm-bounded, polyhedral, and box constraints, yielding tractable control design conditions. We show that these conditions, combined with tuning function monotonicity, guarantee input compatibility and recursive feasibility of the resulting quadratic program (QP)-based safety filter. Furthermore, an offline parameter selection procedure using a covering-based sampling strategy ensures compatibility across the entire safe set via a linear program (LP). A connected cruise control (CCC) application demonstrates robust safety under TISSf while enforcing input constraints by design.
A Lightweight MPC Bidding Framework for Brand Auction Ads
Brand advertising plays a critical role in building long-term consumer awareness and loyalty, making it a key objective for advertisers across digital platforms. Although real-time bidding has been extensively studied, there is limited literature on algorithms specifically tailored for brand auction ads that fully leverage their unique characteristics. In this paper, we propose a lightweight Model Predictive Control (MPC) framework designed for brand advertising campaigns, exploiting the inherent attributes of brand ads -- such as stable user engagement patterns and fast feedback loops -- to simplify modeling and improve efficiency. Our approach utilizes online isotonic regression to construct monotonic bid-to-spend and bid-to-conversion models directly from streaming data, eliminating the need for complex machine learning models. The algorithm operates fully online with low computational overhead, making it highly practical for real-world deployment. Simulation results demonstrate that our approach significantly improves spend efficiency and cost control compared to baseline strategies, providing a scalable and easily implementable solution for modern brand advertising platforms.
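Isotonic regression fits the best non-decreasing curve to data, which is what makes it a natural choice for monotonic bid-to-spend models. Below is a minimal batch sketch using the pool-adjacent-violators algorithm. It assumes observations are already sorted by bid, the function name `isotonic_fit` and the sample numbers are hypothetical, and it does not reproduce the online variant or any other detail of the paper's method.

```python
from typing import List, Optional, Sequence


def isotonic_fit(ys: Sequence[float],
                 weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted least-squares non-decreasing fit via pool adjacent violators.

    `ys` are responses (e.g. observed spend) ordered by increasing bid.
    """
    if weights is None:
        weights = [1.0] * len(ys)
    # Each block stores [mean, total_weight, number_of_points].
    blocks: List[List[float]] = []
    for y, w in zip(ys, weights):
        blocks.append([float(y), float(w), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_b, weight_b, count_b = blocks.pop()
            mean_a, weight_a, count_a = blocks.pop()
            total = weight_a + weight_b
            pooled = (mean_a * weight_a + mean_b * weight_b) / total
            blocks.append([pooled, total, count_a + count_b])
    fitted: List[float] = []
    for mean, _, count in blocks:
        fitted.extend([mean] * int(count))
    return fitted


# Hypothetical spend observations at four increasing bid levels.
observed_spend = [1.0, 3.0, 2.0, 4.0]
monotone_spend = isotonic_fit(observed_spend)
# monotone_spend == [1.0, 2.5, 2.5, 4.0]
```

The out-of-order pair (3.0, 2.0) is pooled to its mean, so the fitted curve never decreases as the bid increases.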
VB-NET: A physics-constrained gray-box deep learning framework for modeling air conditioning systems as virtual batteries
The increasing penetration of renewable energy necessitates unlocking demand-side flexibility. While air conditioning (AC) systems offer significant thermal inertia, existing physical and data-driven models struggle with parameter acquisition, interpretability, and data scarcity. This paper proposes VB-NET, a physics-constrained gray-box deep learning framework that transforms complex AC thermodynamics into a standardized Virtual Battery (VB) model. We first mathematically prove the isomorphic equivalence between the AC and VB models. Subsequently, VB-NET is designed to strictly enforce physical laws by decoupling shared meteorological drivers from private building thermal fingerprints and embedding a differentiable physics layer. Experimental results demonstrate that VB-NET significantly outperforms conventional black-box models in state of charge tracking while successfully recovering underlying thermodynamic laws to yield physically consistent parameters. Furthermore, utilizing multi-task learning and terminal sensitivity modulation, VB-NET overcomes the cold-start dilemma, achieving high-precision modeling for new AC units using only 2% to 6% of historical data. Ultimately, this study provides an interpretable and data-efficient pathway for aggregating decentralized AC resources for grid regulation.
Inverse-dynamics observer design for a linear single-track vehicle model with distributed tire dynamics
Accurate estimation of the vehicle's sideslip angle and tire forces is essential for enhancing safety and handling performance in unknown driving scenarios. To this end, the present paper proposes an innovative observer that combines a linear single-track model with a distributed representation of the tires and information collected from standard sensors. In particular, by adopting a comprehensive representation of the tires in terms of hyperbolic partial differential equations (PDEs), the proposed estimation strategy exploits dynamical inversion to reconstruct the lumped and distributed vehicle states solely from yaw rate and lateral acceleration measurements. Simulation results demonstrate the effectiveness of the observer in estimating the sideslip angle and tire forces even in the presence of noise and model uncertainties.
comment: 6 pages, 5 figures. Accepted at ECC 2026
IQC-Based Output-Feedback Control of LPV Systems with Time-Varying Input Delays
Input delays are a common source of performance degradation and instability in control systems. This paper addresses the $\mathcal{H}_\infty$ output-feedback control problem for LPV systems with time-varying input delays under the integral quadratic constraint (IQC) framework. By integrating parameter-dependent Lyapunov functions with dynamic IQC multipliers, we derive convex, delay-dependent synthesis conditions formulated as parameter-dependent LMIs, enabled by the proposed exact-memory controller structure. An explicit controller reconstruction formula is provided to recover the LPV controller from the LMI solution, avoiding the need to specify the functional form of the parameter-dependent controller gains. While the synthesis problem for memoryless control is inherently non-convex, the proposed approach demonstrates significant performance improvement, reduced conservatism, and computational efficiency for standard output-feedback design. Numerical examples illustrate the effectiveness and broad applicability of the method to LPV systems with time-varying input delays.
Cost-Driven Representation Learning for Linear Quadratic Gaussian Control: Part II
We study the problem of state representation learning for control from partial and potentially high-dimensional observations. We approach this problem via cost-driven state representation learning, in which we learn a dynamical model in a latent state space by predicting cumulative costs. In particular, we establish finite-sample guarantees on finding a near-optimal representation function and a near-optimal controller using the learned latent model for infinite-horizon time-invariant Linear Quadratic Gaussian (LQG) control. We study two approaches to cost-driven representation learning, which differ in whether the transition function of the latent state is learned explicitly or implicitly. The first approach has also been investigated in Part I of this work, for finite-horizon time-varying LQG control. The second approach closely resembles MuZero, a recent breakthrough in empirical reinforcement learning, in that it learns latent dynamics implicitly by predicting cumulative costs. A key technical contribution of this Part II is to prove persistency of excitation for a new stochastic process that arises from the analysis of quadratic regression in our approach, and may be of independent interest.
comment: 38 pages; preliminary version appeared in IEEE CDC 2023; this is the extended journal version, with an end-to-end guarantee added
Machine Learning for the Internet of Underwater Things: From Fundamentals to Implementation
The Internet of Underwater Things (IoUT) is becoming a critical infrastructure for ocean observation, marine resource management, and climate science. Its development is hindered by severe acoustic attenuation, propagation delays far exceeding those of terrestrial wireless systems, strict energy constraints, and dynamic topologies shaped by ocean currents. Machine learning (ML) has emerged as a key enabler for addressing these limitations, offering data-driven mechanisms that enhance performance across all layers of underwater wireless sensor networks. This tutorial survey synthesises ML methodologies (supervised, unsupervised, reinforcement, and deep learning) specifically contextualised for underwater communication environments. It outlines the algorithmic principles of each paradigm and examines the conditions under which particular approaches deliver superior performance. A layer-wise analysis highlights physical-layer gains in localisation and channel estimation, MAC-layer adaptations that improve channel utilisation, network-layer routing strategies that extend operational lifetime, and transport-layer mechanisms capable of reducing packet loss by up to 91 percent. At the application layer, ML enables substantial data compression and object detection accuracies reaching 92 percent. Drawing on 300 studies from 2012 to 2025, the survey documents energy efficiency gains of 7 to 29 times, throughput improvements over traditional protocols, and cross-layer optimisation benefits of up to 42 percent. It also identifies persistent barriers, including limited datasets, computational constraints, and the gap between theoretical models and real-world deployment. The survey concludes with emerging research directions and a technology roadmap supporting ML adoption in operational underwater networks.
comment: 78 pages, 14 figures
Ego-Vision World Model for Humanoid Contact Planning
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Code and dataset are available at our website: https://ego-vcp.github.io/
Toward 6G Sidelink Reliability: MAC PRR Modeling for NR Mode 2 SPS and ns-3 Validation
5G New Radio (NR) Sidelink (SL) Mode 2 has enabled decentralized, infrastructure-less direct communication, which is evolving to serve reliability-critical services in 6G SL. In particular, channel access in NR SL Mode 2 relies on sensing-based Semi-Persistent Scheduling (SPS), whose key features significantly influence the packet reception ratio (PRR). While SPS has been widely studied, existing analytical models typically abstract or omit several NR-specific SPS features that are standardized in the 3rd Generation Partnership Project (3GPP), limiting their ability to explain how SPS parameters shape MAC collision dynamics and PRR. This paper develops an analytical MAC-layer PRR model for broadcast NR SL Mode 2 by explicitly modeling SPS-driven MAC collision events. The model captures (i) collisions caused by simultaneous resource reselection and (ii) persistent collisions induced by resource keeping across resource reservation intervals (RRIs). Based on the event-level characterization, we derive closed-form expressions for the steady-state MAC collision probability and PRR. We further extend the analysis to incorporate under-explored SPS features, including duplicate transmissions per RRI and the minimum resource-availability requirement for reselection, and quantify their impact on PRR in under-saturated regimes. The analytical results are validated using ns-3 simulations based on the 5G-LENA framework, showing close agreement under under-saturation and revealing deviations as the system approaches saturation. The proposed model provides mechanistic insight and design guidance for tuning the SPS parameters to improve 6G SL reliability.
comment: This work has been submitted to the IEEE for possible publication. 29 pages, 22 figures
Power flow and optimal power flow using quantum and digital annealers: a computational scalability analysis
This study further explores reformulating power flow (PF) analysis as a discrete combinatorial optimization problem, proposed in our earlier study using the Adiabatic Quantum Power Flow (AQPF) algorithm, which can be executed on Ising machines, including quantum and quantum-inspired hardware. This approach provides a new representation of the underlying equations, analogous to how neural networks approximate complex functions using simple operations. While the resulting combinatorial optimization problem is NP-hard, it is compatible with emerging quantum hardware designed to address such complexity. We introduce the Adiabatic Quantum Optimal Power Flow (AQOPF) algorithm, which transforms the classical optimal power flow (OPF) equations into quadratic unconstrained binary optimization (QUBO) models. Furthermore, the AQPF and AQOPF algorithms are evaluated on standard test cases ranging from 4- to 1354-bus systems using D-Wave's Advantage™ system (QA), its hybrid quantum-classical solver (HA), and Fujitsu's third-generation Digital Annealer (DAv3) and Quantum-Inspired Integrated Optimization (QIIO) platform. Both full and partitioned formulations are investigated, with particular attention to scalability and robustness in ill-conditioned scenarios. The results demonstrate that the algorithms can reproduce feasible PF and OPF solutions and exhibit promising computational scalability when supported by scalable hardware.
comment: 17 pages, 2 pseudo codes, 2 figures, 5 tables
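As a generic illustration of the QUBO form mentioned in the abstract above (not the AQPF or AQOPF formulation, which is not given here), the following minimal sketch encodes the toy constraint x0 + x1 = 1 as a quadratic penalty over binary variables and finds its minima by exhaustive search. The function names and the toy constraint are hypothetical; an annealer would replace the brute-force loop.

```python
from itertools import product

def qubo_energy(Q, x):
    """Energy sum_{(i,j)} Q[i,j] * x_i * x_j for a binary assignment x."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def brute_force_minima(Q, n):
    """Enumerate all 2**n assignments; only feasible for very small n."""
    best, argmins = None, []
    for x in product((0, 1), repeat=n):
        e = qubo_energy(Q, x)
        if best is None or e < best:
            best, argmins = e, [x]
        elif e == best:
            argmins.append(x)
    return best, argmins

# Penalty (x0 + x1 - 1)^2 expands, using x^2 = x for binary x, to
# -x0 - x1 + 2*x0*x1 + 1; the constant +1 is dropped from Q.
Q = {(0, 0): -1, (1, 1): -1, (0, 1): 2}
best_energy, minimizers = brute_force_minima(Q, 2)
```

The two minimizers are exactly the assignments that satisfy the constraint, which is the property a penalty-based QUBO encoding relies on.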
Robustness to Model Approximation, Model Learning From Data, and Sample Complexity in Wasserstein Regular MDPs
The paper studies the robustness properties of discrete-time stochastic optimal control under Wasserstein model approximation for both discounted-cost and average-cost criteria. Specifically, we study the performance loss when applying an optimal policy designed for an approximate model to the true dynamics compared with the optimal cost for the true model under the sup-norm-induced metric, and relate it to the Wasserstein-1 distance between the approximate and true transition kernels. A primary motivation of this analysis is empirical model learning, as well as empirical noise distribution learning, where Wasserstein convergence holds under mild conditions but stronger convergence criteria, such as total variation, may not. We discuss applications of the results to the disturbance estimation problem, where sample complexity bounds are given, and also to a general empirical model learning approach, obtained under either Markov or i.i.d. learning settings.
comment: 38 pages
ORN-CBF: Learning Observation-conditioned Residual Neural Control Barrier Functions via Hypernetworks
Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.
Cost-Driven Representation Learning for Linear Quadratic Gaussian Control: Part I
We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a cost-driven approach, where a dynamic model in some latent state space is learned by predicting the costs without predicting the observations or actions. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model, for finite-horizon time-varying LQG control problems. To the best of our knowledge, despite various empirical successes, finite-sample guarantees of such a cost-driven approach remain elusive. Our result underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations. A second part of this work, that is to appear as Part II, addresses the infinite-horizon linear time-invariant setting; it also extends the results to an approach that implicitly learns the latent dynamics, inspired by the recent empirical breakthrough of MuZero in model-based reinforcement learning.
comment: 51 pages; preliminary version appeared in L4DC 2023; this is the extended journal version, with an end-to-end guarantee added
Robotics
A Distributed Gaussian Process Model for Multi-Robot Mapping ICRA 2026
We propose DistGP: a multi-robot learning method for collaborative learning of a global function using only local experience and computation. We utilise a sparse Gaussian process (GP) model with a factorisation that mirrors the multi-robot structure of the task, and admits distributed training via Gaussian belief propagation (GBP). Our loopy model outperforms Tree-Structured GPs \cite{bui2014tree} and can be trained online and in settings with dynamic connectivity. We show that such distributed, asynchronous training can reach the same performance as a centralised, batch-trained model, albeit with slower convergence. Last, we compare to DiNNO \cite{yu2022dinno}, a distributed neural network (NN) optimiser, and find DistGP achieves superior accuracy, is more robust to sparse communication and is better able to learn continually.
comment: ICRA 2026, 8 pages
A Lightweight Digital-Twin-Based Framework for Edge-Assisted Vehicle Tracking and Collision Prediction
Vehicle tracking, motion estimation, and collision prediction are fundamental components of traffic safety and management in Intelligent Transportation Systems (ITS). Many recent approaches rely on computationally intensive prediction models, which limits their practical deployment on resource-constrained edge devices. This paper presents a lightweight digital-twin-based framework for vehicle tracking and spatiotemporal collision prediction that relies solely on object detection, without requiring complex trajectory prediction networks. The framework is implemented and evaluated in Quanser Interactive Labs (QLabs), a high-fidelity digital twin of an urban traffic environment that enables controlled and repeatable scenario generation. A YOLO-based detector is deployed on simulated edge cameras to localize vehicles and extract frame-level centroid trajectories. Offline path maps are constructed from multiple traversals and indexed using K-D trees to support efficient online association between detected vehicles and road segments. During runtime, consistent vehicle identifiers are maintained, vehicle speed and direction are estimated from the temporal evolution of path indices, and future positions are predicted accordingly. Potential collisions are identified by analyzing both spatial proximity and temporal overlap of predicted future trajectories. Our experimental results across diverse simulated urban scenarios show that the proposed framework predicts approximately 88% of collision events prior to occurrence while maintaining low computational overhead suitable for edge deployment. Rather than introducing a computationally intensive prediction model, this work introduces a lightweight digital-twin-based solution for vehicle tracking and collision prediction, tailored for real-time edge deployment in ITS.
comment: 6 pages, 2 figures, IEEE ICC 2026 Workshops (under submission)
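A minimal sketch of the path association and collision check described in the abstract above, under stated assumptions: a brute-force nearest-point search stands in for the K-D tree, vehicles advance along their path at a constant number of path indices per step, and a conflict is declared when two predicted positions at the same future step lie within a fixed radius. All function names, the motion model, and the toy paths are hypothetical and not taken from the paper.

```python
import math

def nearest_index(path, point):
    """Index of the path point closest to a detected centroid
    (brute-force stand-in for a K-D tree query)."""
    return min(range(len(path)), key=lambda i: math.dist(path[i], point))

def predict(path, index, speed, horizon):
    """Future positions for steps 1..horizon, advancing `speed` path
    indices per step and clamping at the end of the path."""
    last = len(path) - 1
    return [path[min(index + speed * t, last)] for t in range(1, horizon + 1)]

def first_conflict(pred_a, pred_b, radius):
    """First future step at which both vehicles are within `radius`
    of each other at the same time, or None if there is no such step."""
    for t, (pa, pb) in enumerate(zip(pred_a, pred_b), start=1):
        if math.dist(pa, pb) < radius:
            return t
    return None

# Two straight paths crossing at (5, 0).
path_a = [(float(x), 0.0) for x in range(11)]       # eastbound
path_b = [(5.0, float(y)) for y in range(-5, 6)]    # northbound

idx_a = nearest_index(path_a, (0.1, 0.2))
idx_b = nearest_index(path_b, (5.0, -5.0))
same_speed = first_conflict(predict(path_a, idx_a, 1, 5),
                            predict(path_b, idx_b, 1, 5), radius=1.0)
faster_b = first_conflict(predict(path_a, idx_a, 1, 5),
                          predict(path_b, idx_b, 2, 5), radius=1.0)
```

In the second case the paths still intersect in space, but the faster vehicle clears the crossing before the other arrives, so no conflict is reported. This is the distinction between spatial proximity and temporal overlap.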
Faster-HEAL: An Efficient and Privacy-Preserving Collaborative Perception Framework for Heterogeneous Autonomous Vehicles
Collaborative perception (CP) is a promising paradigm for improving situational awareness in autonomous vehicles by overcoming the limitations of single-agent perception. However, most existing approaches assume homogeneous agents, which restricts their applicability in real-world scenarios where vehicles use diverse sensors and perception models. This heterogeneity introduces a feature domain gap that degrades detection performance. Prior works address this issue by retraining entire models/major components, or using feature interpreters for each new agent type, which is computationally expensive, compromises privacy, and may reduce single-agent accuracy. We propose Faster-HEAL, a lightweight and privacy-preserving CP framework that fine-tunes a low-rank visual prompt to align heterogeneous features with a unified feature space while leveraging pyramid fusion for robust feature aggregation. This approach reduces the trainable parameters by 94%, enabling efficient adaptation to new agents without retraining large models. Experiments on the OPV2V-H dataset show that Faster-HEAL improves detection performance by 2% over state-of-the-art methods with significantly lower computational overhead, offering a practical solution for scalable heterogeneous CP.
comment: Accepted to appear in the 2026 IEEE Intelligent Vehicles Symposium (IV 2026), Detroit, MI, USA, June 22-25, 2026. 6 pages, 1 figure, 4 tables
Soft Rigid Hybrid Gripper with Inflatable Silicone Pockets for Tunable Frictional Grasping
Grasping objects with diverse mechanical properties, such as heavy, slippery, or fragile items, remains a significant challenge in robotics. Conventional rigid grippers typically rely on increasing the normal forces to secure an object; however, the excessive force can damage fragile objects. To address this limitation, we propose a soft rigid hybrid gripper finger that combines rigid structural shells with soft, inflatable silicone pockets, which could be integrated into a conventional gripper. The hybrid gripper can actively modulate its surface friction by varying the internal air pressure of the silicone pockets, enabling the gripper to securely grasp objects without increasing the gripping force. This is demonstrated by fundamental experimental results, in which an increase in internal pressure leads to a proportional increase in the effective coefficient of friction. The gripping experiments also show that the integrated gripper can stably lift heavy and slippery objects or fragile, deformable objects, such as eggs, tofu, fruits, and paper cups, with minimal damage by increasing friction rather than applying high force.
Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving SC
Data-efficient learning remains a central challenge in autonomous driving due to the high cost and safety risks of large-scale real-world interaction. Although world-model-based reinforcement learning enables policy optimization through latent imagination, existing approaches often lack explicit mechanisms to encode spatial and kinematic structure essential for driving tasks. In this work, we build upon the Recurrent State-Space Model (RSSM) and propose a kinematics-aware latent world model framework for autonomous driving. Vehicle kinematic information is incorporated into the observation encoder to ground latent transitions in physically meaningful motion dynamics, while geometry-aware supervision regularizes the RSSM latent state to capture task-relevant spatial structure beyond pixel reconstruction. The resulting structured latent dynamics improve long-horizon imagination fidelity and stabilize policy optimization. Experiments in a driving simulation benchmark demonstrate consistent gains over both model-free and pixel-based world-model baselines in terms of sample efficiency and driving performance. Ablation studies further verify that the proposed design enhances spatial representation quality within the latent space. These results suggest that integrating kinematic grounding into RSSM-based world models provides a scalable and physically grounded paradigm for autonomous driving policy learning.
comment: 6 pages, 5 figures. Under review at IEEE ITSC
Vision-Guided MPPI for Agile Drone Racing: Navigating Arbitrary Gate Poses via Neural Signed Distance Fields
Autonomous drone racing requires the tight coupling of perception, planning, and control under extreme agility. However, recent approaches typically rely on precomputed spatial reference trajectories or explicit 6-DoF gate pose estimation, rendering them brittle to spatial perturbations, unmodeled track changes, and sensor noise. Conversely, end-to-end learning policies frequently overfit to specific track layouts and struggle with zero-shot generalization. To address these fundamental limitations, we propose a fully onboard, vision-guided optimal control framework that enables reference-free agile flight through arbitrarily placed and oriented gates. Central to our approach is Gate-SDF, a novel, implicitly learned neural signed distance field. Gate-SDF directly processes raw, noisy depth images to predict a continuous spatial field that provides both collision repulsion and active geometric guidance toward the valid traversal area. We seamlessly integrate this representation into a sampling-based Model Predictive Path Integral (MPPI) controller. By fully exploiting GPU parallelism, the framework evaluates these continuous spatial constraints across thousands of simulated trajectory rollouts simultaneously in real time. Furthermore, our formulation inherently maintains spatial consistency, ensuring robust navigation even under severe visual occlusion during aggressive maneuvers. Extensive simulations and real-world experiments demonstrate that the proposed system achieves high-speed agile flight and successfully navigates unseen tracks subject to severe unmodeled gate displacements and orientation perturbations. Videos are available at https://zhaofangguo.github.io/vision_guided_mppi/
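The sampling-based MPPI update referred to above can be sketched in its standard textbook form: each sampled rollout receives a weight that decays exponentially with its cost, and the nominal control sequence is shifted by the weighted average of the sampled perturbations. This is a minimal sketch with scalar controls; the function names and the temperature parameter are assumptions, and the Gate-SDF cost and the GPU rollout machinery of the paper are not reproduced.

```python
import math

def mppi_weights(costs, temperature):
    """Softmin weights over rollout costs. The minimum cost is subtracted
    before exponentiating for numerical stability."""
    c_min = min(costs)
    unnorm = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(unnorm)
    return [w / total for w in unnorm]

def mppi_update(nominal, perturbations, costs, temperature):
    """Shift each nominal control by the cost-weighted mean perturbation.

    nominal:        list of T scalar controls
    perturbations:  K sampled perturbation sequences, each of length T
    costs:          K rollout costs, one per perturbation sequence
    """
    w = mppi_weights(costs, temperature)
    return [u + sum(wk * eps[t] for wk, eps in zip(w, perturbations))
            for t, u in enumerate(nominal)]

# Two rollouts with equal cost contribute equally to the update.
updated = mppi_update([0.0, 0.0], [[1.0, -1.0], [3.0, 3.0]], [0.0, 0.0], 1.0)
```

A lower temperature concentrates the weight on the cheapest rollouts, while a higher one averages more broadly over the samples.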
RoTri-Diff: A Spatial Robot-Object Triadic Interaction-Guided Diffusion Model for Bimanual Manipulation ICRA 2026
Bimanual manipulation is a fundamental robotic skill that requires continuous and precise coordination between two arms. While imitation learning (IL) is the dominant paradigm for acquiring this capability, existing approaches, whether robot-centric or object-centric, often overlook the dynamic geometric relationship among the two arms and the manipulated object. This limitation frequently leads to inter-arm collisions, unstable grasps, and degraded performance in complex tasks. To address this, we explicitly model the Robot-Object Triadic Interaction (RoTri) representation in bimanual systems by encoding the relative 6D poses between the two arms and the object to capture their spatial triadic relationship and establish continuous triangular geometric constraints. Building on this, we further introduce RoTri-Diff, a diffusion-based imitation learning framework that combines RoTri constraints with robot keyposes and object motion in a hierarchical diffusion process. This enables the generation of stable, coordinated trajectories and robust execution across different modes of bimanual manipulation. Extensive experiments show that our approach outperforms state-of-the-art baselines by 10.2% on 11 representative RLBench2 tasks and achieves stable performance on 4 challenging real-world bimanual tasks. Project website: https://rotri-diff.github.io/.
comment: ICRA 2026
Tutorial on Aided Inertial Navigation Systems: A Modern Treatment Using Lie-Group Theoretical Methods
This tutorial presents a control-oriented introduction to aided inertial navigation systems using a Lie-group formulation centered on the extended Special Euclidean group SE_2(3). The focus is on developing a clear and implementation-oriented geometric framework for fusing inertial measurements with aiding information, while making the role of invariance and symmetry explicit. Recent extensions, including higher-order state representations, synchronous observer designs, and equivariant filtering methods, are discussed as natural continuations of the same underlying principles. The goal is to provide readers with a coherent system-theoretic perspective that supports both understanding and practical use of modern aided inertial navigation methods.
Model-based thermal drift compensation for high-precision hexapod robot actuators
Thermal expansion is a significant source of positioning error in high-precision hexapod robots (Gough-Stewart platforms). Any variation in the temperature of the hexapod's parts induces expansion, which alters their kinematic model and reduces the robot's accuracy and repeatability. These variations may arise from internal heat sources (such as motors, encoders, and electronics) or from environmental changes. In this study, a method is proposed to anticipate and therefore correct the thermal drift of one of the hexapod precision electro-mechanical actuators. This method is based on determining a model that links the expansion state of the actuator at any given moment to the temperature of some well-chosen points on its surface. This model was initially developed theoretically. Its coefficients were then adjusted experimentally on a specific test-bench, based on a rigorous measurement campaign of actuator expansion using a high-precision interferometric measurement system. Experimental validation demonstrates a reduction of thermally induced expansion by more than 80%. This paves the way for thermal drift correction across the entire robot or similar robotics parts.
DexKnot: Generalizable Visuomotor Policy Learning for Dexterous Bag-Knotting Manipulation
Knotting plastic bags is a common task in daily life, yet it is challenging for robots due to the bags' infinite degrees of freedom and complex physical dynamics. Existing methods often struggle to generalize to unseen bag instances or deformations. To address this, we present DexKnot, a framework that combines keypoint affordance with diffusion policy to learn a generalizable bag-knotting policy. Our approach learns a shape-agnostic representation of bags from keypoint correspondence data collected through real-world manual deformation. For an unseen bag configuration, the keypoints can be identified by matching the representation to a reference. These keypoints are then provided to a diffusion transformer, which generates robot actions based on a small number of human demonstrations. DexKnot enables effective policy generalization by reducing the dimensionality of the observation space to a sparse set of keypoints. Experiments show that DexKnot achieves reliable and consistent knotting performance across a variety of previously unseen instances and deformations.
Efficient Trajectory Optimization for Autonomous Racing via Formula-1 Data-Driven Initialization
Trajectory optimization is a central component of fast and efficient autonomous racing. However, practical optimization pipelines remain highly sensitive to initialization and may converge slowly or to suboptimal local solutions when seeded with heuristic trajectories such as the centerline or minimum-curvature paths. To address this limitation, we leverage expert driving behavior as an initialization prior and propose a learning-informed initialization strategy based on real-world Formula 1 telemetry. To this end, we first construct a multi-track Formula 1 trajectory dataset by reconstructing and aligning noisy GPS telemetry to a standardized reference-line representation across 17 tracks. Building on this, we present a neural network that predicts an expert-like raceline offset directly from local track geometry, without explicitly modeling vehicle dynamics or forces. The predicted raceline is then used as an informed seed for a minimum-time optimal control solver. Experiments on all 17 tracks demonstrate that the learned initialization accelerates solver convergence and significantly reduces runtime compared to traditional geometric baselines, while preserving the final optimized lap time.
Learning From Failures: Efficient Reinforcement Learning Control with Episodic Memory
Reinforcement learning has achieved remarkable success in robot learning. However, under challenging exploration and contact-rich dynamics, early-stage training is frequently dominated by premature terminations such as collisions and falls. As a result, learning is overwhelmed by short-horizon, low-return trajectories, which hinder convergence and limit long-horizon exploration. To alleviate this issue, we propose a technique called Failure Episodic Memory Alert (FEMA). FEMA explicitly stores short-horizon failure experiences through an episodic memory module. During interactions, it retrieves similar failure experiences and prevents the robot from recurrently relapsing into unstable states, guiding the policy toward long-horizon trajectories with greater long-term value. FEMA can be combined easily with model-free reinforcement learning algorithms, and yields a substantial sample-efficiency improvement of 33.11% on MuJoCo tasks across several classical RL algorithms. Furthermore, integrating FEMA into a parallelized PPO training pipeline demonstrates its effectiveness on a real-world bipedal robot task.
Towards Scalable Probabilistic Human Motion Prediction with Gaussian Processes for Safe Human-Robot Collaboration IROS 2026
Accurate human motion prediction with well-calibrated uncertainty is critical for safe human-robot collaboration (HRC), where robots must anticipate and react to human movements in real time. We propose a structured multitask variational Gaussian Process (GP) framework for full-body human motion prediction that captures temporal correlations and leverages joint-dimension-level factorization for scalability, while using a continuous 6D rotation representation to preserve kinematic consistency. Evaluated on Human3.6M (H3.6M), our model achieves up to 50 lower kernel density estimate negative log-likelihood (KDE NLL) than strong baselines, a mean continuous ranked probability score (CRPS) of 0.021 m, and deterministic mean angle error (MAE) that is 3-18% higher than competitive deep learning methods. Empirical coverage analysis shows that the fraction of ground-truth outcomes contained within predicted confidence intervals gradually decreases with horizon, remaining conservative for lower-confidence intervals and near-nominal for higher-confidence intervals, with only modest calibration drift at longer horizons. Despite its probabilistic formulation, our model requires only 0.24-0.35 M parameters, roughly eight times fewer than comparable approaches, and exhibits modest inference times, indicating suitability for real-time deployment. Extensive ablation studies further validated the choice of 6D rotation representation and Matern 3/2 + Linear kernel, and guided the selection of the number of inducing points and latent dimensionality. These results demonstrate that scalable GP-based models can deliver competitive accuracy together with reliable and interpretable uncertainty estimates for downstream robotics tasks such as motion planning and collision avoidance.
comment: Submitted to IROS 2026
ACLM: ADMM-Based Distributed Model Predictive Control for Collaborative Loco-Manipulation
Collaborative transportation of heavy payloads via loco-manipulation is a challenging yet essential capability for legged robots operating in complex, unstructured environments. Centralized planning methods, e.g., holistic trajectory optimization, capture dynamic coupling among robots and payloads but scale poorly with system size, limiting real-time applicability. In contrast, hierarchical and fully decentralized approaches often neglect force and dynamic interactions, leading to conservative behavior. This study proposes an Alternating Direction Method of Multipliers (ADMM)-based distributed model predictive control framework for collaborative loco-manipulation with a team of quadruped robots with manipulators. By exploiting the payload-induced coupling structure, the global optimal control problem is decomposed into parallel individual-robot-level subproblems with consensus constraints. The distributed planner operates in a receding-horizon fashion and achieves fast convergence, requiring only a few ADMM iterations per planning cycle. A wrench-aware whole-body controller executes the planned trajectories, tracking both motion and interaction wrenches. Extensive simulations with up to four robots demonstrate scalability, real-time performance, and robustness to model uncertainty.
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
The Talking Robot: Distortion-Robust Acoustic Models for Robot-Robot Communication
We present Artoo, a learned acoustic communication system for robots that replaces hand-designed signal processing with end-to-end co-trained neural networks. Our system pairs a lightweight text-to-speech (TTS) transmitter (1.18M parameters) with a conformer-based automatic speech recognition (ASR) receiver (938K parameters), jointly optimized through a differentiable channel. Unlike human speech, robot-to-robot communication is paralinguistics-free: the system need not preserve timbre, prosody, or naturalness, only maximize decoding accuracy under channel distortion. Through a three-phase co-training curriculum, the TTS transmitter learns to produce distortion-robust acoustic encodings that surpass the baseline under noise, achieving 8.3% CER at 0 dB SNR. The entire system requires only 2.1M parameters (8.4 MB) and runs in under 13 ms end-to-end on a CPU, making it suitable for deployment on resource-constrained robotic platforms.
Morphology-Independent Facial Expression Imitation for Human-Face Robots
Accurate facial expression imitation on human-face robots is crucial for achieving natural human-robot interaction. Most existing methods have achieved photorealistic expression imitation through mapping 2D facial landmarks to a robot's actuator commands. Their imitation of landmark trajectories is susceptible to interference from facial morphology, which would lead to a performance drop. In this paper, we propose a morphology-independent expression imitation method that decouples expressions from facial morphology to eliminate morphological influence and produce more realistic expressions for human-face robots. Specifically, we construct an expression decoupling module to learn expression semantics by disentangling the expression representation from the morphology representation in a self-supervised manner. We devise an expression transfer module to map the representations to the robot's actuator commands through a learning objective of perceiving expression errors, producing accurate facial expressions based on the learned expression semantics. To support experimental validation, a custom-designed and highly expressive human-face robot, namely Pengrui, is developed to serve as an experimental platform for realistic expression imitation. Extensive experiments demonstrate that our method enables the human-face robot to reproduce a wide range of human-like expressions effectively. All code and implementation details of the robot will be released.
GuideTWSI: A Diverse Tactile Walking Surface Indicator Dataset from Synthetic and Real-World Images for Blind and Low-Vision Navigation
Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars - raised parallel strips used for continuous guidance along sidewalks. This narrow focus overlooks truncated domes - rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges. As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments.
TacDexGrasp: Compliant and Robust Dexterous Grasping with Tactile Feedback
Multi-fingered hands offer great potential for compliant and robust grasping of unknown objects, yet their high-dimensional force control presents a significant challenge. This work addresses two key problems: (1) distributing forces across multiple contacts to counteract an object's weight, and (2) preventing rotational slip caused by gravitational torque when a grasp is distant from the object's center of mass. We address these challenges via tactile feedback and a Second-Order Cone Programming (SOCP)-based controller, without explicit torque modeling or slip detection. Our key insights are (1) rotational slip inevitably induces translational slip at some contact points for a multi-fingered grasp, and (2) the ratio of tangential to normal force at each contact is an effective early stability indicator. By actively constraining this ratio for each finger below the estimated friction coefficient, our controller maintains grasp stability against both translational and rotational slip. Real-world experiments on 12 diverse objects demonstrate the robustness and compliance of our approach.
comment: 8 pages, 7 figures
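The stability indicator named in the TacDexGrasp abstract above, the ratio of tangential to normal force at each contact, can be checked with a few lines of code. This is a minimal sketch only: the function names, the safety margin, and the sample force values are hypothetical, and the SOCP-based controller that constrains the ratio is not reproduced.

```python
import math

def friction_ratio(f_tx, f_ty, f_n):
    """Tangential force magnitude divided by normal force at one contact.
    Returns infinity when there is no positive normal load."""
    if f_n <= 0.0:
        return math.inf
    return math.hypot(f_tx, f_ty) / f_n

def contacts_at_risk(forces, mu, margin=0.9):
    """Indices of contacts whose ratio exceeds margin * mu, where mu is
    the estimated friction coefficient and margin < 1 leaves headroom."""
    limit = margin * mu
    return [i for i, (tx, ty, n) in enumerate(forces)
            if friction_ratio(tx, ty, n) > limit]

# Per-finger (tangential x, tangential y, normal) forces in newtons.
finger_forces = [(0.3, 0.4, 2.0), (0.6, 0.8, 1.0), (0.0, 0.0, 0.0)]
risky = contacts_at_risk(finger_forces, mu=0.8)
```

Here the first finger is well inside the friction cone, the second exceeds the limit, and the third carries no normal load at all, so the last two are flagged.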
SSP: Safety-guaranteed Surgical Policy via Joint Optimization of Behavioral and Spatial Constraints
The paradigm of robot-assisted surgery is shifting toward data-driven autonomy, where policies learned via Reinforcement Learning (RL) or Imitation Learning (IL) enable the execution of complex tasks. However, these "black-box" policies often lack formal safety guarantees, a critical requirement for clinical deployment. In this paper, we propose the Safety-guaranteed Surgical Policy (SSP) framework to bridge the gap between data-driven generality and formal safety. We utilize Neural Ordinary Differential Equations (Neural ODEs) to learn an uncertainty-aware dynamics model from demonstration data. This learned model underpins a robust Control Barrier Function (CBF) safety controller, which minimally alters the actions of a surgical policy to ensure strict safety under uncertainty. Our controller enforces two constraint categories: behavioral constraints (restricting the task space of the agent) and spatial constraints (defining surgical no-go zones). We instantiate the SSP framework with surgical policies derived from RL, IL and Control Lyapunov Functions (CLF). Validation in both the SurRoL simulation and on the da Vinci Research Kit (dVRK) demonstrates that our method achieves a near-zero constraint violation rate while maintaining high task success rates compared to unconstrained baselines.
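The idea of a CBF safety controller that minimally alters a policy's action can be sketched for the simplest textbook case: one affine constraint a·u + b >= 0, for which the minimum-norm correction has a closed form (a projection onto a half-space). The sketch assumes known single-integrator dynamics and a circular no-go zone; it does not include the learned Neural ODE model or the robustness terms of SSP, and all names and values are hypothetical.

```python
def cbf_filter(u_nom, a, b):
    """Closest action to u_nom (Euclidean norm) satisfying a.u + b >= 0."""
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)          # nominal action is already safe
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be satisfied by any action")
    scale = slack / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]

def disc_constraint(x, center, radius, alpha):
    """CBF condition for x_dot = u and h(x) = |x - c|^2 - r^2:
    dh/dt + alpha*h = 2(x - c).u + alpha*h >= 0, returned as (a, b)."""
    d = [xi - ci for xi, ci in zip(x, center)]
    h = sum(di * di for di in d) - radius ** 2
    return [2.0 * di for di in d], alpha * h

# Agent at (2, 0), no-go disc of radius 1 at the origin.
a, b = disc_constraint([2.0, 0.0], [0.0, 0.0], 1.0, alpha=1.0)
u_safe = cbf_filter([-2.0, 0.0], a, b)   # policy drives toward the zone
u_free = cbf_filter([1.0, 0.0], a, b)    # policy drives away from it
```

The filter slows the approach toward the zone just enough to meet the constraint with equality and leaves the outward-pointing action untouched.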
Two-Stage Path Following for Mobile Manipulators via Dimensionality-Reduced Graph Search and Numerical Optimization
Efficient path following for mobile manipulators is often hindered by high-dimensional configuration spaces and kinematic constraints. This paper presents a robust two-stage configuration planning framework that decouples the 8-DoF planning problem into a tractable 2-DoF base optimization under a yaw-fixed base planning assumption. In the first stage, the proposed approach utilizes IRM to discretize the task-space path into a multi-layer graph, where an initial feasible path is extracted via a Dijkstra-based dynamic programming approach to ensure computational efficiency and global optimality within the discretized graph. In the second stage, to overcome discrete search quantization, feasible base regions are transformed into convex hulls, enabling subsequent continuous refinement via the L-BFGS algorithm to maximize trajectory smoothness while strictly enforcing reachability constraints. Simulation results demonstrate the precision of the proposed method, which achieves sub-millimeter kinematic accuracy, and physical experiments on an omnidirectional mobile manipulator further validate the framework's robustness and practical applicability.
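The first stage described above, a shortest-path search over a multi-layer graph, can be illustrated with a plain dynamic program. This is a minimal sketch under stated assumptions: each layer holds hypothetical candidate 2-DoF base positions for one waypoint, the edge cost is Euclidean distance between consecutive base positions, and reachability filtering is taken to have happened already. The paper's actual cost terms and its Dijkstra-based variant are not reproduced here.

```python
import math

def layered_shortest_path(layers):
    """Find the cheapest chain of candidates through a layered graph.

    layers: list of lists of (x, y) candidates, one list per path waypoint.
    Returns (total_cost, chosen_index_per_layer).
    """
    cost = [0.0] * len(layers[0])  # best cost to reach each node of the current layer
    back = []                      # back-pointers, one list per layer transition
    for prev, cur in zip(layers, layers[1:]):
        new_cost, pointers = [], []
        for q in cur:
            best_i = min(range(len(prev)),
                         key=lambda i: cost[i] + math.dist(prev[i], q))
            new_cost.append(cost[best_i] + math.dist(prev[best_i], q))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)
    # Pick the cheapest end node, then walk the back-pointers to the first layer.
    j = min(range(len(cost)), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path
```

Because edges only connect consecutive layers, one forward sweep is enough to obtain the optimum within the discretized graph; a continuous refinement stage would then start from the selected candidates.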
Foundational World Models Accurately Detect Bimanual Manipulator Failures
Deploying visuomotor robots at scale is challenging due to the potential for anomalous failures to degrade performance, cause damage, or endanger human life. Bimanual manipulators are no exception; these robots have vast state spaces comprised of high-dimensional images and proprioceptive signals. Explicitly defining failure modes within such state spaces is infeasible. In this work, we overcome these challenges by training a probabilistic, history-informed world model within the compressed latent space of a pretrained vision foundation model (NVIDIA's Cosmos Tokenizer). The model outputs uncertainty estimates alongside its predictions that serve as non-conformity scores within a conformal prediction framework. We use these scores to develop a runtime monitor, correlating periods of high uncertainty with anomalous failures. To test these methods, we use the simulated Push-T environment and the Bimanual Cable Manipulation dataset, the latter of which we introduce in this work. This new dataset features trajectories with multiple synchronized camera views, proprioceptive signals, and annotated failures from a challenging data center maintenance task. We benchmark our methods against baselines from the anomaly detection and out-of-distribution detection literature, and show that our approach considerably outperforms statistical techniques. Furthermore, we show that our approach requires approximately one twentieth of the trainable parameters of the next-best learning-based approach, yet outperforms it by 3.8% in terms of failure detection rate, paving the way toward safely deploying manipulator robots in real-world environments where reliability is non-negotiable.
comment: 8 pages, 5 figures, accepted at the 2026 IEEE International Conference on Robotics and Automation
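The runtime monitor in the abstract above treats model uncertainty as a non-conformity score inside a conformal prediction framework. A minimal sketch of the standard split-conformal calibration step follows; it assumes a held-out set of scores from nominal (failure-free) rollouts and is not the authors' implementation. The function names and the default `alpha` are illustrative.

```python
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.05):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = scores.size
    k = int(np.ceil((n + 1) * (1.0 - alpha)))  # 1-indexed rank
    if k > n:
        return np.inf  # too few calibration samples for this alpha
    return scores[k - 1]

def flag_failures(runtime_scores, threshold):
    """Mark timesteps whose uncertainty score exceeds the calibrated threshold."""
    return np.asarray(runtime_scores) > threshold
```

Under the usual exchangeability assumption, a nominal score exceeds this threshold with probability at most `alpha`, which is what lets sustained exceedances be read as anomalies.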
VSL-Skin: Individually Addressable Phase-Change Voxel Skin for Variable-Stiffness and Virtual Joints Bridging Soft and Rigid Robots ICRA 2026
Soft robots are compliant but often cannot support loads or hold their shape, while rigid robots provide structural strength but are less adaptable. Existing variable-stiffness systems usually operate at the scale of whole segments or patches, which limits precise control over stiffness distribution and virtual joint placement. This paper presents the Variable Stiffness Lattice Skin (VSL-Skin), the first system to enable individually addressable voxel-level morphological control with centimeter-scale precision. The system provides three main capabilities: nearly two orders of magnitude stiffness modulation across axial (15-1200 N/mm), shear (45-850 N/mm), bending (8*10^2 - 3*10^4 N/deg), and torsional modes with centimeter-scale spatial control; the first demonstrated 30% axial compression in phase-change systems while maintaining structural integrity; and autonomous component-level self-repair through thermal cycling, which eliminates fatigue accumulation and enables programmable sacrificial joints for predictable failure management. Selective voxel activation creates six canonical virtual joint types with programmable compliance while preserving structural integrity in non-activated regions. The platform incorporates closed-form design models and finite element analysis for predictive synthesis of stiffness patterns and joint placement. Experimental validation demonstrates 30% axial contraction, thermal switching in 75-second cycles, and cut-to-fit integration that preserves addressability after trimming. The row-column architecture enables platform-agnostic deployment across diverse robotic systems without specialized infrastructure. This framework establishes morphological intelligence as an engineerable system property and advances autonomous reconfigurable robotics.
comment: ICRA 2026
Energy-Efficient Collaborative Transport of Tether-Suspended Payloads via Rotating Equilibrium
Collaborative aerial transportation of tethered payloads is fundamentally limited by space, power, and weight constraints. Conventional approaches rely on static equilibrium conditions, where each vehicle tilts to generate the forces that ensure they maintain a formation geometry that avoids aerodynamic interactions and collision. This horizontal thrust component represents a significant energy penalty compared to the ideal case in which each vehicle produces purely vertical thrust to lift the payload. Operating in tighter tether configurations can minimize this effect, but at the cost of either having to fly the vehicles in closer proximity, which risks collision, or significantly increasing the length of the tether, which increases complexity and reduces potential use-cases. We propose operating the tether-suspended flying system at a rotating equilibrium. By maintaining steady circular motion, centrifugal forces provide the necessary horizontal tether tension, allowing each quadrotor to generate purely vertical thrust and thus reducing the total force (and power) required compared to an equilibrium where the thrusts are not vertical. It also allows for a wider range of tether configurations to be used without sacrificing efficiency. Results demonstrate that rotating equilibria can reduce power consumption relative to static lifting by up to 20%, making collaborative aerial solutions more practically relevant.
comment: 7 pages, 8 figures
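The force argument in the abstract above can be checked with a simplified point-mass model. The sketch assumes n identical vehicles of mass m_v on a circle of a given radius, a payload of mass m_p hanging on the central axis, massless tethers at angle theta from the vertical, and no aerodynamic drag. These modelling choices and all names are assumptions made here, not details taken from the paper.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def static_thrust(m_v, m_p, n, theta):
    """Per-vehicle thrust magnitude when hovering in static equilibrium."""
    vertical = (m_v + m_p / n) * G               # own weight plus share of the payload
    horizontal = (m_p / n) * G * math.tan(theta)  # cancels the tether's inward pull
    return math.hypot(vertical, horizontal)

def rotating_thrust(m_v, m_p, n):
    """Per-vehicle thrust when the inward tether pull supplies the centripetal force."""
    return (m_v + m_p / n) * G

def required_spin_rate(m_v, m_p, n, theta, radius):
    """Angular rate (rad/s) at which m_v * omega^2 * radius equals the tether's horizontal pull."""
    horizontal = (m_p / n) * G * math.tan(theta)
    return math.sqrt(horizontal / (m_v * radius))
```

In this model the thrust saving grows with the tether angle and vanishes for vertical tethers. Power does not scale linearly with thrust, so the ratio of these two thrusts should not be read as the power saving reported in the abstract.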
Is Your Safe Controller Actually Safe? A Critical Review of CBF Tautologies and Hidden Assumptions
This tutorial provides a critical review of the practical application of Control Barrier Functions (CBFs) in robotic safety. While the theoretical foundations of CBFs are well-established, I identify a recurring gap between the mathematical assumption of a safe controller's existence and its constructive realization in systems with input constraints. I highlight the distinction between candidate and valid CBFs by analyzing the interplay of system dynamics, actuation limits, and class-K functions. I further show that some purported demonstrations of safe robot policies or controllers are limited to passively safe systems, such as single integrators or kinematic manipulators, where safety is already inherited from the underlying physics and even naive geometric hard constraints suffice to prevent collisions. By revisiting simple low-dimensional examples, I show when CBF formulations provide valid safety guarantees and when they fail due to common misuses. I then provide practical guidelines for constructing realizable safety arguments for systems without such passive safety. The goal of this tutorial is to bridge the gap between theoretical guarantees and actual implementation, supported by an open-source interactive web demonstration that visualizes these concepts intuitively.
comment: Interactive web demo: https://cbf.taekyung.me
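The distinction between candidate and valid CBFs drawn in the abstract above shows up already in one dimension. The sketch below is an illustration written for this listing, not code from the tutorial or its web demo. It uses the barrier h(x) = x_max - x with a linear class-K function alpha * h. For the single integrator the CBF condition gives a closed-form filter; for the double integrator with bounded acceleration, the same h is positive at states from which no admissible input can stop in time.

```python
def filter_single_integrator(x, u_nom, x_max, alpha=1.0):
    """Single integrator x' = u with h(x) = x_max - x.

    The condition h' + alpha * h >= 0 reduces to u <= alpha * (x_max - x),
    so the minimally invasive safe input is a simple clamp.
    """
    return min(u_nom, alpha * (x_max - x))

def double_integrator_can_stop(x, v, x_max, a_max):
    """Double integrator x'' = u with |u| <= a_max.

    Full braking from speed v > 0 needs distance v^2 / (2 * a_max), so the
    state is recoverable only if that distance fits before x_max.
    """
    if v <= 0.0:
        return x <= x_max
    return x + v * v / (2.0 * a_max) <= x_max
```

At x = 0, v = 2, x_max = 1, a_max = 1 the barrier value is h = 1 > 0, yet `double_integrator_can_stop` returns False: the position-only barrier is a candidate CBF there, not a valid one under the input constraint.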
SAC-Loco: Safe and Adjustable Compliant Quadrupedal Locomotion
Quadruped robots are designed to achieve agile and robust locomotion by drawing inspiration from legged animals. However, most existing control methods for quadruped robots lack a key capacity observed in animals: the ability to exhibit diverse compliance behaviors while ensuring stability when experiencing external forces. In particular, achieving adjustable compliance while maintaining robust safety under force disturbances remains a significant challenge. In this work, we propose a safety-aware compliant locomotion framework that integrates adjustable disturbance compliance with robust failure prevention. We first train a force-compliant policy with adjustable compliance levels using a teacher-student reinforcement learning framework, allowing deployment without explicit force sensing. To handle disturbances beyond the limits of compliant control, we develop a safety-oriented policy for rapid recovery and stabilization. Finally, we introduce a learned safety critic that monitors the robot's safety in real time and coordinates between compliant locomotion and recovery behaviors. Together, this framework enables quadruped robots to achieve smooth force compliance and robust safety under a wide range of external force disturbances.
Vision-Guided Targeted Grasping and Vibration for Robotic Pollination in Controlled Environments
Robotic pollination offers a promising alternative to manual labor and bumblebee-assisted methods in controlled agriculture, where wind-driven pollination is absent and regulatory restrictions limit the use of commercial pollinators. In this work, we present and validate a vision-guided robotic framework that uses data from an end-effector mounted RGB-D sensor and combines 3D plant reconstruction, targeted grasp planning, and physics-based vibration modeling to enable precise pollination. First, the plant is reconstructed in 3D and registered to the robot coordinate frame to identify obstacle-free grasp poses along the main stem. Second, a discrete elastic rod model predicts the relationship between actuation parameters and flower dynamics, guiding the selection of optimal pollination strategies. Finally, a manipulator with soft grippers grasps the stem and applies controlled vibrations to induce pollen release. End-to-end experiments demonstrate a 92.5% main-stem grasping success rate, and simulation-guided optimization of vibration parameters further validates the feasibility of our approach, ensuring that the robot can safely and effectively perform pollination without damaging the flower. To our knowledge, this is the first robotic system to jointly integrate vision-based grasping and vibration modeling for automated precision pollination.
comment: YouTube: https://youtu.be/XHLA7pEXhZU; GitHub: https://github.com/StructuresComp/robotic-pollination
xTED: Cross-Domain Adaptation via Diffusion-Based Trajectory Editing
Reusing pre-collected data from different domains is an appealing solution for decision-making tasks, especially when data in the target domain are limited. Existing cross-domain policy transfer methods mostly aim at learning domain correspondences or corrections to facilitate policy learning, such as learning task/domain-specific discriminators, representations, or policies. This design philosophy often results in heavy model architectures or task/domain-specific modeling, lacking flexibility. This reality makes us wonder: can we directly bridge the domain gaps universally at the data level, instead of relying on complex downstream cross-domain policy transfer procedures? In this study, we propose the Cross-Domain Trajectory EDiting (xTED) framework that employs a specially designed diffusion model for cross-domain trajectory adaptation. Our proposed model architecture effectively captures the intricate dependencies among states, actions, and rewards, as well as the dynamics patterns within target data. Edited by adding noises and denoising with the pre-trained diffusion model, source domain trajectories can be transformed to align with target domain properties while preserving original semantic information. This process effectively corrects underlying domain gaps, enhancing state realism and dynamics reliability in source data, and allowing flexible integration with various single-domain and cross-domain downstream policy learning methods. Despite its simplicity, xTED demonstrates superior performance in extensive simulation and real-robot experiments.
comment: xTED offers a novel, generic, flexible, simple and effective paradigm that casts cross-domain policy adaptation as a data pre-processing problem
Safe Navigation of Bipedal Robots via Koopman Operator-Based Model Predictive Control
Nonlinearity in dynamics has long been a major challenge in robotics, often causing significant performance degradation in existing control algorithms. For example, the navigation of bipedal robots can exhibit nonlinear behaviors even under simple velocity commands, as their actual dynamics are governed by complex whole-body movements and discrete contacts. In this work, we propose a safe navigation framework inspired by Koopman operator theory. We first train a low-level locomotion policy using deep reinforcement learning, and then capture its low-frequency, base-level dynamics by learning linearized dynamics in a high-dimensional lifted space. Then, our model-predictive controller (MPC) efficiently optimizes control signals via a standard quadratic objective and the linear dynamics constraint in the lifted space. We demonstrate that the Koopman model more accurately predicts bipedal robot trajectories than baseline approaches. We also show that the proposed navigation framework achieves improved safety with better success rates in dense environments with narrow passages.
comment: 9 pages
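Learning linear dynamics in a lifted space, as described in the abstract above, is commonly done by least squares over lifted snapshots (extended dynamic mode decomposition with control). The sketch below shows that generic technique, not the paper's learned lifting. The polynomial `lift` function is a hypothetical stand-in for whatever observables are actually used.

```python
import numpy as np

def lift(x):
    """Hypothetical lifting: the state, its element-wise squares, and a constant."""
    x = np.atleast_2d(x)
    return np.hstack([x, x ** 2, np.ones((x.shape[0], 1))])

def fit_lifted_linear_model(X, U, X_next):
    """Least-squares fit of z_next ~= A z + B u in the lifted space.

    X, X_next: (samples, state_dim) arrays; U: (samples, input_dim) array.
    """
    Z, Z_next = lift(X), lift(X_next)
    features = np.hstack([Z, U])
    coeffs, *_ = np.linalg.lstsq(features, Z_next, rcond=None)
    n_z = Z.shape[1]
    A = coeffs[:n_z].T  # lifted state transition
    B = coeffs[n_z:].T  # input matrix
    return A, B
```

Once A and B are fixed, the prediction model is linear in the lifted state, which is what allows an MPC with a quadratic objective to be posed as a standard quadratic program.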
ActivePusher: Active Learning and Planning with Residual Physics for Nonprehensile Manipulation ICRA 2026
Planning with learned dynamics models offers a promising approach toward versatile real-world manipulation, particularly in nonprehensile settings such as pushing or rolling, where accurate analytical models are difficult to obtain. However, collecting training data for learning-based methods can be costly and inefficient, as it often relies on randomly sampled interactions that are not necessarily the most informative. Furthermore, learned models tend to exhibit high uncertainty in underexplored regions of the skill space, undermining the reliability of long-horizon planning. To address these challenges, we propose ActivePusher, a novel framework that combines residual-physics modeling with uncertainty-based active learning, to focus data acquisition on the most informative skill parameters. Additionally, ActivePusher seamlessly integrates with model-based kinodynamic planners, leveraging uncertainty estimates to bias control sampling toward more reliable actions. We evaluate our approach in both simulation and real-world environments, and demonstrate that it consistently improves data efficiency and achieves higher planning success rates in comparison to baseline methods. The source code is available at https://github.com/elpis-lab/ActivePusher.
comment: Accepted by the 2026 IEEE International Conference on Robotics & Automation (ICRA 2026)
Accelerating Robotic Reinforcement Learning with Agent Guidance
Reinforcement Learning (RL) offers a powerful paradigm for autonomous robots to master generalist manipulation skills through trial-and-error. However, its real-world application is stifled by low sample efficiency. Recent Human-in-the-Loop (HIL) methods accelerate training by using human corrections, yet this approach faces a scalability barrier. Reliance on human supervisors imposes a 1:1 supervision ratio that limits scalability, suffers from operator fatigue over extended sessions, and introduces high variance due to inconsistent human proficiency. We present Agent-guided Policy Search (AGPS), a framework that automates the training pipeline by replacing human supervisors with a multimodal agent. Our key insight is that the agent can be viewed as a semantic world model, injecting intrinsic value priors to structure physical exploration. By using tools, the agent provides precise guidance via corrective waypoints and spatial constraints for exploration pruning. We validate our approach on three tasks, ranging from precision insertion to deformable object manipulation. Results demonstrate that AGPS outperforms HIL methods in sample efficiency. This automates the supervision pipeline, unlocking the path to labor-free and scalable robot learning. Project website: https://agps-rl.github.io/agps/.
DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving ICLR 2026
Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.
comment: ICLR 2026 Poster; Project Website: https://drivinggen-bench.github.io/
Hybrid Diffusion Policies with Projective Geometric Algebra for Efficient Robot Manipulation Learning ICRA 2026
Diffusion policies are a powerful paradigm for robot learning, but their training is often inefficient. A key reason is that networks must relearn fundamental spatial concepts, such as translations and rotations, from scratch for every new task. To alleviate this redundancy, we propose embedding geometric inductive biases directly into the network architecture using Projective Geometric Algebra (PGA). PGA provides a unified algebraic framework for representing geometric primitives and transformations, allowing neural networks to reason about spatial structure more effectively. In this paper, we introduce hPGA-DP, a novel hybrid diffusion policy that capitalizes on these benefits. Our architecture leverages the Projective Geometric Algebra Transformer (P-GATr) as a state encoder and action decoder, while employing established U-Net or Transformer-based modules for the core denoising process. Through extensive experiments and ablation studies in both simulated and real-world environments, we demonstrate that hPGA-DP significantly improves task performance and training efficiency. Notably, our hybrid approach achieves substantially faster convergence compared to both standard diffusion policies and architectures that rely solely on P-GATr. The project website is available at: https://apollo-lab-yale.github.io/26-ICRA-hPGA-website/.
comment: Accepted to ICRA 2026
Automated Pest Counting in Water Traps through Active Robotic Stirring for Occlusion Handling
Existing image-based pest counting methods rely on single static images and often produce inaccurate results under occlusion. To address this issue, this paper proposes an automated pest counting method in water traps through active robotic stirring. First, an automated robotic arm-based stirring system is developed to redistribute pests and reveal occluded individuals for counting. Then, the effects of different stirring patterns on pest counting performance are investigated. Six stirring patterns are designed and evaluated across different pest density scenarios to identify the optimal one. Finally, a heuristic counting confidence-driven closed-loop control system is proposed for adaptive-speed robotic stirring, adjusting the stirring speed based on the average change rate of counting confidence between consecutive frames. Experimental results show that the four-circles pattern is the optimal stirring pattern, achieving the lowest overall mean absolute counting error of 4.384 and the highest overall mean counting confidence of 0.721. Compared with constant-speed stirring, adaptive-speed stirring reduces task execution time by up to 44.7% and achieves more stable performance across different pest density scenarios. Moreover, the proposed pest counting method reduces the mean absolute counting error by up to 3.428 compared to the single static image counting method under high-density scenarios where occlusion is severe.
ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation
Accurate 6-DoF object pose estimation and tracking are critical for reliable robotic manipulation. However, zero-shot methods often fail under viewpoint-induced ambiguities, and fixed-camera setups struggle when objects move or become self-occluded. To address these challenges, we propose an active pose estimation pipeline that combines a Vision-Language Model (VLM) with "robotic imagination" to dynamically detect and resolve ambiguities in real time. In an offline stage, we render a dense set of views of the CAD model, compute the FoundationPose entropy for each view, and construct a geometric-aware prompt that includes low-entropy (unambiguous) and high-entropy (ambiguous) examples. At runtime, the system: (1) queries the VLM on the live image for an ambiguity score; (2) if ambiguity is detected, imagines a discrete set of candidate camera poses by rendering virtual views, scores each based on a weighted combination of VLM ambiguity probability and FoundationPose entropy, and then moves the camera to the Next-Best-View (NBV) to obtain a disambiguated pose estimation. Furthermore, since moving objects may leave the camera's field of view, we introduce an active pose tracking module: a diffusion policy trained via imitation learning, which generates camera trajectories that preserve object visibility and minimize pose ambiguity. Experiments in simulation and in the real world show that our approach significantly outperforms classical baselines.
comment: 6D Pose, Diffusion Policy, Robot Learning
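The Next-Best-View selection described in the abstract above scores each imagined view by a weighted combination of two ambiguity signals. A minimal sketch of that selection rule follows. The tuple layout, the assumption that entropy is pre-normalized to [0, 1], and the `weight` parameter are illustrative choices made here; the paper's actual weighting is not given in the abstract.

```python
def select_next_best_view(candidates, weight=0.5):
    """Pick the candidate view with the lowest combined ambiguity.

    candidates: list of (view_id, vlm_ambiguity_prob, pose_entropy) tuples,
    with both scores assumed to lie in [0, 1].
    """
    def score(candidate):
        _, p_ambiguous, entropy = candidate
        return weight * p_ambiguous + (1.0 - weight) * entropy

    return min(candidates, key=score)[0]
```

With equal weights, a view scored (0.2, 0.3) is preferred over views scored (0.9, 0.8) and (0.1, 0.9), since a low value on only one of the two signals is not enough.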
Radio-based Multi-Robot Odometry and Relative Localization
Radio-based methods such as Ultra-Wideband (UWB) and RAdio Detection And Ranging (radar), which have traditionally seen limited adoption in robotics, are experiencing a boost in popularity thanks to their robustness to harsh environmental conditions and cluttered environments. This work proposes a multi-robot UGV-UAV localization system that leverages the two technologies with inexpensive and readily-available sensors, such as Inertial Measurement Units (IMUs) and wheel encoders, to estimate the relative position of an aerial robot with respect to a ground robot. The first stage of the system pipeline includes a nonlinear optimization framework to trilaterate the location of the aerial platform based on UWB range data, and a radar pre-processing module with loosely coupled ego-motion estimation which has been adapted for a multi-robot scenario. Then, the pre-processed radar data as well as the relative transformation are fed to a pose-graph optimization framework with odometry and inter-robot constraints. The system, implemented for the Robot Operating System (ROS 2) with the Ceres optimizer, has been validated in Software-in-the-Loop (SITL) simulations and in a real-world dataset. The proposed relative localization module outperforms state-of-the-art closed-form methods which are less robust to noise. Our SITL environment includes a custom Gazebo plugin for generating realistic UWB measurements modeled after real data. Conveniently, the proposed factor graph formulation makes the system readily extensible to full Simultaneous Localization And Mapping (SLAM). Finally, all the code and experimental data are publicly available to support reproducibility and to serve as a common open dataset for benchmarking.
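Trilateration from range data by nonlinear optimization, the first stage named in the abstract above, can be sketched with a few Gauss-Newton iterations. This is a generic illustration in NumPy rather than the paper's Ceres-based implementation. It assumes known anchor positions, noise-free handling of outliers, and a reasonable initial guess, and it works for 2D or 3D points.

```python
import numpy as np

def trilaterate(anchors, ranges, initial_guess, iterations=20):
    """Estimate a position from ranges to known anchors via Gauss-Newton.

    anchors: (n, d) anchor positions; ranges: (n,) measured distances.
    """
    anchors = np.asarray(anchors, dtype=float)
    ranges = np.asarray(ranges, dtype=float)
    p = np.asarray(initial_guess, dtype=float).copy()
    for _ in range(iterations):
        diffs = p - anchors                                      # (n, d)
        dists = np.maximum(np.linalg.norm(diffs, axis=1), 1e-9)  # predicted ranges
        residuals = dists - ranges
        jacobian = diffs / dists[:, None]                        # d(dist)/d(p)
        step, *_ = np.linalg.lstsq(jacobian, -residuals, rcond=None)
        p += step
    return p
```

A closed-form linearization can solve the same problem without iteration; the iterative least-squares form is shown because it is the kind of formulation that extends naturally to robust losses and additional constraints.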
ELHPlan: Efficient Long-Horizon Task Planning for Multi-Agent Collaboration
Large Language Models (LLMs) enable intelligent multi-robot collaboration but face fundamental trade-offs: open-loop methods that compile tasks into formal representations for external executors produce sound plans but lack adaptability in partially observable environments, while iterative methods incur prohibitive computational costs that scale poorly with team size and task complexity. In this paper, we propose Efficient Long-Horizon Planning (ELHPlan), a novel framework that introduces Action Chains, sequences of actions explicitly bound to sub-goal intentions, as the fundamental planning primitive. ELHPlan operates via a cyclical process: 1) constructing intention-bound action sequences, 2) proactively validating for conflicts and feasibility, 3) refining issues through targeted mechanisms, and 4) executing validated actions. This design balances adaptability and efficiency by providing intention-bound action sequences with longer lookahead while avoiding expensive full re-planning. We further advocate comprehensive efficiency metrics, including token consumption and planning time, to more holistically evaluate multi-agent collaboration. Our experiments on benchmarks TDW-MAT and C-WAH demonstrate that ELHPlan achieves comparable task success rates while consuming only 30-40% of the tokens required by state-of-the-art methods. Our research establishes a new efficiency-effectiveness frontier for LLM-based multi-agent planning systems.
EB-MBD: Emerging-Barrier Model-Based Diffusion for Safe Trajectory Optimization in Highly Constrained Environments ICRA 2026
We propose enforcing constraints on Model-Based Diffusion by introducing emerging barrier functions inspired by interior point methods. We demonstrate that the standard Model-Based Diffusion algorithm can lead to catastrophic performance degradation in highly constrained environments, even on simple 2D systems, due to sample inefficiency in the Monte Carlo approximation of the score function. We introduce Emerging-Barrier Model-Based Diffusion (EB-MBD), which uses progressively introduced barrier constraints to avoid these problems, significantly improving solution quality without expensive projection operations. We analyze the liveliness of samples at each iteration to inform the choice of barrier parameter schedule. We demonstrate results for 2D collision avoidance and a 3D underwater manipulator system and show that our method achieves lower cost solutions than Model-Based Diffusion, and requires orders of magnitude less computation time than projection-based methods.
comment: Accepted to ICRA 2026. Code available at https://github.com/acfr/emerging_barrier_mbd
Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface enabling a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
comment: 22 pages, 14 figures
Agile in the Face of Delay: Asynchronous End-to-End Learning for Real-World Aerial Navigation
Robust autonomous navigation for Autonomous Aerial Vehicles (AAVs) in complex environments is a critical capability. However, modern end-to-end navigation faces a key challenge: the high-frequency control loop needed for agile flight conflicts with low-frequency perception streams, which are limited by sensor update rates and significant computational cost. This mismatch forces conventional synchronous models into undesirably low control rates. To resolve this, we propose an asynchronous reinforcement learning framework that decouples perception and control, enabling a high-frequency policy to act on the latest IMU state for immediate reactivity, while incorporating perception features asynchronously. To manage the resulting data staleness, we introduce a theoretically-grounded Temporal Encoding Module (TEM) that explicitly conditions the policy on perception delays, a strategy complemented by a two-stage curriculum to ensure stable and efficient training. Validated in extensive simulations, our method was successfully deployed in zero-shot sim-to-real transfer on an onboard NUC, where it sustains a 100 Hz control rate and demonstrates robust, agile navigation in cluttered real-world environments. Our source code will be released for community reference.
SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, because it relies on an LLM and numerous visual tokens from sensor inputs, this approach requires substantial computational resources, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
Bio-inspired tail oscillation enables robot fast crawling on deformable granular terrains
Deformable substrates such as sand and mud present significant challenges for terrestrial robots due to complex robot-terrain interactions. Inspired by mudskippers, amphibious animals that naturally adjust their tail morphology and movement jointly to navigate such environments, we investigate how tail design and control can jointly enhance flipper-driven locomotion on granular media. Using a bio-inspired robot modeled after the mudskipper, we experimentally compared locomotion performance between idle and actively oscillating tail configurations. Tail oscillation increased robot speed by 67% and reduced body drag by 46%. Shear force measurements revealed that this improvement was enabled by tail oscillation fluidizing the substrate, thereby reducing resistance. Additionally, tail morphology strongly influenced the oscillation strategy: designs with larger horizontal surface areas leveraged the oscillation-reduced shear resistance more effectively by limiting insertion depth. Based on these findings, we present a design principle to inform tail action selection based on substrate strength and tail morphology. Our results offer new insights into tail design and control for improving robot locomotion on deformable substrates, with implications for agricultural robotics, search and rescue, and environmental exploration.
CDE: Concept-Driven Exploration for Reinforcement Learning
Intelligent exploration remains a critical challenge in reinforcement learning (RL), especially in visual control tasks. Unlike low-dimensional state-based RL, visual RL must extract task-relevant structure from raw pixels, making exploration inefficient. We propose Concept-Driven Exploration (CDE), which leverages a pre-trained vision-language model (VLM) to generate object-centric visual concepts from textual task descriptions as weak, potentially noisy supervisory signals. Rather than directly conditioning on these noisy signals, CDE trains a policy to reconstruct the concepts via an auxiliary objective, learning general representations of the concepts and using reconstruction accuracy as an intrinsic reward to guide exploration toward task-relevant objects. Across five challenging simulated visual manipulation tasks, CDE achieves efficient, targeted exploration and remains robust to both synthetic errors and noisy VLM predictions. Finally, we demonstrate real-world transfer by deploying CDE on a Franka arm, attaining an 80% success rate in a real-world manipulation task.
comment: Preprint
VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
Navigating unseen, large-scale environments based on complex and abstract human instructions remains a formidable challenge for autonomous mobile robots. Addressing this requires robots to infer implicit semantics and efficiently explore large-scale task spaces. However, existing methods, ranging from end-to-end learning to foundation model-based modular architectures, often lack the capability to decompose complex tasks or employ efficient exploration strategies, leading to aimless wandering or target recognition failures. To address these limitations, we propose VL-Nav, a neuro-symbolic (NeSy) vision-language navigation system. The proposed system intertwines neural reasoning with symbolic guidance through two core components: (1) a NeSy task planner that leverages a symbolic 3D scene graph and image memory system to enhance the vision language models' (VLMs) neural reasoning capabilities for task decomposition and replanning; and (2) a NeSy exploration system that couples neural semantic cues with a symbolic heuristic function to efficiently gather task-related information while minimizing unnecessary repeat travel during exploration. Validated on the DARPA TIAMAT Challenge navigation tasks, our system achieved an 83.4% success rate (SR) in indoor environments and 75% in outdoor scenarios. VL-Nav achieved an 86.3% SR in real-world experiments, including a challenging 483-meter run. Finally, we validate the system with complex instructions in a 3D multi-floor scenario.
Diffusion Stabilizer Policy for Automated Surgical Robot Manipulations ICRA 2026
Intelligent surgical robots have the potential to revolutionize clinical practice by enabling more precise and automated surgical procedures. However, the automation of such robots for surgical tasks remains under-explored compared to recent advancements in solving household manipulation tasks. These successes have been largely driven by (1) advanced models, such as transformers and diffusion models, and (2) large-scale data utilization. Aiming to extend these successes to the domain of surgical robotics, we propose a diffusion-based policy learning framework, called Diffusion Stabilizer Policy (DSP), which enables training with imperfect or even failed trajectories. Our approach consists of two stages: first, we train the diffusion stabilizer policy using only clean data. Then, the policy is continuously updated using a mixture of clean and perturbed data, with filtering based on the prediction error on actions. Comprehensive experiments conducted in various surgical environments demonstrate the superior performance of our method in perturbation-free settings and its robustness when handling perturbed demonstrations.
comment: ICRA 2026
Vectorized Online POMDP Planning ICRA 2026
Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for such planning problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least $20\times$ more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is $1000\times$ smaller.
comment: 8 pages, 3 figures. Accepted at ICRA 2026
RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models
Vision-Language-Action (VLA) models have demonstrated robust performance across diverse robotic tasks. However, their high memory and computational demands often limit real-time deployment. While existing model compression techniques reduce the parameter footprint, they often degrade 3D spatial reasoning and scene layout understanding. This work introduces RetoVLA, an architecture designed to maintain spatial awareness in lightweight models by repurposing Register Tokens, learnable parameters originally introduced to mitigate attention artifacts in Vision Transformers. While these tokens are generally discarded once used, we repurpose them for their dense representation of global spatial context. RetoVLA integrates these recycled tokens directly into the action-planning module through a dedicated spatial context injection path. Our proposed design enables the recovery of global context without increasing the total parameter count. Real-world experiments using a 7-DOF manipulator show a 17.1 percentage-point improvement in average success rates over the baseline. Our results demonstrate that leveraging internal register tokens provides a highly effective mechanism for developing efficient, spatially-aware robotic agents. A video demonstration is available at: https://youtu.be/2CseBR-snZg
Multiagent Systems
Learning When to Cooperate Under Heterogeneous Goals
A significant element of human cooperative intelligence lies in our ability to identify opportunities for fruitful collaboration; and conversely to recognise when the task at hand is better pursued alone. Research on flexible cooperation in machines has left this meta-level problem largely unexplored, despite its importance for successful collaboration in heterogeneous open-ended environments. Here, we extend the typical Ad Hoc Teamwork (AHT) setting to incorporate the idea of agents having heterogeneous goals that in any given scenario may or may not overlap. We introduce a novel approach to learning policies in this setting, based on a hierarchical combination of imitation and reinforcement learning, and show that it outperforms baseline methods across extended versions of two cooperative environments. We also investigate the contribution of an auxiliary component that learns to model teammates by predicting their actions, finding that its effect on performance is inversely related to the amount of observable information about teammate goals.
NarrativeLoom: Enhancing Creative Storytelling through Multi-Persona Collaborative Improvisation
Large Language Models show promise for AI-assisted storytelling, yet current tools often generate predictable, unoriginal narratives. To address this limitation, we present NarrativeLoom, a multi-persona co-creative system grounded in Campbell's Blind Variation and Selective Retention theory. NarrativeLoom deploys specialized AI personas to generate diverse narrative options (blind variation), while users act as creative directors to select and refine them (selective retention). We designed a controlled study with 50 participants and found that stories co-authored with NarrativeLoom were not only perceived by users as more novel and diverse but were also objectively rated by experts as significantly better across all Torrance Test creativity dimensions: fluency, flexibility, originality, and elaboration. Stories are significantly longer with richer settings and more dialogue. Writing expertise emerged as a moderator: novices benefited more from structured scaffolding. This demonstrates the value of theory-informed co-creative systems and the importance of adapting them to varying user expertise.
comment: 19 pages, 10 figures
Randomise Alone, Reach as a Team
We study concurrent graph games where n players cooperate against an opponent to reach a set of target states. Unlike traditional settings, we study distributed randomisation: team players do not share a source of randomness, and their private random sources are hidden from the opponent and from each other. We show that memoryless strategies are sufficient for the threshold problem (deciding whether there is a strategy for the team that ensures winning with probability that exceeds a threshold), a result that not only places the problem in the Existential Theory of the Reals ($\exists\mathbb{R}$) but also enables the construction of value iteration algorithms. We additionally show that the threshold problem is NP-hard. For the almost-sure reachability problem, we prove NP-completeness. We introduce Individually Randomised Alternating-time Temporal Logic (IRATL). This logic extends the standard ATL framework to reason about probability thresholds, with semantics explicitly designed for coalitions that lack a shared source of randomness. On the practical side, we implement and evaluate a solver for the threshold and almost-sure problems based on the algorithms that we develop.
comment: 50 pages, 7 figures
Electoral Systems Simulator: An Open Framework for Comparing Electoral Mechanisms Across Voter Distribution Scenarios
Here we present electoral_sim, an open-source Python framework for simulating and comparing electoral systems across diverse voter preference distributions. The framework represents voters and candidates as points in a two-dimensional ideological space, derives sincere ballot profiles from Euclidean preference distances, and evaluates several standard electoral mechanisms -- including plurality, ranked-choice, approval, score, Condorcet, and two proportional representation systems -- against a common primary metric: the Euclidean distance between the electoral outcome and the geometric median of the voter distribution. We evaluate these systems across many empirically-grounded scenarios ranging from unimodal consensus electorates to sharply polarised bimodal configurations, reporting both single-run and Monte Carlo stability results across 200 trials per scenario. As a case study in framework extensibility, we implement and evaluate a novel hypothetical mechanism that is not currently implemented in any jurisdiction -- in which each voter's influence is distributed across candidates via a Boltzmann softmax kernel. This system is included as a theoretical benchmark characterising an approximate upper bound on centroid-seeking performance, rather than as a policy proposal. All code is released publicly at https://github.com/mukhes3/electoral_sim.
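The evaluation loop described in this abstract can be illustrated with a minimal standard-library sketch. It is not taken from the electoral_sim code base: the function names `plurality_winner`, `geometric_median`, and `outcome_distance` are hypothetical, only the plurality mechanism is shown, and the geometric median is approximated with a plain Weiszfeld iteration, which is an assumption about how such a metric could be computed.

```python
import math

def plurality_winner(voters, candidates):
    """Sincere plurality: each voter backs the nearest candidate (Euclidean).

    Ties in the final tally are broken by lowest candidate index.
    """
    tallies = [0] * len(candidates)
    for v in voters:
        nearest = min(range(len(candidates)),
                      key=lambda i: math.dist(v, candidates[i]))
        tallies[nearest] += 1
    return max(range(len(candidates)), key=lambda i: tallies[i])

def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld iteration, started from the centroid (hypothetical helper)."""
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        wsum = wx = wy = 0.0
        for px, py in points:
            # Clamp the distance so a voter sitting on the estimate
            # does not cause a division by zero.
            d = max(math.hypot(px - x, py - y), eps)
            wsum += 1.0 / d
            wx += px / d
            wy += py / d
        x, y = wx / wsum, wy / wsum
    return (x, y)

def outcome_distance(voters, candidates):
    """Distance between the plurality winner and the voters' geometric median."""
    winner = candidates[plurality_winner(voters, candidates)]
    return math.dist(winner, geometric_median(voters))

# Toy electorate: four voters whose geometric median is (1, 0).
voters = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0)]
candidates = [(0.9, 0.0), (3.0, 3.0)]
print(plurality_winner(voters, candidates), outcome_distance(voters, candidates))
```

Other mechanisms would replace `plurality_winner` while keeping the same distance metric, which is what makes a single primary metric convenient for comparison.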
Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking
In decentralized Multi-Agent Reinforcement Learning (MARL), steganographic collusion -- where agents develop private protocols to evade monitoring -- presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA-based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the Advantage function $A^\pi$, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner's Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms "Semantic Degradation," where high-frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a "Transparency Paradox" where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions, reflecting Goodhart's Law. This task-agnostic methodology provides a technical path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing for autonomous systems.
comment: 38 pages, includes 5 figures and 8 tables, preliminary version, AI safety / multi-agent reinforcement learning
Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization AAMAS 2026
In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by indirectly estimating values via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme. This scheme improves credit assignment for off-policy trajectories by balancing sensitivity to the agent's own policy changes with robustness to non-stationarity from other agents. Experiments on benchmarks demonstrate that our approach outperforms existing approaches, excelling in coordination and sample efficiency for complex scenarios.
comment: Accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Systems and Control (EESS)
Multi-Agentic AI for Conflict-Aware rApp Policy Orchestration in Open RAN
Open Radio Access Network (RAN) enables flexible, AI-driven control of mobile networks through disaggregated, multi-vendor components. In this architecture, xApps handle real-time functions, whereas rApps in the non-real-time controller generate strategic policies. However, current rApp development remains largely manual, brittle, and poorly scalable as xApp diversity proliferates. In this work, we propose a Multi-Agentic AI framework to automate rApp policy generation and orchestration. The architecture integrates three specialized large language model (LLM)-based agents, Perception, Reasoning, and Refinement, supported by retrieval-augmented generation (RAG) and memory-based analogical reasoning. These agents collectively analyze potential conflicts, synthesize intent-aligned control pipelines, and incrementally refine deployment decisions. Experiments across diverse deployment scenarios demonstrate that the proposed system achieves over 70% improvement in deployment accuracy and 95% reduction in reasoning cost compared to baseline methods, while maintaining zero-shot generalization to unseen intents. These results establish a scalable and conflict-aware solution for fully autonomous, zero-touch rApp orchestration in Open RAN.
Towards Network-Aware Operation of Integrated Energy Systems: A Comprehensive Review
Integrated Energy Systems (IES) are systems of interconnected electricity, gas, heating, and cooling networks, where the carriers interact and depend on one another. Beyond these core vectors, IES may also incorporate additional infrastructures, such as hydrogen, transportation and water networks, whenever sector coupling or cross-vector exchanges are relevant. Although modern cities already function as multi-energy systems, these networks are still planned and operated in isolation, which leads to inefficiencies and unused flexibility. As distributed energy resources (DERs) grow, local coupling among electricity, heating, and gas networks becomes stronger, so coordinated operation across carriers and infrastructures is essential. IES can improve efficiency, flexibility, and renewable integration, yet operation is challenging because of complex interdependencies, non-convex behaviors, and multi-scale dynamics of the energy networks. A key point that the literature often overlooks is the explicit role of network constraints and topology, which shape feasible operating regions, affect scalability, and determine how uncertainty and formal guarantees can be addressed. This review provides a first comprehensive analysis of network-aware modeling, optimization, and control methods for IES. We identify methodological limitations related to tractability, feasibility guarantees, and scalability. Building on these insights, we outline research directions that include distributed optimization with theoretical guarantees and control approaches informed by operational data. The review offers a foundation for scalable, network-aware operational frameworks for future low-carbon energy systems.
comment: Submitted to Proceedings of the IEEE (review). 23 pages, 6 figures, 2 tables. Includes IEEE preprint notice; network-aware IES survey; v2 fixes references
Reinforcement Learning for Vehicle-to-Grid Voltage Regulation: Single-Hub to Multi-Hub Coordination with Battery-Aware Constraints
This paper presents a Vehicle-to-Grid (V2G) coordination framework using reinforcement learning (RL). An intelligent control strategy based on the soft actor-critic algorithm is developed for voltage regulation through single and multi-hub charging systems while respecting realistic fleet constraints. A two-phase training approach integrates stability-focused learning with battery-aware deployment to ensure practical feasibility. Simulation studies on the IEEE 34-bus system validate the framework against a standard Volt-Var/Volt-Watt droop controller. Results indicate that the RL agent achieves performance comparable to the baseline control strategy in nominal scenarios. Under aggressive overloading, it provides robust voltage recovery (within 10% of the baseline) while prioritizing fleet availability and state-of-charge preservation, demonstrating the viability of constraint-aware learning for critical grid services.
Tutorial on Aided Inertial Navigation Systems: A Modern Treatment Using Lie-Group Theoretical Methods
This tutorial presents a control-oriented introduction to aided inertial navigation systems using a Lie-group formulation centered on the extended Special Euclidean group SE_2(3). The focus is on developing a clear and implementation-oriented geometric framework for fusing inertial measurements with aiding information, while making the role of invariance and symmetry explicit. Recent extensions, including higher-order state representations, synchronous observer designs, and equivariant filtering methods, are discussed as natural continuations of the same underlying principles. The goal is to provide readers with a coherent system-theoretic perspective that supports both understanding and practical use of modern aided inertial navigation methods.
ACLM: ADMM-Based Distributed Model Predictive Control for Collaborative Loco-Manipulation
Collaborative transportation of heavy payloads via loco-manipulation is a challenging yet essential capability for legged robots operating in complex, unstructured environments. Centralized planning methods, e.g., holistic trajectory optimization, capture dynamic coupling among robots and payloads but scale poorly with system size, limiting real-time applicability. In contrast, hierarchical and fully decentralized approaches often neglect force and dynamic interactions, leading to conservative behavior. This study proposes an Alternating Direction Method of Multipliers (ADMM)-based distributed model predictive control framework for collaborative loco-manipulation with a team of quadruped robots with manipulators. By exploiting the payload-induced coupling structure, the global optimal control problem is decomposed into parallel individual-robot-level subproblems with consensus constraints. The distributed planner operates in a receding-horizon fashion and achieves fast convergence, requiring only a few ADMM iterations per planning cycle. A wrench-aware whole-body controller executes the planned trajectories, tracking both motion and interaction wrenches. Extensive simulations with up to four robots demonstrate scalability, real-time performance, and robustness to model uncertainty.
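The decomposition into per-robot subproblems tied together by consensus constraints follows the general pattern of consensus ADMM. The sketch below shows that pattern on a deliberately tiny scalar problem. It is not the paper's MPC formulation: each "robot" i is assumed to have the toy cost 0.5 * (x - a_i)^2, and the names `consensus_admm`, `targets`, and `rho` are hypothetical.

```python
def consensus_admm(targets, rho=1.0, iters=60):
    """Global-variable consensus ADMM on a toy scalar problem.

    Agent i minimises 0.5 * (x_i - targets[i])**2 subject to x_i == z.
    The consensus value z converges to the mean of the targets.
    """
    n = len(targets)
    x = [0.0] * n   # local copies, one per agent
    u = [0.0] * n   # scaled dual variables
    z = 0.0         # shared consensus variable
    for _ in range(iters):
        # Local step: each agent solves its own subproblem in closed form.
        # These updates are independent and could run in parallel.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: average the local proposals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x

z, x = consensus_admm([1.0, 2.0, 6.0])
print(z)  # approaches the mean of the targets, 3.0
```

In a receding-horizon setting the local step would be a per-robot trajectory optimization rather than a closed-form update, and only a few such iterations would be run per planning cycle, as the abstract reports.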
Statistical Contraction for Chance-Constrained Trajectory Optimization of Non-Gaussian Stochastic Systems
This paper presents a novel method for distribution-free robust trajectory optimization and control of discrete-time, nonlinear, and non-Gaussian stochastic systems, with closed-loop guarantees on chance constraint satisfaction. Our framework employs conformal inference to generate coverage-based confidence sets for the closed-loop dynamics around arbitrary reference trajectories, by constructing a joint nonconformity score to quantify both the validity of contraction (i.e., incremental stability) conditions and the impact of external stochastic disturbance on the closed-loop dynamics, without any distributional assumptions. Via appropriate constraint tightening, chance constraints can be reformulated into tractable, statistically valid deterministic constraints on the reference trajectories. This enables a formal pathway to leverage and validate learning-based motion planners and controllers, such as those with neural contraction metrics, in safety-critical real-world applications. Notably, our statistical guarantees are non-diverging and can be computed with finite samples of the underlying uncertainty, without overly conservative structural priors. We demonstrate our approach in motion planning problems for designing safe, dynamically feasible trajectories in both numerical simulation and hardware experiments.
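The coverage-based confidence sets mentioned above rest on a finite-sample quantile of nonconformity scores. The sketch below shows only that generic split-conformal step. The paper's joint nonconformity score is not reproduced here: `scores` stands for any exchangeable calibration scores, and `conformal_threshold` and `tightened_bound` are hypothetical helper names.

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal quantile of calibration scores.

    Returns q such that a fresh exchangeable score is <= q with
    probability at least 1 - alpha. With too few samples for the
    requested level, the only valid threshold is infinity.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def tightened_bound(original_bound, scores, alpha):
    """Shrink a deterministic constraint bound by the conformal margin."""
    return original_bound - conformal_threshold(scores, alpha)

# Toy calibration set: tracking-error magnitudes from 19 rollouts.
scores = [0.01 * i for i in range(1, 20)]
print(conformal_threshold(scores, alpha=0.25))
print(tightened_bound(1.0, scores, alpha=0.25))
```

Under this reading, a chance constraint on the closed-loop state becomes a deterministic constraint on the reference trajectory with the bound reduced by the conformal margin.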
GuideTWSI: A Diverse Tactile Walking Surface Indicator Dataset from Synthetic and Real-World Images for Blind and Low-Vision Navigation
Tactile Walking Surface Indicators (TWSIs) are safety-critical landmarks that blind and low-vision (BLV) pedestrians use to locate crossings and hazard zones. From our observation sessions with BLV guide dog handlers, trainers, and an O&M specialist, we confirmed the critical importance of reliable and accurate TWSI segmentation for navigation assistance of BLV individuals. Achieving such reliability requires large-scale annotated data. However, TWSIs are severely underrepresented in existing urban perception datasets, and even existing dedicated paving datasets are limited: they lack robot-relevant viewpoints (e.g., egocentric or top-down) and are geographically biased toward East Asian directional bars - raised parallel strips used for continuous guidance along sidewalks. This narrow focus overlooks truncated domes - rows of round bumps used primarily in North America and Europe as detectable warnings at curbs, crossings, and platform edges. As a result, models trained only on bar-centric data struggle to generalize to dome-based warnings, leading to missed detections and false stops in safety-critical environments.
Animating Petascale Time-varying Data on Commodity Hardware with LLM-assisted Scripting
Scientists face significant visualization challenges as time-varying datasets grow in speed and volume, often requiring specialized infrastructure and expertise to handle massive datasets. Petascale climate models generated in NASA laboratories require a dedicated group of graphics and media experts and access to high-performance computing resources. Scientists may need to share scientific results with the community iteratively and quickly. However, the time-consuming trial-and-error process incurs significant data transfer overhead and far exceeds the time and resources allocated for typical post-analysis visualization tasks, disrupting the production workflow. Our paper introduces a user-friendly framework for creating 3D animations of petascale, time-varying data on a commodity workstation. Our contributions are: (i) a Generalized Animation Descriptor (GAD) with a keyframe-based adaptable abstraction for animation, (ii) efficient data access from cloud-hosted repositories to reduce data management overhead, (iii) a tailored rendering system, and (iv) an LLM-assisted conversational interface as a scripting module to allow domain scientists with no visualization expertise to create animations of their region of interest. We demonstrate the framework's effectiveness with two case studies: first, by generating animations in which sampling criteria are specified based on prior knowledge, and second, by generating AI-assisted animations in which sampling parameters are derived from natural-language user prompts. In all cases, we use large-scale NASA climate-oceanographic datasets that exceed 1PB in size yet achieve a fast turnaround time of 1 minute to 2 hours. Users can generate a rough draft of the animation within minutes, then seamlessly incorporate as much high-resolution data as needed for the final version.
comment: Will appear in ICBDA 2026. N.B. Due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract here is shorter than that in the PDF file
Communication Network-Aware Missing Data Recovery for Enhanced Distribution Grid Visibility
Power distribution systems increasingly rely on dense sensor networks for real-time monitoring, yet unreliable communication links and equipment malfunctions often result in missing or incomplete measurement sets at the operating center, requiring accurate data recovery techniques. Most existing approaches operate solely on the available measurements and overlook the role of the communication network that delivers sensor data, leading to large, spatially correlated losses when multiple sensors share failing communication links. This paper proposes a communication-aware framework that integrates routing constraints with low-rank matrix completion to improve data recovery accuracy under communication failures. Sensors are grouped into balanced clusters, and routing paths are designed to limit intracluster sensors sharing a common communication path, preventing complete data loss within any cluster. The remaining measurements for each cluster are then recovered using an optimal singular value thresholding (OSVT) method. Simulation results on the IEEE standard test feeder with real-world data demonstrate that the proposed framework significantly improves recovery accuracy compared to communication-agnostic, measurement-only methods.
comment: 5 pages, 6 figures. Accepted for presentation at IEEE Power & Energy Society General Meeting (PESGM) 2026
Topology-Aware Reinforcement Learning over Graphs for Resilient Power Distribution Networks
Extreme weather events and cyberattacks can cause component failures and disrupt the operation of power distribution networks (DNs), during which reconfiguration and load shedding are often adopted for resilience enhancement. This study introduces a topology-aware graph reinforcement learning (RL) framework for outage management that embeds higher-order topological features of the DN into a graph-based RL model, enabling reconfiguration and load shedding to maximize energy supply while maintaining operational stability. Results on the modified IEEE 123-bus feeder across 300 diverse outage scenarios demonstrate that incorporating the topological data analysis (TDA) tool, persistent homology (PH), yields 9-18% higher cumulative rewards, up to a 6% increase in power delivery, and 6-8% fewer voltage violations compared to a baseline graph-RL model. These findings highlight the potential of integrating RL with TDA to enable self-healing in DNs, facilitating fast, adaptive, and automated restoration.
A SISA-based Machine Unlearning Framework for Power Transformer Inter-Turn Short-Circuit Fault Localization
In practical data-driven applications on electrical equipment fault diagnosis, training data can be poisoned by sensor failures, which can severely degrade the performance of machine learning (ML) models. However, once the ML model has been trained, removing the influence of such harmful data is challenging, as full retraining is both computationally intensive and time-consuming. To address this challenge, this paper proposes a SISA (Sharded, Isolated, Sliced, and Aggregated)-based machine unlearning (MU) framework for power transformer inter-turn short-circuit fault (ITSCF) localization. The SISA method partitions the training data into shards and slices, ensuring that the influence of each data point is isolated within specific constituent models through independent training. When poisoned data are detected, only the affected shards are retrained, avoiding retraining the entire model from scratch. Experiments on simulated ITSCF conditions demonstrate that the proposed framework achieves almost identical diagnostic accuracy to full retraining, while reducing retraining time significantly.
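The shard-level isolation described above can be sketched with a toy ensemble. This is an illustration under strong assumptions, not the paper's fault-localization model: each constituent model is a 1-D nearest-class-mean classifier, slicing (incremental checkpoints within a shard) is omitted, and the names `ShardedEnsemble`, `train_centroids`, and `unlearn` are hypothetical.

```python
from collections import Counter

def train_centroids(samples):
    """Toy constituent model: per-class mean of a 1-D feature."""
    sums, counts = {}, Counter()
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_one(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

class ShardedEnsemble:
    def __init__(self, samples, num_shards):
        # Round-robin sharding: each sample influences exactly one model.
        self.shards = [samples[i::num_shards] for i in range(num_shards)]
        self.models = [train_centroids(s) for s in self.shards]
        self.retrain_count = 0

    def predict(self, x):
        """Aggregate constituent predictions by majority vote."""
        votes = Counter(predict_one(m, x) for m in self.models)
        return votes.most_common(1)[0][0]

    def unlearn(self, bad_sample):
        """Drop a poisoned sample and retrain only the shard that held it.

        Assumes the shard still contains data after removal.
        """
        for i, shard in enumerate(self.shards):
            if bad_sample in shard:
                self.shards[i] = [s for s in shard if s != bad_sample]
                self.models[i] = train_centroids(self.shards[i])
                self.retrain_count += 1

data = [(0.0, "A"), (0.2, "A"), (0.1, "A"),
        (1.0, "B"), (1.1, "B"), (0.9, "B"),
        (0.05, "B")]  # last sample is mislabeled (poisoned)
ens = ShardedEnsemble(data, num_shards=3)
ens.unlearn((0.05, "B"))
print(ens.retrain_count, ens.predict(0.3))
```

Only one of the three constituent models is retrained here, which is the source of the retraining-time savings relative to rebuilding the whole ensemble.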
Energy-Efficient Collaborative Transport of Tether-Suspended Payloads via Rotating Equilibrium
Collaborative aerial transportation of tethered payloads is fundamentally limited by space, power, and weight constraints. Conventional approaches rely on static equilibrium conditions, where each vehicle tilts to generate the forces that ensure they maintain a formation geometry that avoids aerodynamic interactions and collision. This horizontal thrust component represents a significant energy penalty compared to the ideal case in which each vehicle produces purely vertical thrust to lift the payload. Operating in tighter tether configurations can minimize this effect, but at the cost of either having to fly the vehicles in closer proximity, which risks collision, or significantly increasing the length of the tether, which increases complexity and reduces potential use-cases. We propose operating the tether-suspended flying system at a rotating equilibrium. By maintaining steady circular motion, centrifugal forces provide the necessary horizontal tether tension, allowing each quadrotor to generate purely vertical thrust and thus reducing the total force (and power) required compared to an equilibrium where the thrusts are not vertical. It also allows for a wider range of tether configurations to be used without sacrificing efficiency. Results demonstrate that rotating equilibria can reduce power consumption relative to static lifting by up to 20%, making collaborative aerial solutions more practically relevant.
comment: 7 pages, 8 figures
Is Your Safe Controller Actually Safe? A Critical Review of CBF Tautologies and Hidden Assumptions
This tutorial provides a critical review of the practical application of Control Barrier Functions (CBFs) in robotic safety. While the theoretical foundations of CBFs are well-established, I identify a recurring gap between the mathematical assumption of a safe controller's existence and its constructive realization in systems with input constraints. I highlight the distinction between candidate and valid CBFs by analyzing the interplay of system dynamics, actuation limits, and class-K functions. I further show that some purported demonstrations of safe robot policies or controllers are limited to passively safe systems, such as single integrators or kinematic manipulators, where safety is already inherited from the underlying physics and even naive geometric hard constraints suffice to prevent collisions. By revisiting simple low-dimensional examples, I show when CBF formulations provide valid safety guarantees and when they fail due to common misuses. I then provide practical guidelines for constructing realizable safety arguments for systems without such passive safety. The goal of this tutorial is to bridge the gap between theoretical guarantees and actual implementation, supported by an open-source interactive web demonstration that visualizes these concepts intuitively.
comment: Interactive web demo: https://cbf.taekyung.me
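To make the "passively safe" case concrete, the sketch below applies a CBF safety filter to a 1D single integrator, the kind of system the abstract argues is safe almost by construction. The barrier h(x) = x_max - x, the linear class-K function alpha * h, and all names and parameter values are assumptions chosen for illustration; they are not taken from the tutorial.

```python
def cbf_filter(x, u_nom, x_max, alpha, u_max):
    """Return the input closest to u_nom that satisfies the CBF condition.

    System: x_dot = u (1D single integrator). Barrier: h(x) = x_max - x.
    CBF condition: h_dot = -u >= -alpha * h, i.e. u <= alpha * h.
    """
    h = x_max - x
    u = min(u_nom, alpha * h)  # closed-form solution of the 1D CBF-QP
    # Inside the safe set alpha * h >= 0, so u = 0 is always admissible and
    # clipping to the actuation limits cannot violate the condition.
    return max(-u_max, min(u_max, u))

def simulate(x0, u_nom, x_max=1.0, alpha=1.0, u_max=5.0, dt=0.01, steps=2000):
    """Forward-Euler rollout of the filtered system (assumes alpha * dt <= 1)."""
    xs = [x0]
    for _ in range(steps):
        u = cbf_filter(xs[-1], u_nom, x_max, alpha, u_max)
        xs.append(xs[-1] + dt * u)
    return xs

xs = simulate(x0=0.0, u_nom=2.0)
print(max(xs) <= 1.0)
```

Here the filter is always feasible because the input acts directly on the state. For systems with inertia and bounded inputs, the same candidate barrier is not automatically valid, which is the gap the tutorial examines.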
SAC-Loco: Safe and Adjustable Compliant Quadrupedal Locomotion
Quadruped robots are designed to achieve agile and robust locomotion by drawing inspiration from legged animals. However, most existing control methods for quadruped robots lack a key capacity observed in animals: the ability to exhibit diverse compliance behaviors while ensuring stability when experiencing external forces. In particular, achieving adjustable compliance while maintaining robust safety under force disturbances remains a significant challenge. In this work, we propose a safety-aware compliant locomotion framework that integrates adjustable disturbance compliance with robust failure prevention. We first train a force-compliant policy with adjustable compliance levels using a teacher-student reinforcement learning framework, allowing deployment without explicit force sensing. To handle disturbances beyond the limits of compliant control, we develop a safety-oriented policy for rapid recovery and stabilization. Finally, we introduce a learned safety critic that monitors the robot's safety in real time and coordinates between compliant locomotion and recovery behaviors. Together, these components enable quadruped robots to achieve smooth force compliance and robust safety under a wide range of external force disturbances.
State Feedback Control of State-Delayed LPV Systems using Dynamic IQCs
This paper develops a new control framework for linear parameter-varying (LPV) systems with time-varying state delays by integrating parameter-dependent Lyapunov functions with integral quadratic constraints (IQCs). A novel delay-dependent state-feedback controller structure is proposed, consisting of a linear state-feedback law augmented with an additional term that captures the delay-dependent dynamics of the plant. Closed-loop stability and $\mathcal{L}_2$-gain performance are analyzed using dynamic IQCs and parameter-dependent quadratic Lyapunov functions, leading to convex synthesis conditions that guarantee performance in terms of parameter-dependent linear matrix inequalities (LMIs). Unlike traditional delay control approaches, the proposed IQC-based framework provides a flexible and systematic methodology for handling delay effects, enabling enhanced control capability, reduced conservatism, and improved closed-loop performance.
Cognitive-Flexible Control via Latent Model Reorganization with Predictive Safety Guarantees
Learning-enabled control systems must maintain safety when system dynamics and sensing conditions change abruptly. Although stochastic latent-state models enable uncertainty-aware control, most existing approaches rely on fixed internal representations and can degrade significantly under distributional shift. This letter proposes a \emph{cognitive-flexible control} framework in which latent belief representations adapt online, while the control law remains explicit and safety-certified. We introduce a Cognitive-Flexible Deep Stochastic State-Space Model (CF--DeepSSSM) that reorganizes latent representations subject to a bounded \emph{Cognitive Flexibility Index} (CFI), and embeds the adapted model within a Bayesian model predictive control (MPC) scheme. We establish guarantees on bounded posterior drift, recursive feasibility, and closed-loop stability. Simulation results under abrupt changes in system dynamics and observations demonstrate safe representation adaptation with rapid performance recovery, highlighting the benefits of learning-enabled, rather than learning-based, control for nonstationary cyber--physical systems.
Home Energy Management under Tiered Peak Power Charges
We consider the problem of operating a battery in a grid-connected home to minimize electricity cost, which includes an energy charge and a tiered peak power charge based on the average of the $N$ largest daily peak powers over a month. With perfect foresight of loads and prices, the minimum cost can be found by solving a mixed-integer linear program (MILP), which provides a lower bound on achievable cost. We propose a model predictive control (MPC) policy that uses simple forecasts of prices and loads and solves a small MILP at each time step. Numerical experiments on data from a home in Trondheim, Norway, show that the MPC policy achieves a cost within $1.7\%$ of the prescient bound.
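The peak charge described above can be computed as follows. This is a sketch under stated assumptions: the tier boundaries and charges are hypothetical placeholder values, and the charge is assumed to be a step function of the averaged peak, which the abstract does not spell out.

```python
def monthly_peak_charge(daily_peaks_kw, n_largest, tiers):
    """Average the N largest daily peaks and look up the matching tier.

    tiers: list of (upper_limit_kw, charge) sorted by upper limit.
    The tier values used below are hypothetical, not taken from the paper.
    """
    top = sorted(daily_peaks_kw, reverse=True)[:n_largest]
    avg_peak = sum(top) / len(top)
    for upper_limit_kw, charge in tiers:
        if avg_peak <= upper_limit_kw:
            return avg_peak, charge
    raise ValueError("average peak exceeds the highest tier")

tiers = [(2.0, 100.0), (5.0, 200.0), (10.0, 300.0), (float("inf"), 400.0)]
avg_peak, charge = monthly_peak_charge([3.0, 5.0, 4.0, 7.0, 6.0], n_largest=3, tiers=tiers)
print(avg_peak, charge)
```

The sort and the tier lookup are what make the exact problem combinatorial: both the choice of which days count and the active tier are discrete, which is consistent with the mixed-integer formulation mentioned in the abstract.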
Pricing for Routing and Flow-Control in Payment Channel Networks
A payment channel network is a blockchain-based overlay mechanism that allows parties to transact more efficiently than directly using the blockchain. These networks are composed of payment channels that carry transactions between pairs of users. Due to its design, a payment channel cannot sustain a net flow of money in either direction indefinitely. Therefore, a payment channel network cannot serve transaction requests arbitrarily over a long period of time. We introduce DEBT control, a joint routing and flow-control protocol that guides a payment channel network towards an optimal operating state for any steady-state demand. In this protocol, each channel sets a price for routing transactions through it. Transacting users make flow-control and routing decisions by responding to these prices. A channel updates its price based on the net flow of money through it. The protocol is developed by formulating a network utility maximization problem and solving its dual through gradient descent. We provide convergence guarantees for the protocol and also illustrate its behavior through simulations.
comment: 17 pages, 7 figures. Published in IEEE/ACM Transactions on Networking
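The price dynamics can be illustrated on a single channel. The sketch below is a toy model, not the DEBT control protocol itself: the linear demand response, the sign convention (traffic in one direction pays the price, traffic in the other direction pays its negative), and the step size are all assumptions made for the example. It only shows the general idea of a dual-gradient update driven by net flow.

```python
def simulate_channel(demand_ab, demand_ba, step=0.1, iters=200):
    """Toy dual-gradient price update for one channel between users a and b."""
    price = 0.0  # price for routing a -> b; b -> a sees -price (assumed convention)
    for _ in range(iters):
        # Assumed linear response: users send less when their price is higher.
        flow_ab = max(0.0, demand_ab - price)
        flow_ba = max(0.0, demand_ba + price)
        # Raise the price in the direction carrying the net flow.
        price += step * (flow_ab - flow_ba)
    flow_ab = max(0.0, demand_ab - price)
    flow_ba = max(0.0, demand_ba + price)
    return price, flow_ab, flow_ba

price, flow_ab, flow_ba = simulate_channel(demand_ab=5.0, demand_ba=1.0)
print(round(price, 6), round(flow_ab, 6), round(flow_ba, 6))
```

In this toy setting the price settles where the two directional flows are equal, so the channel carries no net flow and can keep operating indefinitely.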
A Coordinated Routing Approach for Enhancing Bus Timeliness and Travel Efficiency in Mixed-Traffic Environment
This paper proposes a coordinated routing approach that investigates the use of connected and automated vehicles (CAVs) in dedicated bus lanes. The aim is to improve bus schedule adherence while enhancing the travel efficiency of CAVs during the transitional phase of mixed traffic environments. Our approach utilizes real-time traffic data to dynamically reroute CAVs in anticipation of congestion. By continuously monitoring traffic conditions on dedicated lanes and tracking the real-time positions of buses, the system adjusts CAV routes in advance to avoid potential interference with operating buses. This cooperation reduces CAV travel times and minimizes delays that impact transit services. The proposed strategy is validated using microscopic traffic simulations in SUMO. The results demonstrate significant improvements in both transit on-time performance and CAV travel efficiency across a range of traffic conditions.
NashOpt -- A Python Library for Computing Generalized Nash Equilibria
NashOpt is an open-source Python library for computing and designing generalized Nash equilibria (GNEs) in noncooperative games with shared constraints and real-valued decision variables. The library exploits the joint Karush-Kuhn-Tucker (KKT) conditions of all players to handle both general nonlinear GNEs and linear-quadratic games, including their variational versions. Nonlinear games are solved via nonlinear least-squares formulations, relying on JAX for automatic differentiation. Linear-quadratic GNEs are reformulated as mixed-integer linear programs, enabling efficient computation of multiple equilibria. The framework also supports inverse-game and Stackelberg game-design problems. The capabilities of NashOpt are demonstrated through several examples, including noncooperative game-theoretic control problems of linear quadratic regulation and model predictive control. The library is available at https://github.com/bemporad/nashopt
comment: 24 pages, 7 figures
EB-MBD: Emerging-Barrier Model-Based Diffusion for Safe Trajectory Optimization in Highly Constrained Environments ICRA 2026
We propose enforcing constraints on Model-Based Diffusion by introducing emerging barrier functions inspired by interior point methods. We demonstrate that the standard Model-Based Diffusion algorithm can lead to catastrophic performance degradation in highly constrained environments, even on simple 2D systems, due to sample inefficiency in the Monte Carlo approximation of the score function. We introduce Emerging-Barrier Model-Based Diffusion (EB-MBD), which uses progressively introduced barrier constraints to avoid these problems, significantly improving solution quality without expensive operations such as projections. We analyze the liveliness of samples at each iteration to inform the choice of barrier parameter schedule. We demonstrate results for 2D collision avoidance and a 3D underwater manipulator system and show that our method achieves lower-cost solutions than Model-Based Diffusion and requires orders of magnitude less computation time than projection-based methods.
comment: Accepted to ICRA 2026. Code available at https://github.com/acfr/emerging_barrier_mbd
Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems
Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter -- the best linear minimum mean-square error estimator of the state trajectory -- is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input-output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.
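For reference, the Kalman filter baseline mentioned above reduces to a few lines in the scalar case. This is a textbook sketch, not code from the paper; the model x_next = a * x + w, y = c * x + v with noise variances q and r, and all variable names, are assumptions made for illustration.

```python
def kalman_step(mean, var, y, a, c, q, r):
    """One predict/update cycle of a scalar Kalman filter."""
    # Predict: propagate the state estimate through the dynamics.
    mean_pred = a * mean
    var_pred = a * var * a + q
    # Update: correct the prediction using the new observation y.
    innovation = y - c * mean_pred
    innovation_var = c * var_pred * c + r
    gain = var_pred * c / innovation_var
    mean_new = mean_pred + gain * innovation
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new

mean, var = 0.0, 1.0
for y in [2.0, 2.0]:
    mean, var = kalman_step(mean, var, y, a=1.0, c=1.0, q=0.0, r=1.0)
print(mean, var)
```

The filter needs a, c, q, and r explicitly, whereas the in-context setting described in the abstract supplies only past input-output pairs and, optionally, some of these parameters.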
Reliable Grid Forecasting: State Space Models for Safety-Critical Energy Systems
Accurate grid load forecasting is safety-critical: under-predictions risk supply shortfalls, while symmetric error metrics can mask this operational asymmetry. We introduce an operator-legible evaluation framework -- Under-Prediction Rate (UPR), tail $\text{Reserve}_{99.5}^{\%}$ requirements, and explicit inflation diagnostics ($\text{Bias}_{24h}$/OPR) -- to quantify one-sided reliability risk beyond MAPE. Using this framework, we evaluate five neural architectures -- two state space models (S-Mamba, PowerMamba), two Transformers (iTransformer, PatchTST), an LSTM, and a probabilistic SSM variant (Mamba-ProbTSF) -- on a weather-aligned California Independent System Operator (CAISO) dataset spanning Nov 2023--Nov 2025 (84,498 hourly records across 5 regional transmission areas) under a rolling-origin walk-forward backtest. We develop and evaluate thermal-lag-aligned weather fusion strategies matched to each architecture's inductive bias. Our results demonstrate that standard accuracy metrics are insufficient proxies for operational safety: models with comparable MAPE can imply materially different tail reserve requirements ($\text{Reserve}_{99.5}^{\%}$). We show that explicit weather integration narrows error distributions, with the magnitude of improvement being architecturally determined -- iTransformer's cross-variate attention benefits significantly more than PatchTST's channel-independent design. Crucially, we identify a widespread susceptibility to "fake safety" in risk-averse forecasting: while probabilistic calibration reduces upper-tail errors, it achieves this by systematically inflating schedules (e.g., increasing bias by over 1,700 MW in severe cases) if left unconstrained. To solve this, we introduce Bias/OPR-constrained objectives that enable auditable trade-offs between minimizing tail risk and preventing trivial over-forecasting.
comment: 17 pages, 7 figures, 9 tables
Constraint Learning in Multi-Agent Dynamic Games from Demonstrations of Local Nash Interactions
We present an inverse dynamic game-based algorithm to learn parametric constraints from a given dataset of local Nash equilibrium interactions between multiple agents. Specifically, we introduce mixed-integer linear programs (MILP) encoding the Karush-Kuhn-Tucker (KKT) conditions of the interacting agents, which recover constraints consistent with the local Nash stationarity of the interaction demonstrations. We establish theoretical guarantees that our method learns inner approximations of the true safe and unsafe sets. We also use the interaction constraints recovered by our method to design motion plans that robustly satisfy the underlying constraints. Across simulations and hardware experiments, our methods accurately inferred constraints and designed safe interactive motion plans for various classes of constraints, both convex and non-convex, from interaction demonstrations of agents with nonlinear dynamics.
Machine Learning Guided Cooling System Optimization for Data Center
Effective data center cooling is crucial for reliable operation; however, cooling systems often exhibit inefficiencies that result in excessive energy consumption. This paper presents a three-stage, physics-guided machine learning framework for identifying and reducing cooling energy waste in high-performance computing facilities. Using one year of 10-minute resolution operational data from the Frontier exascale supercomputer, we first train a monotonicity-constrained gradient boosting surrogate that predicts facility accessory power from coolant flow rates, temperatures, and server power. The surrogate achieves a mean absolute error of 0.026 MW and predicts power usage effectiveness within 0.01 of measured values for 98.7% of test samples. In the second stage, the surrogate serves as a physics-consistent baseline to quantify excess cooling energy, revealing approximately 85 MWh of annual inefficiency concentrated in specific months, hours, and operating regimes. The third stage evaluates guardrail-constrained counterfactual adjustments to supply temperature and subloop flows, demonstrating that up to 96% of identified excess can be recovered through small, safe setpoint changes while respecting thermal limits and operational constraints. The framework yields interpretable recommendations, supports counterfactual analyses such as flow reduction during low-load periods and redistribution of thermal duty across cooling loops, and provides a practical pathway toward quantifiable reductions in accessory power. The developed framework is readily compatible with model predictive control and provides a template that, with site-specific recalibration, could be adapted to other liquid-cooled data centers with different configurations and cooling requirements.
comment: 11 pages, 11 figures
Digital Twin-Based Cooling System Optimization for Data Center
Data center cooling systems consume significant auxiliary energy, yet optimization studies rarely quantify the gap between theoretically optimal and operationally deployable control strategies. This paper develops a digital twin of the liquid cooling infrastructure at the Frontier exascale supercomputer, in which a hot-temperature water system comprises three parallel subloops, each serving dedicated coolant distribution unit clusters through plate heat exchangers and variable-speed pumps. The surrogate model is built in Modelica and validated against one full calendar year of 10-minute operational data following ASHRAE Guideline 14. The model achieves a subloop coefficient of variation of the root mean square error below 2.7% and a normalized mean bias error within 2.5%. Using this validated surrogate model, a layered optimization framework evaluates three progressively constrained strategies: an analytical flow-only optimization achieves a 20.4% total energy saving, unconstrained joint optimization of flow rate and supply temperature achieves a 30.1% total energy saving, and ramp-constrained optimization of flow rate and supply temperature, which enforces actuator rate limits, reaches a total energy saving of 27.8%. The analysis reveals that the baseline system operates at 2.9 times the minimum thermally safe flow rate, and that co-optimizing supply temperature with flow rate nearly doubles the savings achievable by flow reduction alone.
comment: 30 pages, 8 figures
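The two validation metrics named above can be computed as in the sketch below. The formulas follow the commonly cited ASHRAE Guideline 14 definitions with a degrees-of-freedom adjustment p; the sign convention (measured minus simulated), the default p = 1, and the function names are assumptions, and the paper may use a different variant.

```python
import math

def cv_rmse_percent(measured, simulated, p=1):
    """Coefficient of variation of the RMSE, in percent of the measured mean."""
    n = len(measured)
    mean = sum(measured) / n
    sse = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    return 100.0 * math.sqrt(sse / (n - p)) / mean

def nmbe_percent(measured, simulated, p=1):
    """Normalized mean bias error, in percent (assumed sign: measured - simulated)."""
    n = len(measured)
    mean = sum(measured) / n
    bias = sum(m - s for m, s in zip(measured, simulated))
    return 100.0 * bias / ((n - p) * mean)

measured = [10.0, 10.0, 10.0, 10.0]
simulated = [9.0, 11.0, 9.0, 11.0]
print(cv_rmse_percent(measured, simulated), nmbe_percent(measured, simulated))
```

The example shows why both metrics are reported together: errors that cancel give a zero bias while the CV(RMSE) still reflects the scatter.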
Robotics
Underactuated multimodal jumping robot for extraterrestrial exploration ICRA 2026
We present a rolling and jumping underactuated monopedal robot designed to explore multimodal locomotion on low-gravity bodies. It uses only two reaction wheels to control its spatial orientation with two controllers: a balancing controller which can aim the robot's jump direction on the ground, and an aerial reorientation controller which can aim the robot's leg for landing after flight. We demonstrate rolling, targeted jumping and landing, and self-righting using only three actuators total, keeping system size to 0.33 m and 1.25 kg. Simple switching between locomotion modes enables the system to deal with differing landscapes and environmental conditions.
comment: 8 pages, 14 figures, Accepted for ICRA 2026
SG-DOR: Learning Scene Graphs with Direction-Conditioned Occlusion Reasoning for Pepper Plants
Robotic harvesting in dense crop canopies requires effective interventions that depend not only on geometry, but also on explicit, direction-conditioned relations identifying which organs obstruct a target fruit. We present SG-DOR (Scene Graphs with Direction-Conditioned Occlusion Reasoning), a relational framework that, given instance-segmented organ point clouds, infers a scene graph encoding physical attachments and direction-conditioned occlusion. We introduce an occlusion ranking task for retrieving and ranking candidate leaves for a target fruit and approach direction, and propose a direction-aware graph neural architecture with per-fruit leaf-set attention and union-level aggregation. Experiments on a multi-plant synthetic pepper dataset show improved occlusion prediction (F1=0.73, NDCG@3=0.85) and attachment inference (edge F1=0.83) over strong ablations, yielding a structured relational signal for downstream intervention planning.
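The NDCG@3 figure reported above is a standard ranking metric, sketched below. The binary relevance labels (1 if a leaf truly occludes the target fruit from the given direction, 0 otherwise) and the function names are assumptions for illustration; the paper's exact relevance definition is not given in the abstract.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k items in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of candidate leaves in the order the model ranked them (hypothetical).
print(ndcg([1, 0, 1], k=3))
```

A perfect ranking scores 1.0, and placing a non-occluding leaf above an occluding one lowers the score through the logarithmic position discount.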
CFEAR-Teach-and-Repeat: Fast and Accurate Radar-only Localization ICRA
Reliable localization in prior maps is essential for autonomous navigation, particularly under adverse weather, where optical sensors may fail. We present CFEAR-TR, a teach-and-repeat localization pipeline using a single spinning radar, which is designed for easily deployable, lightweight, and robust navigation in adverse conditions. Our method localizes by jointly aligning live scans both to stored scans from the teach mapping pass and to a sliding window of recent live keyframes. This ensures accurate and robust pose estimation across different seasons and weather phenomena. Radar scans are represented using a sparse set of oriented surface points, computed from Doppler-compensated measurements. The map is stored in a pose graph that is traversed during localization. Experiments on the held-out test sequences from the Boreas dataset show that CFEAR-TR can localize with errors as low as 0.117 m and 0.096°, corresponding to improvements of up to 63% over the previous state of the art, while running efficiently at 29 Hz. These results substantially narrow the gap to lidar-level localization, particularly in heading estimation. We make the C++ implementation of our work available to the community.
comment: This paper has been accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA), 2026
A Unified Low-Dimensional Design Embedding for Joint Optimization of Shape, Material, and Actuation in Soft Robots
Soft robots achieve functionality through tight coupling among geometry, material composition, and actuation. As a result, effective design optimization requires these three aspects to be considered jointly rather than in isolation. This coupling is computationally challenging: nonlinear large-deformation mechanics increase simulation cost, while contact, collision handling, and non-smooth state transitions limit the applicability of standard gradient-based approaches. We introduce a smooth, low-dimensional design embedding for soft robots that unifies shape morphing, multi-material distribution, and actuation within a single structured parameter space. Shape variation is modeled through continuous deformation maps of a reference geometry, while material properties are encoded as spatial fields. Both are constructed from shared basis functions. This representation enables expressive co-design while drastically reducing the dimensionality of the search space. In our experiments, we show that design expressiveness increases with the number of basis functions, unlike comparable neural network encodings whose representational capacity does not scale predictably with parameter count. We further show that joint co-optimization of shape, material, and actuation using our unified embedding consistently outperforms sequential strategies. All experiments are performed independently of the underlying simulator, confirming compatibility with black-box simulation pipelines. Across multiple dynamic tasks, the proposed embedding surpasses neural network and voxel-based baseline parameterizations while using significantly fewer design parameters. Together, these findings demonstrate that structuring the design space itself enables efficient co-design of soft robots.
comment: This work has been submitted to the IEEE for possible publication
Control Barrier Corridors: From Safety Functions to Safe Sets
Safe autonomy is a critical requirement and a key enabler for robots to operate safely in unstructured complex environments. Control barrier functions and safe motion corridors are two widely used but technically distinct safety methods, functional and geometric, respectively, for safe motion planning and control. Control barrier functions are applied to the safety filtering of control inputs to limit the decay rate of system safety, whereas safe motion corridors are geometrically constructed to define a local safe zone around the system state for use in motion optimization and reference-governor design. This paper introduces a new notion of control barrier corridors, which unifies these two approaches by converting control barrier functions into local safe goal regions for reference goal selection in feedback control systems. We show, with examples on fully actuated systems, kinematic unicycles, and linear output regulation systems, that individual state safety can be extended locally over control barrier corridors for convex barrier functions, provided the control convergence rate matches the barrier decay rate, highlighting a trade-off between safety and reactiveness. Such safe control barrier corridors enable safely reachable persistent goal selection over continuously changing barrier corridors during system motion, which we demonstrate for verifiably safe and persistent path following in autonomous exploration of unknown environments.
comment: 12 pages, 6 figures, an extended preprint version of a conference paper
History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation
Vision-Language Navigation (VLN) enables robots to follow natural-language instructions in visually grounded environments, serving as a key capability for embodied robotic systems. Recent Vision-Language-Action (VLA) models have demonstrated strong navigation performance, but their high computational cost introduces latency that limits real-time deployment. We propose a training-free spatio-temporal vision token pruning framework tailored to VLA-based VLN. We apply spatial token selection to the current view, alongside spatio-temporal compression for historical memories, enabling efficient long-horizon inference while reducing redundant computation. Leveraging attention-based token importance and query-guided spatio-temporal filtering, the proposed approach preserves navigation-relevant information without retraining or modifying pretrained models, allowing plug-and-play integration into existing VLA systems. Through experiments on standard VLN benchmarks, we confirm that our method significantly outperforms existing pruning strategies. It preserves superior navigation accuracy under extreme pruning scenarios while maintaining highly competitive inference efficiency. Real-world deployment on a Unitree Go2 quadruped robot further validates reliable and low-latency instruction-following navigation under practical robotic constraints. We hope this work helps bridge the gap between large-scale multimodal modeling and efficient, real-time embodied deployment in robotic navigation systems.
Data Analogies Enable Efficient Cross-Embodiment Transfer
Generalist robot policies are trained on demonstrations collected across a wide variety of robots, scenes, and viewpoints. Yet it remains unclear how to best organize and scale such heterogeneous data so that it genuinely improves performance in a given target setting. In this work, we ask: what form of demonstration data is most useful for enabling transfer across robot set-ups? We conduct controlled experiments that vary end-effector morphology, robot platform appearance, and camera perspective, and compare the effects of simply scaling the number of demonstrations against systematically broadening the diversity in different ways. Our simulated experiments show that while perceptual shifts such as viewpoint benefit most from broad diversity, morphology shifts benefit far less from unstructured diversity and instead see the largest gains from data analogies, i.e. paired demonstrations that align scenes, tasks, and/or trajectories across different embodiments. Informed by the simulation results, we improve real-world cross-embodiment transfer success by an average of $22.5\%$ over large-scale, unpaired datasets by changing only the composition of the data.
comment: 14 pages, 11 Figures, 6 Tables
Safe Consensus of Cooperative Manipulation with Hierarchical Event-Triggered Control Barrier Functions
Cooperative transport and manipulation of heavy or bulky payloads by multiple manipulators requires coordinated formation tracking, while simultaneously enforcing strict safety constraints in varying environments with limited communication and real-time computation budgets. This paper presents a distributed control framework that achieves consensus coordination with safety guarantees via hierarchical event-triggered control barrier functions (CBFs). We first develop a consensus-based protocol that relies solely on local neighbor information to enforce both translational and rotational consistency in task space. Building on this coordination layer, we propose a three-level hierarchical event-triggered safety architecture with CBFs, which is integrated with a risk-aware leader selection and smooth switching strategy to reduce online computation. The proposed approach is validated through real-world hardware experiments using two Franka manipulators operating with static obstacles, as well as comprehensive simulations demonstrating scalable multi-arm cooperation with dynamic obstacles. Results demonstrate higher precision cooperation under strict safety constraints, achieving substantially reduced computational cost and communication frequency compared to baseline methods.
comment: 8 pages
Open-Source Based and ETSI Compliant Cooperative, Connected, and Automated Mini-Cars
The automotive sector is following a revolutionary path from vehicles controlled by humans to vehicles that will be fully automated, fully connected, and ultimately fully cooperative. Along this road, new cooperative algorithms and protocols will be designed and field tested, which represents a great challenge in terms of costs. In particular, moving from simulations to practical experiments requires huge investments that are not always affordable and may become a barrier in some cases. To solve this issue and provide the community with an intermediate step, we propose the use of 1:10 scaled cooperative, autonomous, and connected mini-cars. The mini-car is equipped with a Jetson Orin board running the open Robot Operating System 2 (ROS2), sensors for autonomous operations, and a Raspberry Pi board for connectivity running the open-source Open Stack for Car (OScar). A key aspect of the proposal is the use of OScar, which implements a full ETSI cooperative-intelligent transport systems (C-ITS) compliant stack. The feasibility and potential of the proposed platform are demonstrated through a case study in which the Day-1 intersection collision warning (ICW) application is implemented and validated.
comment: 5 pages, 6 figures
SuperSuit: An Isomorphic Bimodal Interface for Scalable Mobile Manipulation
High-quality, long-horizon demonstrations are essential for embodied AI, yet acquiring such data for tightly coupled wheeled mobile manipulators remains a fundamental bottleneck. Unlike fixed-base systems, mobile manipulators require continuous coordination between $SE(2)$ locomotion and precise manipulation, exposing limitations in existing teleoperation and wearable interfaces. We present SuperSuit, a bimodal data acquisition framework that supports both robot-in-the-loop teleoperation and active demonstration under a shared kinematic interface. Both modalities produce structurally identical joint-space trajectories, enabling direct data mixing without modifying downstream policies. For locomotion, SuperSuit maps natural human stepping to continuous planar base velocities, eliminating discrete command switches. For manipulation, it employs a strictly isomorphic wearable arm in both modes, while policy training is formulated in a shift-invariant delta-joint representation to mitigate calibration offsets and structural compliance without inverse kinematics. Real-world experiments on long-horizon mobile manipulation tasks show 2.6$\times$ higher demonstration throughput in active mode compared to a teleoperation baseline, comparable policy performance when substituting teleoperation data with active demonstrations at fixed dataset size, and monotonic performance improvement as active data volume increases. These results indicate that consistent kinematic representations across collection modalities enable scalable data acquisition for long-horizon mobile manipulation.
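To make the shift-invariance claim in the abstract above concrete, here is a minimal sketch, assuming a delta-joint representation means per-step joint differences. The function names and the toy trajectory are hypothetical, and the paper's actual formulation may differ.

```python
def to_delta_joints(trajectory):
    """Convert absolute joint positions [q_0, q_1, ...] into steps q_{t+1} - q_t."""
    return [
        [b - a for a, b in zip(q_prev, q_next)]
        for q_prev, q_next in zip(trajectory, trajectory[1:])
    ]


def integrate(q_start, deltas):
    """Rebuild absolute joint positions from a start pose and per-step deltas."""
    trajectory = [list(q_start)]
    for step in deltas:
        trajectory.append([q + d for q, d in zip(trajectory[-1], step)])
    return trajectory


# Toy two-joint trajectory and the same motion seen through a constant
# calibration offset (values chosen to be exactly representable floats).
reference = [[0.0, 1.0], [0.5, 1.5], [1.0, 1.0]]
offset = [0.25, -0.5]
shifted = [[q + o for q, o in zip(pose, offset)] for pose in reference]
```

Because a constant offset cancels in every difference, `to_delta_joints(reference)` and `to_delta_joints(shifted)` are identical, which is the property that lets data from two differently calibrated devices be mixed.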
Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise
3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL.
comment: The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL
Towards Robotic Lake Maintenance: Integrating SONAR and Satellite Data to Assist Human Operators ICMR
Artificial Water Bodies (AWBs) are human-made systems that require continuous monitoring due to their artificial biological processes. These systems demand regular maintenance to manage their ecosystems effectively. As a result of these artificial conditions, underwater vegetation can grow rapidly and must be harvested to preserve the ecological balance. This paper proposes a two-step approach to support targeted weed harvesting for the maintenance of artificial lakes. In the first step, the initial detection of Submerged Aquatic Vegetation (SAV), also referred to in this paper as areas of interest, is performed using satellite-derived indices, specifically the Aquatic Plants and Algae (APA) index, which highlights submerged vegetation in water bodies. Subsequently, an Unmanned Surface Vehicle (USV) equipped with multibeam SOund NAvigation and Ranging (SONAR) performs high-resolution bathymetric mapping to locate and quantify aquatic vegetation precisely. This two-stage approach offers an effective human-robot collaboration, where satellite data guides the USV missions and boat skippers leverage detailed SONAR maps for targeted harvesting. This setup narrows the search space and reduces the manual workload of human operators, making the harvesting process less labour-intensive. Preliminary results demonstrate the feasibility of integrating satellite imagery and underwater acoustic sensing to improve vegetation management in artificial lakes.
comment: Accepted to and presented at the 2026 IEEE International Conference on Mechatronics and Robotics Engineering (ICMRE)
NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving
Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and ``semantic-blind'' heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.
comment: Code will be available at https://github.com/xifen523/NOVA
TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving
Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon "teachers" to short-horizon "students" via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at https://github.com/zhouhao94/TaPD.
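The cosine-annealed distillation weighting mentioned in the abstract above could look roughly like the following sketch. The direction of the schedule (decaying from `w_start` to `w_end`), the default values, and the additive loss combination are assumptions; the abstract does not specify them.

```python
import math


def distill_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Cosine schedule moving from w_start at step 0 to w_end at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def total_loss(forecast_loss, distill_loss, step, total_steps):
    """Forecasting supervision plus a scheduled feature-alignment term."""
    return forecast_loss + distill_weight(step, total_steps) * distill_loss
```

Under these assumptions the feature-alignment term dominates early in training and fades smoothly, so the forecasting loss governs the final stage of optimization.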
Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling
Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.
VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction
3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.
KISS-IMU: Self-supervised Inertial Odometry with Motion-balanced Learning and Uncertainty-aware Inference
Inertial measurement units (IMUs), which provide high-frequency linear acceleration and angular velocity measurements, serve as fundamental sensing modalities in robotic systems. Recent advances in deep neural networks have led to remarkable progress in inertial odometry. However, the heavy reliance on ground truth data during training fundamentally limits scalability and generalization to unseen and diverse environments. We propose KISS-IMU, a novel self-supervised inertial odometry framework that eliminates ground truth dependency by leveraging simple LiDAR-based ICP registration and pose graph optimization as a supervisory signal. Our approach embodies two key principles: keeping the IMU stable through motion-aware balanced training and keeping the IMU strong through uncertainty-driven adaptive weighting during inference. To evaluate performance across diverse motion patterns and scenarios, we conducted comprehensive experiments on various real-world platforms, including quadruped robots. Importantly, we train only the IMU network in a self-supervised manner, with LiDAR serving solely as a lightweight supervisory signal rather than requiring additional learnable processes. This design enables the framework to ensure robustness without relying on joint multi-modal learning or ground truth supervision. The supplementary materials are available at https://sparolab.github.io/research/kiss_imu.
comment: 8 pages, 9 figures
DreamToNav: Generalizable Navigation for Robots via Generative Video Planning
We present DreamToNav, a novel autonomous robot framework that uses generative video models to enable intuitive, human-in-the-loop control. Instead of relying on rigid waypoint navigation, users provide natural language prompts (e.g. ``Follow the person carefully''), which the system translates into executable motion. Our pipeline first employs Qwen 2.5-VL-7B-Instruct to refine vague user instructions into precise visual descriptions. These descriptions condition NVIDIA Cosmos 2.5, a state-of-the-art video foundation model, to synthesize a physically consistent video sequence of the robot performing the task. From this synthetic video, we extract a valid kinematic path using visual pose estimation, robot detection and trajectory recovery. By treating video generation as a planning engine, DreamToNav allows robots to visually "dream" complex behaviors before executing them, providing a unified framework for obstacle avoidance and goal-directed navigation without task-specific engineering. We evaluate the approach on both a wheeled mobile robot and a quadruped robot in indoor navigation tasks. DreamToNav achieves a success rate of 76.7%, with final goal errors typically within 0.05-0.10 m and trajectory tracking errors below 0.15 m. These results demonstrate that trajectories extracted from generative video predictions can be reliably executed on physical robots across different locomotion platforms.
comment: Submitted to conference
Dual-Agent Multiple-Model Reinforcement Learning for Event-Triggered Human-Robot Co-Adaptation in Decoupled Task Spaces
This paper presents a shared-control rehabilitation policy for a custom 6-degree-of-freedom (6-DoF) upper-limb robot that decomposes complex reaching tasks into decoupled spatial axes. The patient governs the primary reaching direction using binary commands, while the robot autonomously manages orthogonal corrective motions. Because traditional fixed-frequency control often induces trajectory oscillations due to variable inverse-kinematics execution times, an event-driven progression strategy is proposed. This architecture triggers subsequent control actions only when the end-effector enters an admission sphere centred on the immediate target waypoint, and was validated in a semi-virtual setup linking a physical pressure sensor to a MuJoCo simulation. To optimise human--robot co-adaptation safely and efficiently, this study introduces Dual Agent Multiple Model Reinforcement Learning (DAMMRL). This framework discretises decision characteristics: the human agent selects the admission sphere radius to reflect their inherent speed--accuracy trade-off, while the robot agent dynamically adjusts its 3D Cartesian step magnitudes to complement the user's cognitive state. Trained in simulation and deployed across mixed environments, this event-triggered DAMMRL approach effectively suppresses waypoint chatter, balances spatial precision with temporal efficiency, and significantly improves success rates in object acquisition tasks.
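The event-driven progression rule described in the abstract above can be sketched as follows: the controller advances to the next waypoint only once the end-effector lies inside an admission sphere around the current one. The function name and the example radius are hypothetical.

```python
import math


def advance_waypoint(position, waypoints, index, radius):
    """Return the waypoint index to track next.

    The index advances only when the end-effector position is within
    `radius` of the current target waypoint (the admission sphere).
    """
    if index >= len(waypoints):
        return index  # path already completed
    if math.dist(position, waypoints[index]) <= radius:
        return index + 1
    return index


waypoints = [(0.0, 0.0, 0.10), (0.0, 0.0, 0.20)]
```

A larger radius trades accuracy for speed, which is the knob the abstract assigns to the human agent.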
A Hazard-Informed Data Pipeline for Robotics Physical Safety
This report presents a structured Robotics Physical Safety Framework based on explicit asset declaration, systematic vulnerability enumeration, and hazard-driven synthetic data generation. The approach bridges classical risk engineering with modern machine learning pipelines, enabling safety envelope learning grounded in a formalized hazard ontology. The key contribution of this framework is the alignment between classical safety engineering, digital twin simulation, synthetic data generation, and machine learning model training.
comment: 4th International Conference on Automation and Mechatronics Engineering (ICAME 2026)
Sticky-Glance: Robust Intent Recognition for Human Robot Collaboration via Single-Glance
Gaze is a valuable means of communication for impaired people with extremely limited motor capabilities. However, robust gaze-based intent recognition in multi-object environments is challenging due to gaze noise, micro-saccades, viewpoint changes, and dynamic objects. To address this, we propose an object-centric gaze grounding framework that stabilizes intent through a sticky-glance algorithm, jointly modeling geometric distance and direction trends. The inferred intent remains anchored to the object even under short glances with as few as 3 gaze samples, achieving a tracking rate of 0.94 for dynamic targets and a selection accuracy of 0.98 for static targets. We further introduce a continuous shared control and multi-modal interaction paradigm, enabling high-readiness control and human-in-the-loop feedback, thereby reducing task duration by nearly 10%. Experiments across dynamic tracking, multi-perspective alignment, a baseline comparison, user studies, and ablation studies demonstrate improved robustness, efficiency, and reduced workload compared to representative baselines.
Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning
Large and small language models have been widely used for robotic task planning. At the same time, vision-language models (VLMs) have successfully tackled problems such as image captioning, scene understanding, and visual question answering. In this work, we combine these two approaches by deploying a compact, open-source multimodal model to generate behavior trees for robotic task planning. The main obstacle to achieving this goal is the lack of an existing dataset that links visual observations and instructions to executable behavior trees. We propose a method to construct such a dataset starting from existing robotic episodes (i.e., Open X-Embodiment), in which a large model serves as a teacher in a multi-stage generation pipeline. We use this dataset to fine-tune VLMs ranging from 500M to 4B parameters via parameter-efficient fine-tuning (PEFT). The generated behavior trees, compatible with the BehaviorTree.CPP library, are evaluated both offline, using structural and lexical metrics, and online through the execution of household tasks in a state-of-the-art embodied simulator. Our results demonstrate that our fine-tuned 4B-parameter VLM approaches the performance of state-of-the-art closed-source models, achieving an 87% success rate while requiring only a fraction of the computational resources.
Lifelong Embodied Navigation Learning
Embodied navigation agents powered by large language models have shown strong performance on individual tasks but struggle to continually acquire new navigation skills because they suffer from catastrophic forgetting. We formalize this challenge as lifelong embodied navigation learning (LENL), where an agent is required to adapt to a sequence of navigation tasks spanning multiple scenes and diverse user instruction styles, while retaining previously learned knowledge. To tackle this problem, we propose Uni-Walker, a lifelong embodied navigation framework that decouples navigation knowledge into task-shared and task-specific components with Decoder Extension LoRA (DE-LoRA). To learn the shared knowledge, we design a knowledge inheritance strategy and an experts co-activation strategy to facilitate shared knowledge transfer and refinement across multiple navigation tasks. To learn the specific knowledge, we propose an expert subspace orthogonality constraint together with a navigation-specific chain-of-thought reasoning mechanism to capture specific knowledge and enhance instruction-style understanding. Extensive experiments demonstrate the superiority of Uni-Walker for building universal navigation agents with lifelong learning.
comment: 24 pages, 7 figures
Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting IROS
The demand for large-scale digital twins is rapidly growing in robotics and autonomous driving. However, constructing these environments with 3D Gaussian Splatting (3DGS) usually requires expensive, purpose-built data collection. Meanwhile, deployed platforms routinely collect extensive omnidirectional RGB and LiDAR logs, but a significant portion of this sensor data is discarded outright or left underutilized due to transmission constraints and the lack of a scalable reuse pipeline. In this paper, we present an omnidirectional RGB-LiDAR reuse pipeline that transforms these archived logs into robust initialization assets for 3DGS. Direct conversion of such raw logs introduces practical bottlenecks: inherent non-linear distortion leads to unreliable Structure-from-Motion (SfM) tracking, and dense, unorganized LiDAR clouds cause computational overhead during 3DGS optimization. To overcome these challenges, our pipeline integrates an ERP-to-cubemap conversion module for deterministic spatial anchoring, alongside PRISM, a color-stratified downsampling strategy. By bridging these multi-modal inputs via Fast Point Feature Histograms (FPFH)-based global registration and Iterative Closest Point (ICP), our pipeline successfully repurposes a considerable fraction of discarded data into usable SfM geometry. Furthermore, our LiDAR-reinforced initialization consistently enhances the final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines. Ultimately, this work provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs.
comment: This work has been submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication
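For readers unfamiliar with ERP-to-cubemap conversion as named in the abstract above, the sketch below shows the standard geometry: each cubemap face pixel defines a viewing ray, and that ray is looked up in the equirectangular (ERP) image through its longitude and latitude. The axis convention (x right, y up, z forward), the pixel-centre offsets, and the function names are assumptions; the paper's module may use different conventions.

```python
import math


def direction_to_erp(direction, width, height):
    """Map a 3D viewing direction to continuous ERP pixel coordinates (u, v)."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)        # longitude in (-pi, pi], 0 = forward
    lat = math.asin(y / norm)     # latitude in [-pi/2, pi/2], 0 = horizon
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v


def front_face_pixel_to_erp(i, j, face_size, width, height):
    """ERP coordinates sampled by pixel (i, j) of the forward-facing cube face."""
    a = 2.0 * (i + 0.5) / face_size - 1.0   # horizontal face coordinate in [-1, 1]
    b = 1.0 - 2.0 * (j + 0.5) / face_size   # vertical face coordinate in [-1, 1]
    return direction_to_erp((a, b, 1.0), width, height)
```

The other five faces follow by permuting and negating the components of the ray; each face is then a distortion-free pinhole view that standard SfM tooling can track.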
RODEO: RObotic DEcentralized Organization
Robots are improving their autonomy with minimal human supervision. However, auditable actions, transparent decision processes, and new human-robot interaction models are still missing requirements to achieve extended robot autonomy. To tackle these challenges, we propose RODEO (RObotic DEcentralized Organization), a blockchain-based framework that integrates trust and accountability mechanisms for robots. This paper formalizes Decentralized Autonomous Organizations (DAOs) for service robots. First, it provides a ROS-ETH bridge between the DAO and the robots. Second, it offers templates that enable organizations (e.g., companies, universities) to integrate service robots into their operations. Third, it provides proof-verification mechanisms that allow robot actions to be auditable. In our experimental setup, a mobile robot was deployed as a trash collector in a lab scenario. The robot collects trash and uses a smart bin to sort and dispose of it correctly. Then, the robot submits a proof of the successful operation and is compensated in DAO tokens. Finally, the robot re-invests the acquired funds to purchase battery charging services. Data collected in a three-day experiment show that the robot doubled its income and reinvested funds to extend its operating time. The proof validation times of approximately one minute ensured verifiable task execution, while the accumulated robot income successfully funded up to 88 hours of future autonomous operation. The results of this research offer insights into how robots and organizations can coordinate tasks and payments with auditable execution proofs and on-chain settlement.
comment: 8 pages, 6 figures, Accepted at IEEE International Conference on Robotics & Automation (2026)
Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models CVPR2026
We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. To address this, we propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt to this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS), which prioritizes high-diversity samples, and introduce Spanning Driving Reward (SDR) with a focal-style weighting to amplify the reward's value span and improve sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.
comment: Accepted by CVPR2026 findings
Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration
Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.
TADPO: Reinforcement Learning Goes Off-road ICRA 2026
Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.
comment: 8 pages, 5 figures, 2 tables. Accepted at ICRA 2026
Moving Through Clutter: Scaling Data Collection and Benchmarking for 3D Scene-Aware Humanoid Locomotion via Virtual Reality
Recent advances in humanoid locomotion have enabled dynamic behaviors such as dancing, martial arts, and parkour, yet these capabilities are predominantly demonstrated in open, flat, and obstacle-free settings. In contrast, real-world environments such as homes, offices, and public spaces are densely cluttered, three-dimensional, and geometrically constrained, requiring scene-aware whole-body coordination, precise balance control, and reasoning over spatial constraints imposed by furniture and household objects. However, humanoid locomotion in cluttered 3D environments remains underexplored, and no public dataset systematically couples full-body human locomotion with the scene geometry that shapes it. To address this gap, we present Moving Through Clutter (MTC), an open-source Virtual Reality (VR)-based data collection and evaluation framework for scene-aware humanoid locomotion in cluttered environments. Our system procedurally generates scenes with controllable clutter levels and captures embodiment-consistent, whole-body human motion through immersive VR navigation, which is then automatically retargeted to a humanoid robot model. We further introduce benchmarks that quantify environment clutter level and locomotion performance, including stability and collision safety. Using this framework, we compile a dataset of 348 trajectories across 145 diverse 3D cluttered scenes. The dataset provides a foundation for studying geometry-induced adaptation in humanoid locomotion and developing scene-aware planning and control methods.
MagRobot: An Open Simulator for Magnetically Navigated Robots
Magnetic navigation systems, including magnetic tracking systems and magnetic actuation systems, have shown great potential for occlusion-free localization and remote control of intracorporeal medical devices and robots in minimally invasive medicine, such as capsule endoscopy and cardiovascular intervention. However, the design of magnetically navigated robots remains heavily reliant on experimental prototyping, which is time-consuming and costly. Furthermore, there is no consistent experimental environment in which to compare and benchmark the hardware and algorithms across different magnetic navigation systems. To address these challenges, we propose the first universal open-source simulation platform to facilitate research, design and benchmarking of magnetically navigated robots. Our simulator features an intuitive graphical user interface that enables the user to efficiently design, visualize, and analyze magnetic navigation systems for both rigid and soft robots. The proposed simulator is versatile and can simulate both magnetic actuation and magnetic tracking tasks in diverse medical applications that involve deformable anatomies. It also provides an open development environment, where the user can load third-party anatomical models and customize both hardware and algorithms of magnetic navigation systems. The fidelity of the simulator is validated using both phantom and ex vivo experiments of magnetic navigation of a continuum robot and a capsule robot with diverse magnetic actuation setups. Three use cases of the simulator, i.e., bronchoscopy, endovascular intervention, and gastrointestinal endoscopy, are implemented to demonstrate its functionality. It is shown that the configuration and algorithms of magnetic navigation systems can be flexibly designed and optimized for better performance using the simulator.
comment: 20 pages, 10 figures
HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild
This work presents the first study on transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting, a long-horizon, unstructured task challenged by occlusion and specular reflections. We built an end-to-end closed-loop system on the HarvestFlex platform using three-view RGB sensing (two fixed scene views plus a wrist-mounted view) and intentionally avoided depth clouds and explicit geometric calibration. We collected 3.71 h of VR teleoperated demonstrations (227 episodes) and fine-tuned pi_0, pi_0.5, and WALL-OSS with full fine-tuning and LoRA. Under a unified 50-trial real-greenhouse protocol and metrics spanning completion, pi_0.5 with full fine-tuning achieved a success rate of 74.0% with 32.6 s/pick and a damage rate of 4.1%. Asynchronous inference-control decoupling further improved performance over synchronous deployment. Results showed non-trivial closed-loop picking with fewer than four hours of real data, while remaining limited by close-range observability loss and contact-dynamics mismatch. A demonstration video is available at: https://youtu.be/bN8ZowZKPMI.
Proprioceptive Shape Estimation of Tensegrity Manipulators Using Energy Minimisation ICRA 2026
Shape estimation is fundamental for controlling continuously bending tensegrity manipulators, yet achieving it remains a challenge. Although using exteroceptive sensors makes the implementation straightforward, it is costly and limited to specific environments. Proprioceptive approaches, by contrast, do not suffer from these limitations. So far, several methods have been proposed; however, to our knowledge, there are no proven examples of large-scale tensegrity structures used as manipulators. This paper demonstrates that shape estimation of the entire tensegrity manipulator can be achieved using only the inclination angle information relative to gravity for each strut. Inclination angle information is intrinsic sensory data that can be obtained simply by attaching an inertial measurement unit (IMU) to each strut. Experiments conducted on a five-layer tensegrity manipulator with 20 struts and a total length of 1160 mm demonstrate that the proposed method can estimate the shape with an accuracy of 2.1% of the total manipulator length from arbitrary initial conditions under static conditions, and that it maintains stable shape estimation under external disturbances.
comment: 8 pages, 10 figures, IEEE ICRA 2026
PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition
We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $σ_θ= σ_t / r$ in $\mathcal{O}(R \times S)$ time. The primary parameter $σ_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity allowing cross-sensor generalization without per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance against both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
comment: 8 pages, 8 figures
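The distance-adaptive angular uncertainty quoted in the PROBE abstract above, $σ_θ = σ_t / r$, can be illustrated with a minimal Python sketch. The function name, the ring radii, and the 0.5 m value are illustrative assumptions, not taken from the paper or its released code.

```python
def angular_sigma(sigma_t: float, r: float) -> float:
    """Angular uncertainty in radians at range r (meters) for a
    translational uncertainty sigma_t (meters): sigma_theta = sigma_t / r."""
    if r <= 0:
        raise ValueError("range r must be positive")
    return sigma_t / r


# Hypothetical ring-center radii in meters (illustrative only).
ring_radii = [5.0, 10.0, 20.0, 40.0]
sigma_t = 0.5  # assumed translational uncertainty in meters

sigmas = [angular_sigma(sigma_t, r) for r in ring_radii]
for r, s in zip(ring_radii, sigmas):
    print(f"r = {r:5.1f} m  ->  sigma_theta = {s:.4f} rad")
```

Under this relation, cells close to the sensor receive a larger angular uncertainty than distant cells for the same translational offset.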
How to Model Your Crazyflie Brushless
The Crazyflie quadcopter is widely recognized as a leading platform for nano-quadcopter research. In early 2025, the Crazyflie Brushless was introduced, featuring brushless motors that provide around 50% more thrust compared to the brushed motors of its predecessor, the Crazyflie 2.1. This advancement has opened new opportunities for research in agile nano-quadcopter control. To support researchers utilizing this new platform, this work presents a dynamics model of the Crazyflie Brushless and identifies its key parameters. Through simulations and hardware analyses, we assess the accuracy of our model. We furthermore demonstrate its suitability for reinforcement learning applications by training an end-to-end neural network position controller and learning a backflip controller capable of executing two complete rotations with a vertical movement of just 1.8 meters. This showcases the model's ability to facilitate the learning of controllers and acrobatic maneuvers that successfully transfer from simulation to hardware. Utilizing this application, we investigate the impact of domain randomization on control performance, offering valuable insights into bridging the sim-to-real gap with the presented model. We have open-sourced the entire project, enabling users of the Crazyflie Brushless to swiftly implement and test their own controllers on an accurate simulation platform.
Swooper: Learning High-Speed Aerial Grasping With a Simple Gripper
High-speed aerial grasping presents significant challenges due to the high demands on precise, responsive flight control and coordinated gripper manipulation. In this work, we propose Swooper, a deep reinforcement learning (DRL) based approach that achieves both precise flight control and active gripper control using a single lightweight neural network policy. Training such a policy directly via DRL is nontrivial due to the complexity of coordinating flight and grasping. To address this, we adopt a two-stage learning strategy: we first pre-train a flight control policy, and then fine-tune it to acquire grasping skills. With the carefully designed reward functions and training framework, the entire training process completes in under 60 minutes on a standard desktop with an Nvidia RTX 3060 GPU. To validate the trained policy in the real world, we develop a lightweight quadrotor grasping platform equipped with a simple off-the-shelf gripper, and deploy the policy in a zero-shot manner on the onboard Raspberry Pi 4B computer, where each inference takes only about 1.0 ms. In 25 real-world trials, our policy achieves an 84% grasp success rate and grasping speeds of up to 1.5 m/s without any fine-tuning. This matches the robustness and agility of state-of-the-art classical systems with sophisticated grippers, highlighting the capability of DRL for learning a robust control policy that seamlessly integrates high-speed flight and grasping. The supplementary video is available for more results. Video: https://zikenhuang.github.io/Swooper/.
FTSplat: Feed-forward Triangle Splatting Network
High-fidelity three-dimensional (3D) reconstruction is essential for robotics and simulation. While Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive rendering quality, their reliance on time-consuming per-scene optimization limits real-time deployment. Emerging feed-forward Gaussian splatting methods improve efficiency but often lack explicit, manifold geometry required for direct simulation. To address these limitations, we propose a feed-forward framework for triangle primitive generation that directly predicts continuous triangle surfaces from calibrated multi-view images. Our method produces simulation-ready models in a single forward pass, obviating the need for per-scene optimization or post-processing. We introduce a pixel-aligned triangle generation module and incorporate relative 3D point cloud supervision to enhance geometric learning stability and consistency. Experiments demonstrate that our method achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators.
Iterative Convex Optimization with Control Barrier Functions for Obstacle Avoidance among Polytopes
Obstacle avoidance of polytopic obstacles by polytopic robots is a challenging problem in optimization-based control and trajectory planning. Many existing methods rely on smooth geometric approximations, such as hyperspheres or ellipsoids, which allow differentiable distance expressions but distort the true geometry and restrict the feasible set. Other approaches integrate exact polytope distances into nonlinear model predictive control (MPC), resulting in nonconvex programs that limit real-time performance. In this paper, we construct linear discrete-time control barrier function (DCBF) constraints by deriving supporting hyperplanes from exact closest-point computations between convex polytopes. We then propose a novel iterative convex MPC-DCBF framework, where local linearization of system dynamics and robot geometry ensures convexity of the finite-horizon optimization at each iteration. The resulting formulation reduces computational complexity and enables fast online implementation for safety-critical control and trajectory planning of general nonlinear dynamics. The framework extends to multi-robot and three-dimensional environments. Numerical experiments demonstrate collision-free navigation in cluttered maze scenarios with millisecond-level solve times.
comment: 9 pages, 4 figures
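The idea in the abstract above, turning a closest-point pair into a linear barrier constraint, can be sketched as follows. This is a generic illustration, not the authors' formulation: the function names are hypothetical, the closest points are assumed to be given by some external routine, and the condition $h(x_{k+1}) \ge (1-γ)\,h(x_k)$ with $0 < γ \le 1$ is the commonly used discrete-time CBF form, assumed here.

```python
import math


def supporting_hyperplane(p_robot, p_obs):
    """Build a hyperplane through the obstacle's closest point p_obs whose
    unit normal n points toward the robot's closest point p_robot.
    Returns (n, b) so that h(x) = n.x - b is >= 0 on the robot's side."""
    d = [a - c for a, c in zip(p_robot, p_obs)]
    norm = math.sqrt(sum(v * v for v in d))
    if norm == 0.0:
        raise ValueError("closest points coincide; no separating direction")
    n = [v / norm for v in d]
    b = sum(ni * pi for ni, pi in zip(n, p_obs))
    return n, b


def h(x, n, b):
    """Linear barrier value: signed distance of x from the hyperplane."""
    return sum(ni * xi for ni, xi in zip(n, x)) - b


def dcbf_satisfied(x_k, x_next, n, b, gamma):
    """Assumed DCBF condition: h(x_next) >= (1 - gamma) * h(x_k)."""
    return h(x_next, n, b) >= (1.0 - gamma) * h(x_k, n, b)


# Toy 2D example: robot point at (2, 0), obstacle closest point at the origin.
n, b = supporting_hyperplane((2.0, 0.0), (0.0, 0.0))
ok_step = dcbf_satisfied((2.0, 0.0), (1.0, 0.0), n, b, gamma=0.5)
bad_step = dcbf_satisfied((2.0, 0.0), (0.5, 0.0), n, b, gamma=0.5)
```

Because `h` is linear in `x`, the condition stays linear in the next state once the dynamics are linearized, which is what keeps each iteration a convex program.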
Improved hopping control on slopes for small robots using spring mass modeling
Hopping robots often lose balance on slopes because the tilted ground creates unwanted rotation at landing. This work analyzes that effect using a simple spring-mass model and identifies how slope-induced impulses destabilize the robot. To address this, we introduce two straightforward fixes: adjusting the body's touchdown angle based on the slope and applying a small corrective torque before takeoff. Together, these steps effectively cancel the unwanted rotation caused by inclined terrain, allowing the robot to land smoothly and maintain stable hopping even on steep slopes. Moreover, the proposed method remains simple enough to implement on low-cost robotic platforms without requiring complex sensing or computation. By combining this analytical model with minimal control actions, this approach provides a practical path toward reliable hopping on uneven terrain. The results from simulation confirm that even small slope-aware adjustments can dramatically improve landing stability, making the technique suitable for future autonomous field robots that must navigate natural environments such as hills, rubble, and irregular outdoor landscapes.
Systematic Evaluation of Novel View Synthesis for Video Place Recognition IROS 2026
The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.
comment: Submitted to IEEE IROS 2026
AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models
Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models require fine-tuning to be deployed in specific environments. These fine-tuned models are highly sensitive to camera viewpoint changes that frequently occur in unstructured environments. In this paper, we propose a zero-shot camera adaptation framework that requires no additional demonstration data, policy fine-tuning, or architectural modification. Our key idea is to virtually adjust test-time camera observations to match the training camera configuration in real time. To this end, we use a recent feed-forward novel view synthesis model that outputs high-quality target view images, handling both extrinsic and intrinsic parameters. This plug-and-play approach preserves the pre-trained capabilities of VLAs and applies to any RGB-based policy. Through extensive experiments on the LIBERO benchmark, our method consistently outperforms baselines that use data augmentation for policy fine-tuning or additional 3D-aware features for visual input. We further validate that our approach consistently enhances viewpoint robustness in real-world robotic manipulation scenarios, including settings with varying camera extrinsics, intrinsics, and freely moving handheld cameras.
comment: Under review, Project Page: https://heo0224.github.io/AnyCamVLA/
DexEMG: Towards Dexterous Teleoperation System via EMG2Pose Generalization
High-fidelity teleoperation of dexterous robotic hands is essential for bringing robots into unstructured domestic environments. However, existing teleoperation systems often face a trade-off between performance and portability: vision-based capture systems are constrained by costs and line-of-sight requirements, while mechanical exoskeletons are bulky and physically restrictive. In this paper, we present DexEMG, a lightweight and cost-effective teleoperation system leveraging surface electromyography (sEMG) to bridge the gap between human intent and robotic execution. We first collect a synchronized dataset of sEMG signals and hand poses via a MoCap glove to train EMG2Pose, a neural network capable of continuously predicting hand kinematics directly from muscle activity. To ensure seamless control, we develop a robust hand retargeting algorithm that maps the predicted poses onto a multi-fingered dexterous hand in real-time. Experimental results demonstrate that DexEMG achieves high precision in diverse teleoperation tasks. Notably, our system exhibits strong generalization capabilities across novel objects and complex environments without the need for intensive individual-specific recalibration. This work offers a scalable and intuitive interface for both general-purpose robotic manipulation and assistive technologies.
Expert Knowledge-driven Reinforcement Learning for Autonomous Racing via Trajectory Guidance and Dynamics Constraints
Reinforcement learning has demonstrated significant potential in the field of autonomous driving. However, it suffers from defects such as training instability and unsafe action outputs when faced with autonomous racing environments characterized by high dynamics and strong nonlinearities. To this end, this paper proposes a trajectory guidance and dynamics constraints Reinforcement Learning (TraD-RL) method for autonomous racing. The key features of this method are as follows: 1) leveraging the prior expert racing line to construct an augmented state representation and facilitate reward shaping, thereby integrating domain knowledge to stabilize early-stage policy learning; 2) embedding explicit vehicle dynamic priors into a safe operating envelope formulated via control barrier functions to enable safety-constrained learning; and 3) adopting a multi-stage curriculum learning strategy that shifts from expert-guided learning to autonomous exploration, allowing the learned policy to surpass expert-level performance. The proposed method is evaluated in a high-fidelity simulation environment modeled after the Tempelhof Airport Street Circuit. Experimental results demonstrate that TraD-RL effectively improves both lap speed and driving stability of the autonomous racing vehicle, achieving a synergistic optimization of racing performance and safety.
Terrain characterization and locomotion adaptation in a small-scale lizard-inspired robot IROS 2026
Unlike their large-scale counterparts, small-scale robots are largely confined to laboratory environments and are rarely deployed in real-world settings. As robot size decreases, robot-terrain interactions fundamentally change; however, there remains a lack of systematic understanding of what sensory information small-scale robots should acquire and how they should respond when traversing complex natural terrains. To address these challenges, we develop a Small-scale, Intelligent, Lizard-inspired, Adaptive Robot (SILA Bot) capable of adapting to diverse substrates. We use granular media of varying depths as a controlled yet representative terrain paradigm. We show that the optimal body movement pattern (ranging from standing-wave bending that assists limb retraction on flat ground to traveling-wave undulation that generates thrust in deep granular media) can be parameterized and approximated as a linear function of granular depth. Furthermore, proprioceptive signals, such as joint torque, provide sufficient information to estimate granular depth via a K-Nearest Neighbors classifier, achieving 95% accuracy. Leveraging these relationships, we design a simple linear feedback controller that modulates body phase and substantially improves locomotion performance on terrains with unknown depth. Together, these results establish a principled framework for perception and control in small-scale locomotion and enable effective terrain-adaptive locomotion while maintaining low computational complexity.
comment: 7 pages. 9 figures. IROS 2026 Conference
OpenHEART: Opening Heterogeneous Articulated Objects with a Legged Manipulator
Legged manipulators offer high mobility and versatile manipulation. However, robust interaction with heterogeneous articulated objects, such as doors, drawers, and cabinets, remains challenging because of the diverse articulation types of the objects and the complex dynamics of the legged robot. Existing reinforcement learning (RL)-based approaches often rely on high-dimensional sensory inputs, leading to sample inefficiency. In this paper, we propose a robust and sample-efficient framework for opening heterogeneous articulated objects with a legged manipulator. In particular, we propose Sampling-based Abstracted Feature Extraction (SAFE), which encodes handle and panel geometry into a compact low-dimensional representation, improving cross-domain generalization. Additionally, Articulation Information Estimator (ArtIEst) is introduced to adaptively mix proprioception with exteroception to estimate opening direction and range of motion for each object. The proposed framework was deployed to manipulate various heterogeneous articulated objects in simulation and real-world robot systems. Videos can be found on the project website: https://openheart-icra.github.io/OpenHEART/
comment: 8 pages
Hierarchical Latent Action Model ICLR 2026
Latent Action Models (LAMs) enable learning from actionless data for applications ranging from robotic control to interactive world models. However, existing LAMs typically focus on short-horizon frame transitions and capture low-level motion while overlooking longer-term temporal structure. In contrast, actionless videos often contain temporally extended and high-level skills. We present HiLAM, a hierarchical latent action model that discovers latent skills by modeling long-term temporal information. To capture these dependencies across long horizons, we utilize a pretrained LAM as a low-level extractor. This architecture aggregates latent action sequences, which contain the underlying dynamic patterns of the video, into high-level latent skills. Our experiments demonstrate that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.
comment: ICLR 2026 Workshop - 2nd Workshop on World Models: Understanding, Modelling and Scaling
CDF-Glove: A Cable-Driven Force Feedback Glove for Dexterous Teleoperation
High-quality teleoperated demonstrations are a primary bottleneck for imitation learning (IL) in dexterous manipulation. However, haptic feedback provides operators with real-time contact information, enabling real-time finger posture adjustments and thereby improving demonstration quality. Existing dexterous teleoperation platforms typically omit haptic feedback and remain bulky and expensive. We introduce CDF-Glove, a lightweight and low-cost cable-driven force-feedback glove. The real-time state is available for 20 finger degrees of freedom (DoF), of which 16 are directly sensed and 4 are passively coupled (inferred from kinematic constraints). We develop a kinematic model and control stack for the glove, and validate them across multiple robotic hands with diverse kinematics and DoF. The CDF-Glove achieves distal joint repeatability of 0.4 degrees and delivers force feedback with about 200 ms latency, yielding a 4x improvement in task success rate relative to no-feedback teleoperation. We collect two bimanual teleoperation datasets, on which we train and evaluate Diffusion Policy baselines. Compared to kinesthetic teaching, the policies trained on our teleoperated demonstrations increase the average success rate by 55% and reduce the mean completion time by approximately 15.2 seconds (a 47.2% relative reduction). In particular, the CDF-Glove costs approximately US$230. The code and designs are released as open source at https://cdfglove.github.io/.
Task-Level Decisions to Gait Level Control: A Hierarchical Policy Approach for Quadruped Navigation IROS 2026
Real-world quadruped navigation is constrained by a scale mismatch between high-level navigation decisions and low-level gait execution, as well as by instabilities under out-of-distribution environmental changes. Such variations challenge sim-to-real transfer and can trigger falls when policies lack explicit interfaces for adaptation. In this paper, we present a hierarchical policy architecture for quadrupedal navigation, termed Task-level Decision to Gait Control (TDGC). A low-level policy, trained with reinforcement learning in simulation, delivers gait-conditioned locomotion and maps task requirements to a compact set of controllable behavior parameters, enabling robust mode generation and smooth switching. A high-level policy makes task-centric decisions from sparse semantic or geometric terrain cues and translates them into low-level targets, forming a traceable decision pipeline without dense maps or high-resolution terrain reconstruction. Different from end-to-end approaches, our architecture provides explicit interfaces for deployment-time tuning, fault diagnosis, and policy refinement. We introduce a structured curriculum with performance-driven progression that expands environmental difficulty and disturbance ranges. Experiments show higher task success rates on mixed terrains and out-of-distribution tests.
comment: Submitted to IROS 2026
Multi-Robot Trajectory Planning via Constrained Bayesian Optimization and Local Cost Map Learning with STL-Based Conflict Resolution ICRA 2026
We address multi-robot motion planning under Signal Temporal Logic (STL) specifications with kinodynamic constraints. Exact approaches face scalability bottlenecks and limited adaptability, while conventional sampling-based methods require excessive samples to construct optimal trajectories. We propose a two-stage framework integrating sampling-based online learning with formal STL reasoning. At the single-robot level, our constrained Bayesian Optimization-based Tree search (cBOT) planner uses a Gaussian process as a surrogate model to learn local cost maps and feasibility constraints, generating shorter collision-free trajectories with fewer samples. At the multi-robot level, our STL-enhanced Kinodynamic Conflict-Based Search (STL-KCBS) algorithm incorporates STL monitoring into conflict detection and resolution, ensuring specification satisfaction while maintaining scalability and probabilistic completeness. Benchmarking demonstrates improved trajectory efficiency and safety over existing methods. Real-world experiments with autonomous surface vehicles validate robustness and practical applicability in uncertain environments. The STLcBOT Planner will be released as an open-source package, and videos of real-world and simulated experiments are available at https://stlbot.github.io/.
comment: Accepted to ICRA 2026
BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
comment: 4 figures, 6 tables in the main paper, 32 pages in total
Fly360: Omnidirectional Obstacle Avoidance within Drone View
Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spatial intelligence. However, current obstacle-avoidance methods mainly depend on limited field-of-view sensors and are ill-suited for UAV scenarios that require full-spatial awareness when the movement direction differs from the UAV's heading. This limitation motivates us to explore omnidirectional obstacle avoidance for panoramic drones with full-view perception. We first study an underexplored problem setting in which a UAV must generate collision-free motion in environments with obstacles from arbitrary directions, and then construct a benchmark that consists of three representative flight tasks. Based on such settings, we propose Fly360, a two-stage perception-decision pipeline with a fixed random-yaw training strategy. At the perception stage, panoramic RGB observations are converted into depth maps as a robust intermediate representation. A lightweight policy network then outputs body-frame velocity commands from the depth inputs. Extensive simulation and real-world experiments demonstrate that Fly360 achieves stable omnidirectional obstacle avoidance and outperforms forward-view baselines across all tasks. Our model is available at https://zxkai.github.io/fly360/
comment: 16 pages, 10 figures
Uncertainty-Aware Adaptive Dynamics For Underwater Vehicle-Manipulator Robots
Accurate and adaptive dynamic models are critical for underwater vehicle-manipulator systems where hydrodynamic effects induce time-varying parameters. This paper introduces a novel uncertainty-aware adaptive dynamics model framework that remains linear in lumped vehicle and manipulator parameters, and embeds convex physical consistency constraints during online estimation. Moving horizon estimation is used to stack horizon regressors, enforce realizable inertia, damping, friction, and hydrostatics, and quantify uncertainty from parameter evolution. Experiments on a BlueROV2 Heavy with a 4-DOF manipulator demonstrate rapid convergence and calibrated predictions. Manipulator fits achieve $R^2$ = 0.88 to 0.98 with slopes near unity, while vehicle surge, heave, and roll are reproduced with good fidelity under stronger coupling and noise. Median solver time is approximately 0.023 s per update, confirming online feasibility. A comparison against a fixed parameter model shows consistent reductions in MAE and RMSE across degrees of freedom. Results indicate physically plausible parameters and confidence intervals with near 100% coverage, enabling reliable feedforward control and simulation in underwater environments.
Unified Learning of Temporal Task Structure and Action Timing for Bimanual Robot Manipulation
Temporal task structure is fundamental for bimanual manipulation: a robot must not only know that one action precedes or overlaps another, but also when each action should occur and how long it should take. While symbolic temporal relations enable high-level reasoning about task structure and alternative execution sequences, concrete timing parameters are equally essential for coordinating two hands at the execution level. Existing approaches address these two levels in isolation, leaving a gap between high-level task planning and low-level movement synchronization. This work presents an approach for learning both symbolic and subsymbolic temporal task constraints from human demonstrations and deriving executable, temporally parametrized plans for bimanual manipulation. Our contributions are (i) a 3-dimensional representation of timings between two actions with methods based on multivariate Gaussian Mixture Models to represent temporal relationships between actions on a subsymbolic level, (ii) a method based on the Davis-Putnam-Logemann-Loveland (DPLL) algorithm that finds and ranks all contradiction-free assignments of Allen relations to action pairs, representing different modes of a task, and (iii) an optimization-based planning system that combines the identified symbolic and subsymbolic temporal task constraints to derive temporally parametrized plans for robot execution. We evaluate our approach on several datasets, demonstrating that our method generates temporally parametrized plans closer to human demonstrations than the most characteristic demonstration baseline.
comment: This work has been submitted to the IEEE for possible publication
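The symbolic level mentioned in the abstract above relies on Allen relations between action intervals. As a rough illustration only, the following Python sketch classifies the relation between two intervals given their start and end times. It is a generic rendering of Allen's interval algebra, not the authors' implementation; the function name and the use of exact equality on timestamps are assumptions.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Return the Allen relation of interval A = [a_start, a_end]
    with respect to interval B = [b_start, b_end]."""
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must have positive duration")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining case: partial overlap with distinct starts and ends.
    return "overlaps" if a_start < b_start else "overlapped-by"


# Example: the left hand acts during [0, 4] and the right hand during [3, 5].
relation = allen_relation(0.0, 4.0, 3.0, 5.0)
```

With noisy demonstration timestamps, exact equality rarely holds, so a practical version would likely compare endpoints within a tolerance; that tolerance is not specified here.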
Spatial Calibration of Diffuse LiDARs
Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.
Feasibility Restoration under Conflicting STL Specifications with Pareto-Optimal Refinement
Signal Temporal Logic (STL) is an expressive formal language that specifies spatio-temporal requirements in robotics. Its quantitative robustness semantics can be easily integrated with optimization-based control frameworks. However, STL specifications may become conflicting in real-world applications, where safety rules, traffic regulations, and task objectives cannot all be satisfied together. In these situations, traditional STL-constrained Model Predictive Control (MPC) becomes infeasible and defaults to conservative behaviors such as freezing, which can greatly increase risk in safety-critical scenarios. In this paper, we propose a unified two-stage framework that first restores feasibility via minimal relaxation, then refines the feasible solution by formulating the refinement as a value-aware multi-objective optimization problem. Using the $\varepsilon$-constraint method, we approximate the Pareto front of the multi-objective optimization, which allows analysis of tradeoffs among competing objectives and counterfactual analysis of alternative actions. Through a case study in autonomous driving, we demonstrate that the proposed approach avoids deadlock under conflicting STL specifications and enables interpretable decision-making in safety-critical applications.
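The $\varepsilon$-constraint method named in the abstract above can be illustrated on a toy problem: minimize one objective while bounding the other by $\varepsilon$, then sweep $\varepsilon$ to trace an approximate Pareto front. The sketch below uses a small discrete candidate set with made-up objective values. It is a generic illustration under those assumptions, not the paper's continuous MPC formulation.

```python
def epsilon_constraint_front(candidates, epsilons):
    """Approximate the Pareto front of a two-objective minimization.

    candidates: list of (f1, f2) objective pairs, both to be minimized.
    epsilons:   upper bounds on f2 to sweep over.
    For each epsilon, pick the candidate with the smallest f1 among those
    with f2 <= epsilon (ties broken by smaller f2)."""
    front = []
    for eps in epsilons:
        feasible = [c for c in candidates if c[1] <= eps]
        if not feasible:
            continue  # this bound is too tight for any candidate
        best = min(feasible, key=lambda c: (c[0], c[1]))
        if best not in front:
            front.append(best)
    return sorted(front)


# Made-up objective values, e.g. (rule violation, task cost); illustrative only.
candidates = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0), (6.0, 1.0), (7.0, 3.0)]
epsilons = [1.0, 2.0, 5.0, 9.0]

front = epsilon_constraint_front(candidates, epsilons)
for f1, f2 in front:
    print(f"f1 = {f1}, f2 = {f2}")
```

Each point on the returned front is one tradeoff between the two objectives. The dominated candidates `(3.0, 6.0)` and `(7.0, 3.0)` are never selected, because another candidate is at least as good in both objectives.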
Failure Mechanisms and Risk Estimation for Legged Robot Locomotion on Granular Slopes
Locomotion on granular slopes such as sand dunes remains a fundamental challenge for legged robots due to reduced shear strength and gravity-induced anisotropic yielding of granular media. Using a hexapedal robot on a tiltable granular bed, we systematically measure locomotion speed together with slope-dependent normal and shear granular resistive forces. While normal penetration resistance remains nearly unchanged with inclination, shear resistance decreases substantially as slope angle increases. Guided by these measurements, we develop a simple robot-terrain interaction model that predicts anchoring timing, step length, and resulting robot speed, as functions of terrain strength and slope angle. The model reveals that slope-induced performance loss is primarily governed by delayed anchoring and increased backward slip rather than excessive sinkage. By extending the model to generalized terrain conditions, we construct failure phase diagrams that identify sinkage- and slippage-induced failure regimes, enabling quantitative risk estimation for locomotion on granular slopes. This physics-informed framework provides predictive insight into terrain-dependent failure mechanisms and offers guidance for safer and more robust robot operation on deformable inclines.
A Contrastive Fewshot RGBD Traversability Segmentation Framework for Indoor Robotic Navigation
Indoor traversability segmentation aims to identify safe, navigable free space for autonomous agents, which is critical for robotic navigation. Pure vision-based models often fail to detect thin obstacles, such as chair legs, which can pose serious safety risks. We propose a multi-modal segmentation framework that leverages RGB images and sparse 1D laser depth information to capture geometric interactions and improve the detection of challenging obstacles. To reduce the reliance on large labeled datasets, we adopt the few-shot segmentation (FSS) paradigm, enabling the model to generalize from limited annotated examples. Traditional FSS methods focus solely on positive prototypes, often leading to overfitting to the support set and poor generalization. To address this, we introduce a negative contrastive learning (NCL) branch that leverages negative prototypes (obstacles) to refine free-space predictions. Additionally, we design a two-stage attention depth module to align 1D depth vectors with RGB images both horizontally and vertically. Extensive experiments on our custom-collected indoor RGB-D traversability dataset demonstrate that our method outperforms state-of-the-art FSS and RGB-D segmentation baselines, achieving up to 9% higher mIoU under both 1-shot and 5-shot settings. These results highlight the effectiveness of leveraging negative prototypes and sparse depth for robust and efficient traversability segmentation.
LIPP: Load-Aware Informative Path Planning with Physical Sampling
In classical Informative Path Planning (C-IPP), robots are typically modeled as mobile sensors that acquire digital measurements such as images or radiation levels. In this model, since making a measurement leaves the robot's physical state unchanged, traversal costs are determined solely by the path taken. This is a natural assumption for many missions, but does not extend to settings involving physical sample collection, where each collected sample adds mass and increases the energy cost of all subsequent motion. As a result, IPP formulations that ignore this coupling between information gain and load-dependent traversal cost can produce plans that are distance-efficient but energy-suboptimal, collecting fewer samples and less data than the energy budget would permit. In this paper, we introduce Load-aware Informative Path Planning (LIPP), a generalization of C-IPP that explicitly models this coupling and the resulting order-dependent traversal costs. We formulate LIPP as a Mixed-Integer Quadratic Program (MIQP) that jointly optimizes routing, visitation order, and per-location sampling count under an energy budget. We show that LIPP strictly generalizes C-IPP: as sample unit mass $\lambda \to 0$, the load-dependent energy model reduces exactly to the classical distance budget constraint, recovering C-IPP as a special case. We further derive theoretical bounds on the path-length increase of LIPP relative to C-IPP, characterizing the trade-off for improved energy efficiency. Finally, through extensive simulations across 2000 diverse mission scenarios, we demonstrate that LIPP matches the behavior of C-IPP at zero sample mass and progressively achieves higher uncertainty reduction per unit energy as sample mass increases.
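The order dependence described above can be pictured with a small Python sketch. It assumes a hypothetical energy model in which each leg costs its Euclidean length multiplied by `1 + lam * load`, where `load` is the number of samples already on board; this linear form, the function name `tour_energy`, and the coordinates are illustrative assumptions, not the paper's MIQP formulation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def tour_energy(stops: List[Tuple[Point, int]], lam: float,
                start: Point = (0.0, 0.0)) -> float:
    """Energy of the closed tour start -> stops -> start.

    Each stop is (location, samples_collected_there). A leg costs
    distance * (1 + lam * load), with load counted in samples already
    carried when the leg begins (assumed model).
    """
    position, load, energy = start, 0, 0.0
    for point, n_samples in stops + [(start, 0)]:
        energy += math.dist(position, point) * (1.0 + lam * load)
        position, load = point, load + n_samples
    return energy


# Two sites on one ray from the depot; two samples are taken at site A only.
site_a, site_b = (3.0, 4.0), (6.0, 8.0)
a_first = [(site_a, 2), (site_b, 0)]   # carry the samples for 15 length units
b_first = [(site_b, 0), (site_a, 2)]   # carry the samples for 5 length units

energy_a_first = tour_energy(a_first, lam=0.5)
energy_b_first = tour_energy(b_first, lam=0.5)
length_only = tour_energy(a_first, lam=0.0)   # reduces to plain tour length
```

Both visiting orders have the same tour length, yet collecting the samples late is cheaper under this model, and setting `lam` to zero recovers the purely distance-based cost.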
CN-CBF: Composite Neural Control Barrier Function for Safe Robot Navigation in Dynamic Environments
Safe navigation of autonomous robots remains one of the core challenges in the field, especially in dynamic and uncertain environments. One of the prevalent approaches is safety filtering based on control barrier functions (CBFs), which are easy to deploy but difficult to design. Motivated by the shortcomings of existing learning- and model-based methods, we propose a simple yet effective neural CBF design method for safe robot navigation in dynamic environments. We employ the idea of a composite CBF, where multiple neural CBFs are combined into a single CBF. The individual CBFs are trained via the Hamilton-Jacobi reachability framework to approximate the optimal safe set for single moving obstacles. Additionally, we use the residual neural architecture, which guarantees that the estimated safe set does not intersect with the corresponding failure set. The method is extensively evaluated in simulation experiments for a ground robot and a quadrotor, comparing it against several baseline methods. The results show improved success rates of up to 18% compared to the best baseline, without increasing the conservativeness of the motion. Also, the method is demonstrated in hardware experiments for both types of robots.
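For readers unfamiliar with CBF-based safety filtering, here is a minimal Python sketch of the general idea, under simplifying assumptions that differ from the paper: a hand-written (not neural) barrier for a single static circular obstacle, single-integrator dynamics, and one constraint, for which the usual quadratic program has a closed-form projection. The function name `cbf_filter` and all numeric values are hypothetical.

```python
from typing import List, Sequence


def cbf_filter(x: Sequence[float], u_nom: Sequence[float],
               center: Sequence[float], radius: float,
               alpha: float = 1.0) -> List[float]:
    """Minimally modify u_nom so that grad_h(x) . u >= -alpha * h(x).

    Assumes single-integrator dynamics x' = u and the barrier
    h(x) = ||x - center||^2 - radius^2 (safe when h >= 0).
    Assumes x != center, so the gradient is nonzero.
    """
    dx = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in dx) - radius ** 2
    grad_h = [2.0 * d for d in dx]
    bound = -alpha * h
    violation = bound - sum(g * u for g, u in zip(grad_h, u_nom))
    if violation <= 0.0:
        return list(u_nom)  # nominal command already satisfies the constraint
    # Closed-form projection onto the half-space grad_h . u >= bound.
    scale = violation / sum(g * g for g in grad_h)
    return [u + scale * g for u, g in zip(u_nom, grad_h)]


# Robot at (2, 0), unit-radius obstacle at the origin (assumed values).
toward = cbf_filter((2.0, 0.0), (-2.0, 1.0), (0.0, 0.0), 1.0)
away = cbf_filter((2.0, 0.0), (1.0, 0.0), (0.0, 0.0), 1.0)
```

The command heading toward the obstacle is slowed along the obstacle direction only, while the command heading away passes through unchanged, which is the "filter" behavior that makes CBFs easy to deploy on top of an existing controller.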
SurgSync: Time-Synchronized Multi-Modal Data Collection Framework and Dataset for Surgical Robotics ICRA
Most existing robotic surgery systems adopt a human-in-the-loop paradigm, often with the surgeon directly teleoperating the robotic system. Adding intelligence to these robots would enable higher-level control, such as supervised autonomy or even full autonomy. However, artificial intelligence (AI) requires large amounts of training data, which is currently lacking. This work proposes SurgSync, a multi-modal data collection framework with offline and online synchronization to support training and real-time inference, respectively. The framework is implemented on a da Vinci Research Kit (dVRK) and introduces (1) dual-mode (online/offline-matching) synchronized recorders, (2) a modern stereo endoscope to achieve image quality on par with clinical systems, and (3) additional sensors such as a side-view camera and a novel capacitive contact sensor to provide ground truth contact data. The framework also incorporates a post-processing toolbox for tasks such as depth estimation, optical flow, and a practical kinematic reprojection method using Gaussian heatmap. User studies with participants of varying skill levels are performed with ex-vivo tissue to provide clinically realistic data, and a network for surgical skill assessment is employed to demonstrate utilization of the collected data. Through the user study experiments, we obtained a dataset of 214 validated instances across multiple canonical training tasks. All software and data are available at surgsync.github.io.
comment: Accepted by the International Conference on Robotics and Automation (ICRA), IEEE, 2026. More details can be found at https://surgsync.github.io/
T2Nav Algebraic Topology Aware Temporal Graph Memory and Loop Detection for ZeroShot Visual Navigation
Deploying autonomous agents in real-world environments is challenging, particularly for navigation, where systems must adapt to situations they have not encountered before. Traditional learning approaches require substantial amounts of data, constant tuning, and, sometimes, starting over for each new task. That makes them hard to scale and not very flexible. Recent breakthroughs in foundation models, such as large language models and vision-language models, enable systems to attempt new navigation tasks without requiring additional training. However, many of these methods only work with specific input types, employ relatively basic reasoning, and fail to fully exploit the details they observe or the structure of the spaces. Here, we introduce T2Nav, a zero-shot navigation system that integrates heterogeneous data and employs graph-based reasoning. By directly incorporating visual information into the graph and matching it to the environment, our approach enables the system to strike a good balance between exploration and goal attainment. This strategy allows robust obstacle avoidance, reliable loop closure detection, and efficient path planning while eliminating redundant exploration patterns. The system demonstrates flexibility by handling goals specified using reference images of target object instances, making it particularly suitable for scenarios in which agents must navigate to visually similar yet spatially distinct instances. Experiments demonstrate that our approach is efficient and adapts well to unknown environments, moving toward practical zero-shot instance-image navigation capabilities.
SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation
Object navigation (ObjectNav) in real-world environments is a complex problem that requires simultaneously addressing multiple challenges, including complex spatial structure, long-horizon planning and semantic understanding. Recent advances in Vision-Language Models (VLMs) offer promising capabilities for semantic understanding, yet effectively integrating them into real-world navigation systems remains a non-trivial challenge. In this work, we formulate real-world ObjectNav as a system-level problem and introduce SysNav, a three-level ObjectNav system designed for real-world cross-embodiment deployment. SysNav decouples semantic reasoning, navigation planning and motion control to ensure robustness and generalizability. At the high level, we summarize the environment into a structured scene representation and leverage VLMs to provide semantic-grounded navigation guidance. At the mid level, we introduce a hierarchical room-based navigation strategy that reserves VLM guidance for room-level decisions, which effectively utilizes its reasoning ability while ensuring system efficiency. At the low level, planned waypoints are executed through different embodiment-specific motion control modules. We deploy our system on three embodiments, a custom-built wheeled robot, the Unitree Go2 quadruped and the Unitree G1 humanoid, and conduct 190 real-world experiments. Our system achieves substantial improvements in both success rate and navigation efficiency. To the best of our knowledge, SysNav is the first system capable of reliably and efficiently completing building-scale long-range object navigation in complex real-world environments. Furthermore, extensive experiments on four simulation benchmarks demonstrate state-of-the-art performance. The project page is available at: https://cmu-vln.github.io/.
Collaborative Planning with Concurrent Synchronization for Operationally Constrained UAV-UGV Teams
Collaborative planning under operational constraints is an essential capability for heterogeneous robot teams tackling complex large-scale real-world tasks. Unmanned Aerial Vehicles (UAVs) offer rapid environmental coverage, but flight time is often limited by energy constraints, whereas Unmanned Ground Vehicles (UGVs) have greater energy capacity to support long-duration missions, but movement is constrained by traversable terrain. Individually, neither can complete tasks such as environmental monitoring. Effective UAV-UGV collaboration therefore requires energy-constrained multi-UAV task planning, traversability-constrained multi-UGV path planning, and crucially, synchronized concurrent co-planning to ensure timely in-mission recharging. To enable these capabilities, we propose Collaborative Planning with Concurrent Synchronization (CoPCS), a learning-based approach that integrates a heterogeneous graph transformer for operationally constrained task encoding with a transformer decoder for joint, synchronized co-planning that enables UAVs and UGVs to act concurrently in a coordinated manner. CoPCS is trained end-to-end under a unified imitation learning paradigm. We conducted extensive experiments to evaluate CoPCS in both robotic simulations and physical robot teams. Experimental results demonstrate that our method provides the novel multi-robot capability of synchronized concurrent co-planning and substantially improves team performance. More details of this work are available on the project website: https://hcrlab.gitlab.io/project/CoPCS.
VertiAdaptor: Online Kinodynamics Adaptation for Vertically Challenging Terrain
Autonomous driving in off-road environments presents significant challenges due to the dynamic and unpredictable nature of unstructured terrain. Traditional kinodynamic models often struggle to generalize across diverse geometric and semantic terrain types, underscoring the need for real-time adaptation to ensure safe and reliable navigation. We propose VertiAdaptor (VA), a novel online adaptation framework that efficiently integrates elevation with semantic embeddings to enable terrain-aware kinodynamic modeling and planning via function encoders. VA learns a kinodynamic space spanned by a set of neural ordinary differential equation basis functions, capturing complex vehicle-terrain interactions across varied environments. After offline training, the proposed approach can rapidly adapt to new, unseen environments by identifying kinodynamics in the learned space through a computationally efficient least-squares calculation. We evaluate VA within the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance both in simulation and on a physical Verti-4-Wheeler platform. Our results demonstrate that VA improves prediction accuracy by up to 23.9% and achieves a 5X faster adaptation time, advancing the robustness and reliability of autonomous robots in complex and evolving off-road environments.
Material Driven HRI Design: Aesthetics as Explainability
Aesthetics, often treated as secondary to function, guides how people interpret robots' roles. Many robot designs, both real and fictitious, use sleek industrial aesthetics. These feature hard glossy plastics that hide as much of the underlying mechanical and electrical components as possible, resembling something akin to a nude humanoid figure. This leaves robots as something of a blank slate to which end-users apply coverings, often based on media of fiction and non-fiction alike. We argue that designers can take cues from fashion to design interaction and set appropriate expectations. Rather than viewing appearance as decoration, we propose that color, texture, and material choices function as interaction signals. These signals can invite or discourage touch, clarify a robot's role, and help align user expectations with a robot's actual capabilities. When done thoughtfully, such cues can create familiarity and legibility; when done poorly, they can lead to wrong expectations. This preliminary paper proposes a framework describing how materials can create explainability by signaling expectations for interaction, task, and environment. We use this framework to conduct a content analysis of 6 robots.
comment: 4 pages, 1 table, 2026 ACM/IEEE Human-Robot Interaction Conference Workshop on Articulating the Value of Design Research for HRI
CAR: Cross-Vehicle Kinodynamics Adaptation via Mobility Representation
Developing autonomous off-road mobility typically requires either extensive, platform-specific data collection or relies on simplified abstractions, such as unicycle or bicycle models, that fail to capture the complex kinodynamics of diverse platforms, ranging from wheeled to tracked vehicles. This limitation hinders scalability across evolving heterogeneous autonomous robot fleets. To address this challenge, we propose Cross-vehicle kinodynamics Adaptation via mobility Representation (CAR), a novel framework that enables rapid mobility transfer to new vehicles. CAR employs a Transformer encoder with Adaptive Layer Normalization to embed vehicle trajectory transitions and physical configurations into a shared mobility latent space. By identifying and extracting commonality from nearest neighbors within this latent space, our approach enables rapid kinodynamics adaptation to novel platforms with minimal data collection and computational overhead. We evaluate CAR using the Verti-Bench simulator, built on the Chrono multi-physics engine, and validate its performance on four distinct physical configurations of the Verti-4-Wheeler platform. With only one minute of new trajectory data, CAR achieves up to 67.2% reduction in prediction error compared to direct neighbor transfer across diverse unseen vehicle configurations, demonstrating the effectiveness of cross-vehicle mobility knowledge transfer in both simulated and real-world environments.
Robodimm: A Physics-Grounded Framework for Automated Actuator Sizing in Scalable Modular Robots
Selecting an appropriate motor-gearbox combination is a critical design task in robotics because it directly affects cost, mass, and dynamic performance. This process is especially challenging in modular robots with closed kinematic chains, where joint torques are coupled and actuator inertia propagates through the mechanism. We present Robodimm, a software framework for automated actuator sizing in scalable robot architectures. By leveraging Pinocchio for dynamics and Pink for inverse kinematics, Robodimm uses a Karush-Kuhn-Tucker (KKT) formulation for constrained inverse dynamics. The platform supports parametric scaling, interactive trajectory programming through jog modes, and a two-round validation workflow that addresses actuator self-weight effects.
comment: 8 pages, 3 figures. Preprint version submitted to arXiv
Nonlinear Performance Degradation of Vision-Based Teleoperation under Network Latency
Teleoperation is increasingly being adopted as a critical fallback for autonomous vehicles. However, the impact of network latency on vision-based, perception-driven control remains insufficiently studied. The present work investigates the nonlinear degradation of closed-loop stability in camera-based lane keeping under varying network delays. To conduct this study, we developed the Latency-Aware Vision Teleoperation testbed (LAVT), a research-oriented ROS 2 framework that enables precise, distributed one-way latency measurement and reproducible delay injection. Using LAVT, we performed 180 closed-loop experiments in simulation across diverse road geometries. Our findings reveal a sharp collapse in stability between 150 ms and 225 ms of one-way perception latency, where route completion rates drop from 100% to below 50% as oscillatory instability and phase-lag effects emerge. We further demonstrate that additional control-channel delay compounds these effects, significantly accelerating system failure even under constant visual latency. By combining this systematic empirical characterization with the LAVT testbed, this work provides quantitative insights into perception-driven instability and establishes a reproducible baseline for future latency-compensation and predictive control strategies. Project page, supplementary video, and code are available at https://bimilab.github.io/paper-LAVT
MotionBits: Video Segmentation through Motion-Level Analysis of Rigid Bodies
Rigid bodies constitute the smallest manipulable elements in the real world, and understanding how they physically interact is fundamental to embodied reasoning and robotic manipulation. Thus, accurate detection, segmentation, and tracking of moving rigid bodies is essential for enabling reasoning modules to interpret and act in diverse environments. However, current segmentation models trained on semantic grouping are limited in their ability to provide meaningful interaction-level cues for completing embodied tasks. To address this gap, we introduce MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion-based segmentation through kinematic spatial twist equivalence, independent of semantics. In this paper, we contribute (1) the MotionBit concept and definition, (2) a hand-labeled benchmark, called MoRiBo, for evaluating moving rigid-body segmentation across robotic manipulation and human-in-the-wild videos, and (3) a learning-free graph-based MotionBits segmentation method that outperforms state-of-the-art embodied perception methods by 37.3% in macro-averaged mIoU on the MoRiBo benchmark. Finally, we demonstrate the effectiveness of MotionBits segmentation for downstream embodied reasoning and manipulation tasks, highlighting its importance as a fundamental primitive for understanding physical interactions.
comment: 23 pages, 18 figures
RoboCritics: Enabling Reliable End-to-End LLM Robot Programming through Expert-Informed Critics
End-user robot programming grants users the flexibility to re-task robots in situ, yet it remains challenging for novices due to the need for specialized robotics knowledge. Large Language Models (LLMs) hold the potential to lower the barrier to robot programming by enabling task specification through natural language. However, current LLM-based approaches generate opaque, "black-box" code that is difficult to verify or debug, creating tangible safety and reliability risks in physical systems. We present RoboCritics, an approach that augments LLM-based robot programming with expert-informed motion-level critics. These critics encode robotics expertise to analyze motion-level execution traces for issues such as joint speed violations, collisions, and unsafe end-effector poses. When violations are detected, critics surface transparent feedback and offer one-click fixes that forward structured messages back to the LLM, enabling iterative refinement while keeping users in the loop. We instantiated RoboCritics in a web-based interface connected to a UR3e robot and evaluated it in a between-subjects user study (n=18). Compared to a baseline LLM interface, RoboCritics reduced safety violations, improved execution quality, and shaped how participants verified and refined their programs. Our findings demonstrate that RoboCritics enables more reliable and user-centered end-to-end robot programming with LLMs.
comment: 10 pages, 5 figures, Proceedings of the 21st ACM/IEEE International Conference on Human Robot Interaction (HRI 2026)
Receding-Horizon Nullspace Optimization for Actuation-Aware Control Allocation in Omnidirectional UAVs
Fully actuated omnidirectional UAVs enable independent control of forces and torques along all six degrees of freedom, broadening the operational envelope for agile flight and aerial interaction tasks. However, conventional control allocation methods neglect the asymmetric dynamics of the onboard actuators, which can induce oscillatory motor commands and degrade trajectory tracking during dynamic maneuvers. This work proposes a receding-horizon, actuation-aware allocation strategy that explicitly incorporates asymmetric motor dynamics and exploits the redundancy of over-actuated platforms through nullspace optimization. By forward-simulating the closed-loop system over a prediction horizon, the method anticipates actuator-induced oscillations and suppresses them through smooth redistribution of motor commands, while preserving the desired body wrench exactly. The approach is formulated as a constrained optimal control problem solved online via Constrained iterative LQR. Simulation results on the OmniOcta platform demonstrate that the proposed method significantly reduces motor command oscillations compared to a conventional single-step quadratic programming allocator, yielding improved trajectory tracking in both position and orientation.
comment: 8 pages, 8 figures
Learning-Based Robust Control: Unifying Exploration and Distributional Robustness for Reliable Robotics via Free Energy
A key challenge towards reliable robotic control is devising computational models that can both learn policies and guarantee robustness when deployed in the field. To address this challenge, and inspired by the free energy principle in computational neuroscience, we propose a model for policy computation that jointly learns environment dynamics and rewards, while ensuring robustness to epistemic uncertainties. Expounding a distributionally robust free energy principle, we propose a modification to the maximum diffusion learning framework. After explicitly characterizing robustness of our policies to epistemic uncertainties in both environment and reward, we validate their effectiveness on continuous-control benchmarks, via both simulations and real-world experiments involving manipulation with a Franka Research 3 arm. Across simulation and zero-shot deployment, our approach narrows the sim-to-real gap and enables repeatable tabletop manipulation without task-specific fine-tuning.
A Comprehensive Analysis of the Effects of Network Quality of Service on Robotic Telesurgery
The viability of long-distance telesurgery hinges on reliable network Quality of Service (QoS), yet the impact of realistic network degradations on task performance is not sufficiently understood. This paper presents a comprehensive analysis of how packet loss, delay, and communication loss affect telesurgical task execution. We introduce NetFI, a novel fault injection tool that emulates different network conditions using stochastic QoS models informed by real-world network data. By integrating NetFI with a surgical simulation platform, we conduct a user study involving 15 participants at three proficiency levels, performing a standardized Peg Transfer task under varying levels of packet loss, delay, and communication loss. We analyze the effect of network QoS on overall task performance and on fine-grained motion primitives (MPs), using objective performance and safety metrics together with operators' subjective perception of workload. We identify specific MPs vulnerable to network degradation and find strong correlations between proficiency, objective performance, and subjective workload. These findings offer quantitative insights into the operational boundaries of telesurgery. Our open-source tools and annotated dataset provide a foundation for developing robust and network-aware control and mitigation strategies.
comment: Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
A Multi-Layer Sim-to-Real Framework for Gaze-Driven Assistive Neck Exoskeletons ICRA
Dropped head syndrome, caused by neck muscle weakness from neurological diseases, severely impairs an individual's ability to support and move their head, causing pain and making everyday tasks challenging. Our long-term goal is to develop an assistive powered neck exoskeleton that restores natural movement. However, predicting a user's intended head movement remains a key challenge. We leverage virtual reality (VR) to collect coupled eye and head movement data from healthy individuals to train models capable of predicting head movement based solely on eye gaze. We also propose a novel multi-layer controller selection framework, where head control strategies are evaluated across decreasing levels of abstraction -- from simulation and VR to a physical neck exoskeleton. This pipeline effectively rejects poor-performing controllers early, identifying two novel gaze-driven models that achieve strong performance when deployed on the physical exoskeleton. Our results reveal that no single controller is universally preferred, highlighting the necessity for personalization in gaze-driven assistive control. Our work demonstrates the utility of VR-based evaluation for accelerating the development of intuitive, safe, and personalized assistive robots.
comment: IEEE International Conference on Robotics & Automation (ICRA), 2026. Equal Contribution from the first two authors
HybridMimic: Hybrid RL-Centroidal Control for Humanoid Motion Mimicking
Motion mimicking, i.e., encouraging the control policy to mimic human motion, facilitates the learning of complex tasks via reinforcement learning (RL) for humanoid robots. Although standard RL frameworks demonstrate impressive locomotion agility, they often bypass explicit reasoning about robot dynamics during deployment, which is a design choice that can lead to physically infeasible commands when the robot encounters out-of-distribution environments. By integrating model-based principles, hybrid approaches can improve performance; however, existing methods typically rely on predefined contact timing, limiting their versatility. This paper introduces HybridMimic, a framework in which a learned policy dynamically modulates a centroidal-model-based controller by predicting continuous contact states and desired centroidal velocities. This architecture exploits the physical grounding of centroidal dynamics to generate feedforward torques that remain feasible even under domain shift. Using physics-informed rewards, the policy is trained to efficiently utilize the centroidal controller's optimization by outputting precise control targets and reference torques. Through hardware experiments on the Booster T1 humanoid, HybridMimic reduces the average base position tracking error by 13% compared to a state-of-the-art RL baseline, demonstrating the robustness of dynamics-aware deployment.
Stability-Guided Exploration for Diverse Motion Generation
Scaling up datasets is highly effective in improving the performance of deep learning models, including in the field of robot learning. However, data collection still proves to be a bottleneck. Approaches relying on collecting human demonstrations are labor-intensive and inherently limited: they tend to be narrow, task-specific, and fail to adequately explore the full space of feasible states. Synthetic data generation could remedy this, but current techniques mostly rely on local trajectory optimization and fail to find diverse solutions. In this work, we propose a novel method capable of finding diverse long-horizon manipulations through black-box simulation. We achieve this by combining an RRT-style search with sampling-based MPC, together with a novel sampling scheme that guides the exploration toward stable configurations. Specifically, we sample from a manifold of stable states while growing a search tree directly through simulation, without restricting the planner to purely stable motions. We demonstrate the method's ability to discover diverse manipulation strategies, including pushing, grasping, pivoting, throwing, and tool use, across different robot morphologies, without task-specific guidance.
Gradient-based Nested Co-Design of Aerodynamic Shape and Control for Winged Robots
Designing aerial robots for specialized tasks, from perching to payload delivery, requires tailoring their aerodynamic shape to specific mission requirements. For tasks involving wide flight envelopes, the usual sequential process of first determining the shape and then the motion planner is likely to be suboptimal due to the inherent nonlinear interactions between them. This limitation has motivated co-design research, which jointly optimizes the aerodynamic shape and the motion planner. In this paper, we present a general-purpose, gradient-based, nested co-design framework where the motion planner solves an optimal control problem and the aerodynamic forces used in the dynamics model are determined by a neural surrogate model. This enables us to model complex subsonic flow conditions encountered in aerial robotics and to overcome the limited applicability of existing co-design methods, which stems from the simplifying assumptions they impose on either the planner or the aerodynamics for computational tractability. We validate our method on two complex dynamic tasks for fixed-wing gliders: perching and a short landing. Our optimized designs improve task performance compared to an evolutionary baseline in a fraction of the computation time.
Robotic Foundation Models for Industrial Control: A Comprehensive Survey and Readiness Assessment Framework
Robotic foundation models (RFMs) are emerging as a promising route towards flexible, instruction- and demonstration-driven robot control; however, a critical investigation of their industrial applicability is still lacking. This survey gives an extensive overview of the RFM landscape and analyses, driven by concrete implications, how industrial domains and use cases shape the requirements of RFMs, with particular focus on collaborative robot platforms, heterogeneous sensing and actuation, edge-computing constraints, and safety-critical operation. We synthesise industrial deployment perspectives into eleven interdependent implications and operationalise them into an assessment framework comprising a catalogue of 149 concrete criteria, spanning both model capabilities and ecosystem requirements. Using this framework, we evaluate 324 manipulation-capable RFMs via 48,276 criterion-level decisions obtained through a conservative LLM-assisted evaluation pipeline, validated against expert judgements. The results indicate that industrial maturity is limited and uneven: even the highest-rated models satisfy only a fraction of criteria and typically exhibit narrow implication-specific peaks rather than integrated coverage. We conclude that progress towards industry-grade RFMs depends less on isolated benchmark successes than on systematic incorporation of safety, real-time feasibility, robust perception, interaction, and cost-effective system integration into auditable deployment stacks.
Improved Constrained Generation by Bridging Pretrained Generative Models
Constrained generative modeling is fundamental to applications such as robotic control and autonomous driving, where models must respect physical laws and safety-critical constraints. In real-world settings, these constraints rarely take the form of simple linear inequalities, but instead define complex feasible regions that resemble road maps or other structured spatial domains. We propose a constrained generation framework that generates samples directly within such feasible regions while preserving realism. Our method fine-tunes a pretrained generative model to enforce constraints while maintaining generative fidelity. Experimentally, our method exhibits characteristics distinct from existing fine-tuning and training-free constrained baselines, revealing a new compromise between constraint satisfaction and sampling quality.
Whole-Body Model-Predictive Control of Legged Robots with MuJoCo ICRA 2026
We demonstrate the surprising real-world effectiveness of a very simple approach to whole-body model-predictive control (MPC) of quadruped and humanoid robots: the iterative LQR (iLQR) algorithm with MuJoCo dynamics and finite-difference approximated derivatives. Building upon the previous success of model-based behavior synthesis and control of locomotion and manipulation tasks with MuJoCo in simulation, we show that these policies can easily generalize to the real world with few sim-to-real considerations. Our baseline method achieves real-time whole-body MPC on a variety of hardware experiments, including dynamic quadruped locomotion, quadruped walking on two legs, and full-sized humanoid bipedal locomotion. We hope this easy-to-reproduce hardware baseline lowers the barrier to entry for real-world whole-body MPC research and contributes to accelerating research velocity in the community. Our code and experiment videos will be available online at: https://johnzhang3.github.io/mujoco_ilqr
comment: to appear at ICRA 2026
CAPS: Context-Aware Priority Sampling for Enhanced Imitation Learning in Autonomous Driving ICRA 2026
In this paper, we introduce Context-Aware Priority Sampling (CAPS), a novel method designed to enhance data efficiency in learning-based autonomous driving systems. CAPS addresses the challenge of imbalanced datasets in imitation learning by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs). In this way, we can get structured and interpretable data representations, which help to reveal meaningful patterns in the data. These patterns are used to group the data into clusters, with each sample being assigned a cluster ID. The cluster IDs are then used to re-balance the dataset, ensuring that rare yet valuable samples receive higher priority during training. We evaluate our method through closed-loop experiments in the CARLA simulator. The results on Bench2Drive scenarios demonstrate the effectiveness of CAPS in enhancing model generalization, with substantial improvements in both driving score and success rate.
comment: Accepted at IEEE International Conference on Robotics & Automation (ICRA 2026)
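The re-balancing step described in the CAPS abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cluster IDs have already been produced (for example by a VQ-VAE codebook lookup) and uses plain inverse-cluster-frequency weighting, which is only one possible way to give rare clusters higher priority. The function names `cluster_balanced_weights` and `sample_indices` are hypothetical.

```python
import random
from collections import Counter


def cluster_balanced_weights(cluster_ids):
    """Return one normalized sampling weight per sample.

    Assumption: priority is inversely proportional to the size of the
    sample's cluster, so every cluster receives equal total probability.
    """
    counts = Counter(cluster_ids)
    raw = [1.0 / counts[c] for c in cluster_ids]
    total = sum(raw)
    return [w / total for w in raw]


def sample_indices(cluster_ids, k, seed=0):
    """Draw k sample indices (with replacement) using the balanced weights."""
    rng = random.Random(seed)
    weights = cluster_balanced_weights(cluster_ids)
    return rng.choices(range(len(cluster_ids)), weights=weights, k=k)


# Three samples from a common cluster (0) and one from a rare cluster (1).
ids = [0, 0, 0, 1]
weights = cluster_balanced_weights(ids)
batch = sample_indices(ids, k=8)
```

With this weighting, the single rare sample carries as much probability mass as the three common samples combined, so it is drawn far more often than under uniform sampling.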
ROScopter: A Multirotor Autopilot based on ROSflight 2.0
ROScopter is a lean multirotor autopilot built for researchers. ROScopter seeks to accelerate simulation and hardware testing of research code with an architecture that is both easy to understand and simple to modify. ROScopter is designed to interface with ROSflight 2.0 and runs entirely on an onboard flight computer, leveraging the features of ROS 2 to improve modularity. This work describes the architecture of ROScopter and how it can be used to test application code in both simulated and hardware environments. Hardware results of the default ROScopter behavior are presented, showing that ROScopter achieves similar performance to another state-of-the-art autopilot for basic waypoint-following maneuvers, but with a significantly reduced and more modular code-base.
comment: Submitted to the 2026 International Conference on Unmanned Aerial Systems
ROSplane 2.0: A Fixed-Wing Autopilot for Research
Unmanned aerial vehicle (UAV) research requires the integration of cutting-edge technology into existing autopilot frameworks. This process can be arduous, requiring extensive resources, time, and detailed knowledge of the existing system. ROSplane is a lean, open-source fixed-wing autonomy stack built by researchers for researchers. It is designed to accelerate research by providing clearly defined interfaces with an easily modifiable framework. Built around ROS 2, ROSplane allows for rapid integration of low or high-level control, path planning, or estimation algorithms. A focus on lean, easily-understood code and extensive documentation lowers the barrier to entry for researchers. Recent developments to ROSplane improve its capacity to accelerate UAV research, including the transition from ROS 1 to ROS 2, enhanced estimation and control algorithms, increased modularity, and an improved aerodynamic modeling pipeline. This aerodynamic modeling pipeline significantly reduces the effort of transitioning from simulation to real-world testing without requiring costly system identification or computational fluid dynamics tools. ROSplane's architecture reduces the effort required to integrate new research tools and methods, expediting hardware experimentation.
comment: Submitted to the 2026 International Conference on Unmanned Aerial Systems
ROSflight 2.0: Lean ROS 2-Based Autopilot for Unmanned Aerial Vehicles
ROSflight is a lean, open-source autopilot ecosystem for unmanned aerial vehicles (UAVs). Designed by researchers for researchers, it is built to lower the barrier to entry to UAV research and accelerate the transition from simulation to hardware experiments by maintaining a lean (not full-featured), well-documented, and modular codebase. This publication builds on previous treatments and describes significant additions to the architecture that improve the modularity and usability of ROSflight, including the transition from ROS 1 to ROS 2, supported hardware, low-level actuator mixing, and the simulation environment. We believe that these changes improve the usability of ROSflight and enable ROSflight to accelerate research in areas like advanced-air mobility. Hardware results are provided, showing that ROSflight is able to control a multirotor over a serial connection at 400 Hz while closing all control loops on the companion computer.
comment: Submitted to the 2026 International Conference on Unmanned Aerial Systems
ROS-related Robotic Systems Development with V-model-based Application of MeROS Metamodel
Systems built on the Robot Operating System (ROS) are increasingly easy to assemble, yet hard to govern and reliably coordinate. Beyond the sheer number of subsystems involved, the difficulty stems from their diversity and interaction depth. In this paper, we use a compact heterogeneous robotic system (HeROS), combining mobile and manipulation capabilities, as a demonstration vehicle under dynamically changing tasks. Notably, all its subsystems are powered by ROS. The use of compatible interfaces and other ROS integration capabilities simplifies the construction of such systems. However, this only addresses part of the complexity: the semantic coherence and structural traceability are even more important for precise coordination and call for deliberate engineering methods. The Model-Based Systems Engineering (MBSE) discipline, which emerged from the experience of complexity management in large-scale engineering domains, offers the methodological foundations needed. Despite their strengths in complementary aspects of robotics systems engineering, the lack of a unified approach to integrate ROS and MBSE hinders the full potential of these tools. Motivated by the anticipated impact of such a synergy in robotics practice, we propose a structured methodology based on MeROS - a SysML metamodel created specifically to put the ROS-based systems into the focus of the MBSE workflow. As its methodological backbone, we adapt the well-known V-model to this context, illustrating how complex robotic systems can be designed with traceability and validation capabilities embedded into their lifecycle using practices familiar to engineering teams.
comment: 22 pages
InsSo3D: Inertial Navigation System and 3D Sonar SLAM for turbid environment inspection
This paper presents InsSo3D, an accurate and efficient method for large-scale 3D Simultaneous Localisation and Mapping (SLAM) using a 3D Sonar and an Inertial Navigation System (INS). Unlike traditional sonar, which produces 2D images containing range and azimuth information but lacks elevation information, 3D Sonar produces a 3D point cloud, which therefore does not suffer from elevation ambiguity. We introduce a robust and modern SLAM framework adapted to the 3D Sonar data using INS as prior, detecting loop closure and performing pose graph optimisation. We evaluated InsSo3D performance inside a test tank with access to ground truth data and in an outdoor flooded quarry. Comparisons to reference trajectories and maps obtained from an underwater motion tracking system and visual Structure From Motion (SFM) demonstrate that InsSo3D efficiently corrects odometry drift. The average trajectory error is below 21cm during a 50-minute-long mission, producing a map of 10m by 20m with a 9cm average reconstruction error, enabling safe inspection of natural or artificial underwater structures even in murky water conditions.
VISO: Robust Underwater Visual-Inertial-Sonar SLAM with Photometric Rendering for Dense 3D Reconstruction
Visual challenges in underwater environments significantly hinder the accuracy of vision-based localisation and high-fidelity dense reconstruction. In this paper, we propose VISO, a robust underwater SLAM system that fuses a stereo camera, an inertial measurement unit (IMU), and a 3D sonar to achieve accurate 6-DoF localisation and enable efficient dense 3D reconstruction with high photometric fidelity. We introduce a coarse-to-fine online calibration approach for extrinsic parameter estimation between the 3D sonar and the camera. Additionally, a photometric rendering strategy is proposed for the 3D sonar point cloud to enrich the sonar map with visual information. Extensive experiments in a laboratory tank and an open lake demonstrate that VISO surpasses current state-of-the-art underwater and visual-based SLAM algorithms in terms of localisation robustness and accuracy, while also exhibiting real-time dense 3D reconstruction performance comparable to the offline dense mapping method.
Taxonomy-aware Dynamic Motion Generation on Hyperbolic Manifolds ICRA
Human-like motion generation for robots often draws inspiration from biomechanical studies, which commonly categorize complex human motions into hierarchical taxonomies. While these taxonomies provide rich structural information about how movements relate to one another, this information is frequently overlooked in motion generation models, leading to a disconnect between the generated motions and their underlying hierarchical structure. This paper introduces the Gaussian Process Hyperbolic Dynamical Model (GPHDM), a novel approach that learns latent representations preserving both the hierarchical structure of motions and their temporal dynamics to ensure physical consistency. Our model achieves this by extending the dynamics prior of the Gaussian Process Dynamical Model (GPDM) to the hyperbolic manifold and integrating it with taxonomy-aware inductive biases. Building on this geometry- and taxonomy-aware framework, we propose three novel mechanisms for generating motions that are both taxonomically structured and physically consistent: two probabilistic recursive approaches and a method based on pullback-metric geodesics. Experiments on generating realistic motion sequences on the hand grasping taxonomy show that the proposed GPHDM faithfully encodes the underlying taxonomy and temporal dynamics, and that it generates novel physically consistent trajectories.
comment: Accepted for publication in IEEE Conference on Robotics and Automation (ICRA), 8 pages, 6 figures, 1 table
Contact-Safe Reinforcement Learning with ProMP Reparameterization and Energy Awareness
Reinforcement learning (RL) approaches based on Markov Decision Processes (MDPs) are predominantly applied in the robot joint space, often relying on limited task-specific information and partial awareness of the 3D environment. In contrast, episodic RL has demonstrated advantages over traditional MDP-based methods in terms of trajectory consistency, task awareness, and overall performance in complex robotic tasks. Moreover, traditional step-wise and episodic RL methods often neglect the contact-rich information inherent in task-space manipulation, particularly with respect to contact safety and robustness. In this work, contact-rich manipulation tasks are tackled using a task-space, energy-safe framework, where reliable and safe task-space trajectories are generated through the combination of Proximal Policy Optimization (PPO) and movement primitives. Furthermore, an energy-aware Cartesian Impedance Controller objective is incorporated within the proposed framework to ensure safe interactions between the robot and the environment. Our experimental results demonstrate that the proposed framework outperforms existing methods in handling tasks on various types of surfaces in 3D environments, achieving high success rates as well as smooth trajectories and energy-safe interactions.
comment: 8 pages
FALCON: Future-Aware Learning with Contextual Object-Centric Pretraining for UAV Action Recognition
We introduce FALCON, a unified self-supervised video pretraining approach for UAV action recognition from raw RGB aerial footage, requiring no additional preprocessing at inference. UAV videos exhibit severe spatial imbalance: large, cluttered backgrounds dominate the field of view, causing reconstruction-based pretraining to waste capacity on uninformative regions and under-learn action-relevant human/object cues. FALCON addresses this by integrating object-aware masked autoencoding with object-centric dual-horizon future reconstruction. Using detections only during pretraining, we construct objectness priors that (i) enforce balanced token visibility during masking and (ii) concentrate reconstruction supervision on action-relevant regions, preventing learning from being dominated by background appearance. To promote temporal dynamics learning, we further reconstruct short- and long-horizon future content within an object-centric supervision region, injecting anticipatory temporal supervision that is robust to noisy aerial context. Across UAV benchmarks, FALCON improves top-1 accuracy by 2.9% on NEC-Drone and 5.8% on UAV-Human with a ViT-B backbone, while achieving 2$\times$--5$\times$ faster inference than supervised approaches that rely on heavy test-time augmentation.
Decision-Driven Semantic Object Exploration for Legged Robots via Confidence-Calibrated Perception and Topological Subgoal Selection
Conventional navigation pipelines for legged robots remain largely geometry-centric, relying on dense SLAM representations that are fragile under rapid motion and offer limited support for semantic decision making in open-world exploration. In this work, we focus on decision-driven semantic object exploration, where the primary challenge is not map consistency but how noisy and heterogeneous semantic observations can be transformed into stable and executable exploration decisions. We propose a vision-based approach that explicitly addresses this problem through confidence-calibrated semantic evidence arbitration, a controlled-growth semantic topological memory, and a semantic utility-driven subgoal selection mechanism. These components enable the robot to accumulate task-relevant semantic knowledge over time and select exploration targets that balance semantic relevance, reliability, and reachability, without requiring dense geometric reconstruction. Extensive experiments in both simulation and real-world environments demonstrate that the proposed mechanisms consistently improve the quality of semantic decision inputs, subgoal selection accuracy, and overall exploration performance on legged robots.
OmniDP: Beyond-FOV Large-Workspace Humanoid Manipulation with Omnidirectional 3D Perception
The deployment of humanoid robots for dexterous manipulation in unstructured environments remains challenging due to perceptual limitations that constrain the effective workspace. In scenarios where physical constraints prevent the robot from repositioning itself, maintaining omnidirectional awareness becomes far more critical than color or semantic information. While recent advances in visuomotor policy learning have improved manipulation capabilities, conventional RGB-D solutions suffer from narrow fields of view (FOV) and self-occlusion, requiring frequent base movements that introduce motion uncertainty and safety risks. Existing approaches to expanding perception, including active vision systems and third-view cameras, introduce mechanical complexity, calibration dependencies, and latency that hinder reliable real-time performance. In this work, we propose OmniDP, an end-to-end LiDAR-driven 3D visuomotor policy that enables robust manipulation in large workspaces. Our method processes panoramic point clouds through a Time-Aware Attention Pooling mechanism, efficiently encoding sparse 3D data while capturing temporal dependencies. This 360° perception allows the robot to interact with objects across wide areas without frequent repositioning. To support policy learning, we develop a whole-body teleoperation system for efficient data collection on full-body coordination. Extensive experiments in simulation and real-world environments show that OmniDP achieves robust performance in large-workspace and cluttered scenarios, outperforming baselines that rely on egocentric depth cameras.
comment: 8 pages, 6 figures
AIM-SLAM: Dense Monocular SLAM via Adaptive and Informative Multi-View Keyframe Prioritization with Foundation Model
Recent advances in geometric foundation models have emerged as a promising alternative for addressing the challenge of dense reconstruction in monocular visual simultaneous localization and mapping (SLAM). Although geometric foundation models enable SLAM to leverage variable input views, previous methods remain confined to two-view pairs or fixed-length inputs without sufficient consideration of geometric context for view selection. To tackle this problem, we propose AIM-SLAM, a dense monocular SLAM framework that exploits an adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from the visual geometry grounded transformer (VGGT). Specifically, we introduce the selective information- and geometric-aware multi-view adaptation (SIGMA) module, which employs voxel overlap and information gain to retrieve a candidate set of keyframes and adaptively determine its size. Furthermore, we formulate a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy. The effectiveness of AIM-SLAM is demonstrated on real-world datasets, where it achieves state-of-the-art performance in both pose estimation and dense reconstruction. Our system supports ROS integration, with code available at https://aimslam.github.io/.
comment: 8 pages
C*: A Coverage Path Planning Algorithm for Unknown Environments using Rapidly Covering Graphs
The paper presents a novel sample-based algorithm, called C*, for real-time coverage path planning (CPP) of unknown environments. C* is built upon the concept of a Rapidly Covering Graph (RCG), which is incrementally constructed during robot navigation via progressive sampling of the search space. By using efficient sampling and pruning techniques, the RCG is constructed to be a minimum-sufficient graph, where its nodes and edges form the potential waypoints and segments of the coverage trajectory, respectively. The RCG tracks the coverage progress, generates the coverage trajectory, and helps the robot escape from dead-end situations. To minimize coverage time, C* produces the desired back-and-forth coverage pattern, while adapting to the TSP-based optimal coverage of local isolated regions, called coverage holes, which are surrounded by obstacles and covered regions. It is analytically proven that C* provides complete coverage of unknown environments. The algorithmic simplicity and low computational complexity of C* make it easy to implement and suitable for real-time on-board applications. The performance of C* is validated by 1) extensive high-fidelity simulations and 2) laboratory experiments using an autonomous robot. C* yields near-optimal trajectories, and a comparative evaluation with seven existing CPP methods demonstrates significant improvements in performance in terms of coverage time, number of turns, trajectory length, and overlap ratio, while preventing the formation of coverage holes. Finally, C* is comparatively evaluated on two different CPP applications using 1) energy-constrained robots and 2) multi-robot teams.
Robustness-Aware Tool Selection and Manipulation Planning with Learned Energy-Informed Guidance ICRA
Humans subconsciously choose robust ways of selecting and using tools, for example, choosing a ladle over a flat spatula to serve meatballs. However, robustness under external disturbances remains underexplored in robotic tool-use planning. This paper presents a robustness-aware method that jointly selects tools and plans contact-rich manipulation trajectories, explicitly optimizing for robustness against disturbances. At the core of our method is an energy-based robustness metric that guides the planner toward robust manipulation behaviors. We formulate a hierarchical optimization pipeline that first identifies a tool and configuration that optimizes robustness, and then plans a corresponding manipulation trajectory that maintains robustness throughout execution. We evaluate our method across three representative tool-use tasks. Simulation and real-world results demonstrate that our method consistently selects robust tools and generates disturbance-resilient manipulation plans.
comment: IEEE International Conference on Robotics and Automation (ICRA), 2026
Safe Autonomous Lane Changing: Planning with Dynamic Risk Fields and Time-Varying Convex Space Generation
This paper presents a novel trajectory planning pipeline for complex driving scenarios like autonomous lane changing, by integrating risk-aware planning with guaranteed collision avoidance into a unified optimization framework. We first construct a dynamic risk field (DRF) that captures both the static and dynamic collision risks from surrounding vehicles. Then, we develop a rigorous strategy for generating time-varying convex feasible spaces that ensure kinematic feasibility and safety requirements. The trajectory planning problem is formulated as a finite-horizon optimal control problem and solved using a constrained iterative Linear Quadratic Regulator (iLQR) algorithm that jointly optimizes trajectory smoothness, control effort, and risk exposure while maintaining strict feasibility. Extensive simulations demonstrate that our method outperforms traditional approaches in terms of safety and efficiency, achieving collision-free trajectories with shorter lane-changing distances (28.59 m) and times (2.84 s) while maintaining smooth and comfortable acceleration patterns. In dense roundabout environments, the planner further demonstrates robust adaptability, producing larger safety margins, lower jerk, and superior curvature smoothness compared with APF-, MPC-, and RRT-based baselines. These results confirm that the integrated DRF with convex feasible space and constrained iLQR solver provides a balanced solution for safe, efficient, and comfortable trajectory generation in dynamic and interactive traffic scenarios.
Safe Model Predictive Diffusion with Shielding ICRA
Generating safe, kinodynamically feasible, and optimal trajectories for complex robotic systems is a central challenge in robotics. This paper presents Safe Model Predictive Diffusion (Safe MPD), a training-free diffusion planner that unifies a model-based diffusion framework with a safety shield to generate trajectories that are both kinodynamically feasible and safe by construction. By enforcing feasibility and safety on all samples during the denoising process, our method avoids the common pitfalls of post-processing corrections, such as computational intractability and loss of feasibility. We validate our approach on challenging non-convex planning problems, including kinematic and acceleration-controlled tractor-trailer systems. The results show that it substantially outperforms existing safety strategies in success rate and safety, while achieving sub-second computation times.
comment: 2026 IEEE International Conference on Robotics and Automation (ICRA). Project page: https://www.taekyung.me/safe-mpd
(MGS)$^2$-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS)$^2$, a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS)$^2$ achieves state-of-the-art performance, recording a Recall@1 of 97.5% on University-1652 and 97.02% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: https://github.com/GabrielLi1473/MGS-Net.
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixelwise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
comment: 11 pages, 5 figures
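The object-level aggregation mentioned in the FindAnything abstract can be sketched as follows. This is an illustrative assumption, not the paper's method: it uses simple mean pooling of per-pixel feature vectors within each segment, and the function name `aggregate_by_segment` and the toy data are hypothetical. The point is only that storing one vector per segment, rather than one per pixel, is what makes the representation compact.

```python
def aggregate_by_segment(features, segment_ids):
    """Mean-pool per-pixel feature vectors into one vector per segment.

    features:    list of equal-length feature vectors, one per pixel
    segment_ids: list of segment labels, one per pixel
    Assumption: mean pooling; the actual aggregation rule may differ.
    """
    sums = {}
    counts = {}
    for feat, seg in zip(features, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(feat)
            counts[seg] = 0
        for i, value in enumerate(feat):
            sums[seg][i] += value
        counts[seg] += 1
    return {seg: [s / counts[seg] for s in vec] for seg, vec in sums.items()}


# Three pixels with 2-D features; two belong to segment 7, one to segment 9.
pixel_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
pixel_segments = [7, 7, 9]
object_features = aggregate_by_segment(pixel_features, pixel_segments)
```

Here three per-pixel vectors collapse into two per-object vectors, so storage scales with the number of segments instead of the number of pixels.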
Bridging Simulation and Usability: A User-Friendly Framework for Scenario Generation in CARLA
Autonomous driving promises safer roads, reduced congestion, and improved mobility, yet validating these systems across diverse conditions remains a major challenge. Real-world testing is expensive, time-consuming, and sometimes unsafe, making large-scale validation impractical. In contrast, simulation environments offer a scalable and cost-effective alternative for rigorous verification and validation. A critical component of the validation process is scenario generation, which involves designing and configuring traffic scenarios to evaluate autonomous systems' responses to various events and uncertainties. However, existing scenario generation tools often require programming knowledge, limiting accessibility for non-technical users. To address this limitation, we present an interactive, no-code framework for scenario generation. Our framework features a graphical interface that enables users to create, modify, save, load, and execute scenarios without needing coding expertise or detailed simulation knowledge. Unlike script-based tools such as Scenic or ScenarioRunner, our approach lowers the barrier to entry and supports a broader user base. Central to our framework is a graph-based scenario representation that facilitates structured management, supports both manual and automated generation, and enables integration with deep learning-based scenario and behavior generation methods. In automated mode, the framework can randomly sample parameters such as actor types, behaviors, and environmental conditions, allowing the generation of diverse and realistic test datasets. By simplifying the scenario generation process, this framework supports more efficient testing workflows and increases the accessibility of simulation-based validation for researchers, engineers, and policymakers.
comment: Paper is accepted in IEEE International Automated Vehicle Validation Conference (IAVVC 2025)
Diverse and Adaptive Behavior Curriculum for Autonomous Driving: A Student-Teacher Framework with Multi-Agent RL IROS 2025
Autonomous driving faces challenges in navigating complex real-world traffic, requiring safe handling of both common and critical scenarios. Reinforcement learning (RL), a prominent method in end-to-end driving, enables agents to learn through trial and error in simulation. However, RL training often relies on rule-based traffic scenarios, limiting generalization. Additionally, current scenario generation methods focus heavily on critical scenarios, neglecting a balance with routine driving behaviors. Curriculum learning, which progressively trains agents on increasingly complex tasks, is a promising approach to improving the robustness and coverage of RL driving policies. However, existing research mainly emphasizes manually designed curricula, focusing on scenery and actor placement rather than traffic behavior dynamics. This work introduces a novel student-teacher framework for automatic curriculum learning. The teacher, a graph-based multi-agent RL component, adaptively generates traffic behaviors across diverse difficulty levels. An adaptive mechanism adjusts task difficulty based on student performance, ensuring exposure to behaviors ranging from common to critical. The student, though exchangeable, is realized as a deep RL agent with partial observability, reflecting real-world perception constraints. Results demonstrate the teacher's ability to generate diverse traffic behaviors. The student, trained with automatic curricula, outperformed agents trained on rule-based traffic, achieving higher rewards and exhibiting balanced, assertive driving.
comment: First and Second authors contributed equally; Paper accepted in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
VEGA: Electric Vehicle Navigation Agent via Physics-Informed Neural Operator and Proximal Policy Optimization IROS
We present VEGA, a vehicle-adaptive energy-aware routing system for electric vehicles (EVs) that integrates physics-informed parameter estimation with RL-based charge-aware path planning. VEGA consists of two coupled modules: (1) a physics-informed neural operator (PINO) that estimates vehicle-specific physical parameters (drag, rolling resistance, mass, motor and regenerative-braking efficiencies, and auxiliary load) from short windows of onboard speed and acceleration data; (2) a Proximal Policy Optimization (PPO) agent that navigates a charger-annotated road graph, jointly selecting routes and charging stops under state-of-charge constraints. The agent is initialized via behavior cloning from an A* teacher and fine-tuned with curriculum-guided PPO on the full U.S. highway network with Tesla Supercharger locations. On a cross-country San Francisco-to-New York route (~4,860km), VEGA produces a feasible 20-stop plan with 56.12h total trip time and minimum SoC 11.41%. Against the controlled Energy-aware A* baseline, the distance and driving-time gaps are small (-8.49km and +0.37h), while inference is >20x faster. The learned policy generalizes without retraining to road networks in France and Japan.
comment: This work has been submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication
Graph-based Online Lidar Odometry with Retrospective Map Refinement
Lidar-only odometry aims to estimate the trajectory of a mobile platform from a stream of lidar scans. Traditional scan-to-map approaches register each scan against a single, evolving map, which propagates registration errors over time. To mitigate this, we propose a multitude-of-maps approach where the current scan is registered against multiple overlapping submaps instead of a single static map. By optimizing the resulting constraints in a pose graph, our method enables not only precise estimation of the current pose but also retrospective refinement of the submaps' anchor points, which improves short-term consistency and long-term accuracy. We demonstrate that our approach achieves competitive and often superior accuracy on a variety of automotive datasets while maintaining real-time performance. Ablation studies confirm the critical role of multiple registrations and retrospective refinement of the map as core factors for our accuracy gains. Code and raw results are available on our public GitHub at https://github.com/Fusion-Goettingen/IROS_2026_Kurda_Graph.
Beyond Imitation: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models
Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and $π_{0.5}$, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on $π_{0.5}$. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
Language Conditioning Improves Accuracy of Aircraft Goal Prediction in Non-Towered Airspace
Autonomous aircraft must safely operate in non-towered airspace, where coordination relies on voice-based communication among human pilots. Safe operation requires an aircraft to predict the intent, and corresponding goal location, of other aircraft. This paper introduces a multimodal framework for aircraft goal prediction that integrates natural language understanding with spatial reasoning to improve autonomous decision-making in such environments. We leverage automatic speech recognition and large language models to transcribe and interpret pilot radio calls, identify aircraft, and extract discrete intent labels. These intent labels are fused with observed trajectories to condition a temporal convolutional network and Gaussian mixture model for probabilistic goal prediction. Our method significantly reduces goal prediction error compared to baselines that rely solely on motion history, demonstrating that language-conditioned prediction increases prediction accuracy. Experiments on a real-world dataset from a non-towered airport validate the approach and highlight its potential to enable socially aware, language-conditioned robotic motion planning.
comment: The last two authors advised equally. Accepted to the 2026 IEEE International Conference on Robotics and Automation. 8 pages, 6 figures
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
We need to trust robots that use often opaque AI methods. They need to explain themselves to us, and we need to trust their explanation. In this regard, explainability plays a critical role in trustworthy autonomous decision-making to foster transparency and acceptance among end users, especially in complex autonomous driving. Recent advancements in Multi-Modal Large Language models (MLLMs) have shown promising potential in enhancing the explainability as a driving agent by producing control predictions along with natural language explanations. However, severe data scarcity due to expensive annotation costs and significant domain gaps between different datasets makes the development of a robust and generalisable system an extremely challenging task. Moreover, the prohibitively expensive training requirements of MLLM and the unsolved problem of catastrophic forgetting further limit their generalisability post-deployment. To address these challenges, we present RAG-Driver, a novel retrieval-augmented multi-modal large language model that leverages in-context learning for high-performance, explainable, and generalisable autonomous driving. By grounding in retrieved expert demonstration, we empirically validate that RAG-Driver achieves state-of-the-art performance in producing driving action explanations, justifications, and control signal prediction. More importantly, it exhibits exceptional zero-shot generalisation capabilities to unseen environments without further training endeavours.
comment: 14 pages, 6 figures
RoboPocket: Improve Robot Policies Instantly with Your Phone
Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments with a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.
comment: Project page: https://robo-pocket.github.io
Symmetry-Breaking in Multi-Agent Navigation: Winding Number-Aware MPC with a Learned Topological Strategy
In distributed multi-agent navigation without explicit communication, agents can fall into symmetry-induced deadlocks because each agent must autonomously decide how to pass others. To address this problem, we propose WNumMPC, a hierarchical navigation method that quantifies cooperative symmetry-breaking strategies via a topological invariant, the winding number, and learns such strategies through reinforcement learning. The learning-based Planner outputs continuous-valued signed target winding numbers and dynamic importance weights to prioritize critical interactions in dense crossings. Then, the model-based Controller generates collision-free and efficient motions based on the strategy and weights provided by the Planner. Simulation and real-world robot experiments indicate that WNumMPC effectively avoids deadlocks and collisions and achieves better performance than the baselines, particularly in dense and symmetry-prone scenarios. These experiments also suggest that explicitly leveraging winding numbers yields robust sim-to-real transfer with minimal performance degradation. The code for the experiments is available at https://github.com/omron-sinicx/WNumMPC.
comment: 12 pages, 7 figures
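The winding number that WNumMPC builds on can be illustrated with a minimal sketch. Assuming two agents whose planar positions are sampled at the same time steps, the code below accumulates the signed change in the angle of their relative position vector and divides by 2π. The function name and the sample trajectories are hypothetical and are not taken from the paper's released code.

```python
import math


def winding_number(traj_a, traj_b):
    """Signed number of turns that agent B makes around agent A.

    traj_a, traj_b: equal-length lists of (x, y) positions sampled at the
    same time steps. Positive values mean counterclockwise rotation of the
    relative position vector B - A.
    """
    total = 0.0
    prev_angle = None
    for (ax, ay), (bx, by) in zip(traj_a, traj_b):
        angle = math.atan2(by - ay, bx - ax)
        if prev_angle is not None:
            # Wrap the increment into [-pi, pi) so that crossing the
            # atan2 branch cut does not register as a full turn.
            delta = (angle - prev_angle + math.pi) % (2 * math.pi) - math.pi
            total += delta
        prev_angle = angle
    return total / (2 * math.pi)


# Two agents swap places; B passes on one side or the other of A.
path_a = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
pass_left = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
pass_right = [(1.0, 0.0), (0.0, -1.0), (-1.0, 0.0)]
w_left = winding_number(path_a, pass_left)
w_right = winding_number(path_a, pass_right)
```

The two passing sides yield winding numbers of opposite sign, which is the property that lets a signed target value encode which way a pair of agents should pass each other.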
APEX: Learning Adaptive High-Platform Traversal for Humanoid Robots
Humanoid locomotion has advanced rapidly with deep reinforcement learning (DRL), enabling robust feet-based traversal over uneven terrain. Yet platforms beyond leg length remain largely out of reach because current RL training paradigms often converge to jumping-like solutions that are high-impact, torque-limited, and unsafe for real-world deployment. To address this gap, we propose APEX, a system for perceptive, climbing-based high-platform traversal that composes terrain-conditioned behaviors: climb-up and climb-down at vertical edges, walking or crawling on the platform, and stand-up and lie-down for posture reconfiguration. Central to our approach is a generalized ratchet progress reward for learning contact-rich, goal-reaching maneuvers. It tracks the best-so-far task progress and penalizes non-improving steps, providing dense yet velocity-free supervision that enables efficient exploration under strong safety regularization. Based on this formulation, we train LiDAR-based full-body maneuver policies and reduce the sim-to-real perception gap through a dual strategy: modeling mapping artifacts during training and applying filtering and inpainting to elevation maps during deployment. Finally, we distill all six skills into a single policy that autonomously selects behaviors and transitions based on local geometry and commands. Experiments on a 29-DoF Unitree G1 humanoid demonstrate zero-shot sim-to-real traversal of 0.8 meter platforms (approximately 114% of leg length), with robust adaptation to platform height and initial pose, as well as smooth and stable multi-skill transitions.
comment: Project Website: https://apex-humanoid.github.io/
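The ratchet progress reward described in the APEX abstract can be sketched roughly as follows. The abstract only states that the reward tracks the best-so-far task progress and penalizes non-improving steps; the class name, the constant penalty, and the choice to reward the improvement over the previous best are assumptions made for illustration.

```python
class RatchetProgressReward:
    """Hypothetical ratchet-style reward on a scalar progress measure."""

    def __init__(self, penalty=0.01):
        self.penalty = penalty  # assumed constant cost for a non-improving step
        self.best = None

    def reset(self, initial_progress):
        """Start an episode from the given progress value."""
        self.best = initial_progress

    def __call__(self, progress):
        """Return the reward for the current step's progress value."""
        if progress > self.best:
            # Reward only the amount by which the best-so-far value improved,
            # then move the ratchet forward.
            reward = progress - self.best
            self.best = progress
            return reward
        # No improvement over the best-so-far value: apply the penalty.
        return -self.penalty


reward_fn = RatchetProgressReward(penalty=0.1)
reward_fn.reset(0.0)
rewards = [reward_fn(p) for p in (0.5, 0.3, 0.6)]
```

Because the reward depends on progress values rather than on velocity, regressing and then re-covering the same ground earns nothing, which is one way to read the "dense yet velocity-free" description.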
Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion
Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/
Bi-AQUA: Bilateral Control-Based Imitation Learning for Underwater Robot Arms via Lighting-Aware Action Chunking with Transformers
Underwater robotic manipulation remains challenging because lighting variation, color attenuation, scattering, and reduced visibility can severely degrade visuomotor policies. We present Bi-AQUA, the first underwater bilateral control-based imitation learning framework for robot arms that explicitly models lighting within the policy. Bi-AQUA integrates transformer-based bilateral action chunking with a hierarchical lighting-aware design composed of a label-free Lighting Encoder, FiLM-based visual feature modulation, and a lighting token for action conditioning. This design enables adaptation to static and dynamically changing underwater illumination while preserving the force-sensitive advantages of bilateral control, which are particularly important in long-horizon and contact-rich manipulation. Real-world experiments on underwater pick-and-place, drawer closing, and peg extraction tasks show that Bi-AQUA outperforms a bilateral baseline without lighting modeling and achieves robust performance under seen, unseen, and changing lighting conditions. These results highlight the importance of combining explicit lighting modeling with force-aware bilateral imitation learning for reliable underwater manipulation. For additional material, please check: https://mertcookimg.github.io/bi-aqua
Safe-SAGE: Social-Semantic Adaptive Guidance for Safe Engagement through Laplace-Modulated Poisson Safety Functions
Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi-sensor point clouds with vision-based instance segmentation and persistent object tracking to maintain up-to-date semantics beyond the camera's field of view. A multi-layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi-agent passing norms for different obstacles in the environment. Our framework enables legged robots to safely navigate semantically rich, dynamic environments with context-dependent safety margins.
comment: 8 pages
Sample-Based Hybrid Mode Control: Asymptotically Optimal Switching of Algorithmic and Non-Differentiable Control Modes
This paper investigates a sample-based solution to the hybrid mode control problem across non-differentiable and algorithmic hybrid modes. Our approach reasons about a set of hybrid control modes as an integer-based optimization problem where we select what mode to apply, when to switch to another mode, and the duration for which we are in a given control mode. A sample-based variation is derived to efficiently search the integer domain for optimal solutions. We find our formulation yields strong performance guarantees that can be applied to a number of robotics-related tasks. In addition, our approach is able to synthesize complex algorithms and policies to compound behaviors and achieve challenging tasks. Last, we demonstrate the effectiveness of our approach in real-world robotic examples that require reactive switching between long-term planning and high-frequency control.
Learning Robust Control Policies for Inverted Pose on Miniature Blimp Robots ICRA 2026
The ability to achieve and maintain inverted poses is essential for unlocking the full agility of miniature blimp robots (MBRs). However, developing reliable inverted control strategies for MBRs remains challenging due to their complex and underactuated dynamics. To address this challenge, we propose a novel framework that enables robust control policy learning for inverted pose on MBRs. The proposed framework consists of three core stages. First, a high-fidelity three-dimensional (3D) simulation environment is constructed and calibrated using real-world MBR motion data. Second, a robust inverted control policy is trained in simulation using a modified Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm combined with a domain randomization strategy. Third, a mapping layer is designed to bridge the sim-to-real gap and facilitate real-world deployment of the learned policy. Comprehensive evaluations in the simulation environment demonstrate that the learned policy achieves a higher success rate compared to the energy-shaping controller. Furthermore, experimental results confirm that the learned policy with a mapping layer enables an MBR to achieve and maintain a fully inverted pose in real-world settings.
comment: Accepted in ICRA 2026
EchoVLA: Synergistic Declarative Memory for VLA-Driven Mobile Manipulation
Recent progress in Vision-Language-Action (VLA) models has enabled embodied agents to interpret multimodal instructions and perform complex tasks. However, existing VLAs are mostly confined to short-horizon, table-top manipulation, lacking the memory and reasoning capability required for mobile manipulation, where agents must coordinate navigation and manipulation under changing spatial contexts. In this work, we present EchoVLA, a memory-aware VLA model for mobile manipulation. EchoVLA incorporates a synergistic declarative memory inspired by the human brain, consisting of a scene memory that maintains a collection of spatial-semantic maps and an episodic memory that stores task-level experiences with multimodal contextual features. The two memories are individually stored, updated, and retrieved based on current observations, task history, and instructions, and their retrieved representations are fused via coarse- and fine-grained attention to guide base-arm diffusion policies. To support large-scale training, we further introduce MoMani, an automated benchmark that generates expert-level trajectories through multimodal large language model (MLLM)-guided planning and feedback-driven refinement, supplemented with real-robot demonstrations. Comprehensive simulated and real-world results demonstrate that EchoVLA substantially improves overall performance, e.g., it achieves the highest success rates of 0.52 on manipulation/navigation tasks and 0.31 on mobile manipulation tasks in simulation, exceeding the strong baseline $π_{0.5}$ by +0.20 and +0.11, respectively.
AURASeg: Attention-guided Upsampling with Residual-Assistive Boundary Refinement for Onboard Robot Drivable-Area Segmentation
Free space ground segmentation is essential to navigate autonomous robots, recognize drivable zones, and traverse efficiently. Fine-grained features remain challenging for existing segmentation models, particularly for robots in indoor, outdoor and road-scene environments. These difficulties arise from ineffective multi-scale processing, sub-optimal boundary refinement, and limited feature representation. To address this, we propose Attention-guided Upsampling with Residual-Assistive Boundary Refinement (AURASeg), a ground-plane drivable area segmentation framework designed to improve boundary precision while preserving strong region accuracy under edge-deployment constraints. Building on a ResNet backbone, we propose (i) a Residual Boundary Refinement Module (RBRM) that enhances edge delineation through boundary-assistive feature refinement, and (ii) Attention Progressive Upsampling Decoder (APUD) blocks that fuse multi-level features using residual fusion of attention modules; additionally, we integrate (iii) a lightweight ASPPLite module to capture multi-scale context with minimal overhead. Extensive experiments on CARL-D, the Ground Mobile Robot Perception (GMRPD) dataset, and a custom Gazebo indoor dataset show that AURASeg consistently outperforms strong baselines, with notable gains in boundary metrics. Finally, we demonstrate on-device deployment on a Jetson Nano-powered Kobuki TurtleBot, validating practical edge-inference feasibility. Code is omitted for anonymity and will be released upon acceptance.
comment: 6 pages, 4 figures, 4 tables
Integrated Hierarchical Decision-Making in Inverse Kinematic Planning and Control
This work presents a novel and efficient non-linear programming framework that tightly integrates hierarchical decision-making with inverse kinematic planning and control. Decision-making plays a central role in many aspects of robotics, from sparse inverse kinematic control with a minimal number of joints, to inverse kinematic planning while simultaneously selecting a discrete end-effector location from multiple candidates. Current approaches often rely on heavy computations using mixed-integer non-linear programming, separate decision-making from inverse kinematics (sometimes approximated by reachability methods), or employ efficient but less accurate $\ell_1$-norm formulations of linear sparse programming, without addressing the underlying non-linear problem formulations. In contrast, the proposed sparse hierarchical non-linear programming solver is efficient, versatile, and accurate by exploiting sparse hierarchical structure and leveraging the rarely used $\ell_0$-norm in robotics. The solver efficiently addresses complex non-linear hierarchical decision-making problems, such as inverse kinematic planning with simultaneous prioritized selection of end-effector locations from a large set of candidates, or inverse kinematic control with simultaneous selection of bi-manual grasp locations on a randomly rotated box.
Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation ICRA
Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
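One common way to realize an uncertainty-aware fusion of a prior with an online estimate is inverse-variance weighting of two Gaussians, sketched below. The Phys2Real abstract does not give the fusion rule, so this is an assumed stand-in rather than the authors' method; the function names and the example numbers are hypothetical.

```python
from statistics import fmean, pvariance


def ensemble_estimate(predictions):
    """Mean and variance of an ensemble's predictions for one parameter."""
    return fmean(predictions), pvariance(predictions)


def fuse_gaussians(mu_prior, var_prior, mu_obs, var_obs):
    """Inverse-variance (product-of-Gaussians) fusion of two estimates.

    Both variances must be strictly positive. The estimate with the smaller
    variance receives the larger weight.
    """
    w_prior = 1.0 / var_prior
    w_obs = 1.0 / var_obs
    var = 1.0 / (w_prior + w_obs)
    mu = var * (w_prior * mu_prior + w_obs * mu_obs)
    return mu, var


# Hypothetical example: a broad prior on a 1-D center-of-mass offset (metres)
# and a tighter online estimate derived from an ensemble of predictors.
obs_mu, obs_var = ensemble_estimate([0.040, 0.050, 0.060])
fused_mu, fused_var = fuse_gaussians(0.0, 0.01, obs_mu, obs_var)
```

With these numbers the fused mean sits close to the ensemble mean because the ensemble variance is much smaller than the prior variance, and the fused variance is smaller than either input variance.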
XR-DT: Extended Reality-Enhanced Digital Twin for Safe Motion Planning via Human-Aware Model Predictive Path Integral Control
As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial progress has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots' inferences and how robots plan safe and efficient trajectories based on predicted human behaviors. To address these challenges, this paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for mobile robots, which bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates augmented-, virtual-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable XR devices. Within this framework, we design a novel Human-Aware Model Predictive Path Integral (HA-MPPI) control model, an MPPI-based motion planner that incorporates ATLAS (Attention-based Trajectory Learning with Anticipatory Sensing), a multi-modal Transformer model designed for egocentric human trajectory prediction via XR headsets. Extensive real-world experimental results demonstrate accurate human trajectory prediction, and safe and efficient robot navigation, validating the HA-MPPI's effectiveness within the XR-DT framework. By embedding human behavior, environmental dynamics, and robot navigation into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.
comment: 8 pages, 6 figures, 3 tables
Real-Time Learning of Predictive Dynamic Obstacle Models for Robotic Motion Planning ICRA
Autonomous systems often must predict the motions of nearby agents from partial and noisy data. This paper asks and answers the question: "can we learn, in real-time, a nonlinear predictive model of another agent's motions?" Our online framework denoises and forecasts such dynamics using a modified sliding-window Hankel Dynamic Mode Decomposition (Hankel-DMD). Partial noisy measurements are embedded into a Hankel matrix, while an associated Page matrix enables singular-value hard thresholding (SVHT) to estimate the effective rank. A Cadzow projection enforces structured low-rank consistency, yielding a denoised trajectory and local noise variance estimates. From this representation, a time-varying Hankel-DMD lifted linear predictor is constructed for multi-step forecasts. The residual analysis provides variance-tracking signals that can support downstream estimators and risk-aware planning. We validate the approach in simulation under Gaussian and heavy-tailed noise, and experimentally on a dynamic crane testbed. Results show that the method achieves stable variance-aware denoising and short-horizon prediction suitable for integration into real-time control frameworks.
comment: 10 pages, 6 figures, submitted to IEEE International Conference on Robotics and Automation (ICRA) 2025
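The two matrix embeddings named in the abstract can be illustrated with a short sketch. A Hankel matrix stacks overlapping windows of a measurement sequence, while a Page matrix stacks non-overlapping segments, so each sample appears only once. The function names and the scalar-signal setting are assumptions for illustration; the thresholding, Cadzow projection, and DMD steps are not shown.

```python
def hankel_matrix(signal, rows):
    """Hankel embedding: entry (i, j) is signal[i + j] (overlapping windows)."""
    cols = len(signal) - rows + 1
    if cols < 1:
        raise ValueError("signal is shorter than the requested number of rows")
    return [[signal[i + j] for j in range(cols)] for i in range(rows)]


def page_matrix(signal, rows):
    """Page embedding: column j holds the j-th non-overlapping segment.

    Trailing samples that do not fill a complete column are dropped.
    """
    cols = len(signal) // rows
    if cols < 1:
        raise ValueError("signal is shorter than the requested number of rows")
    return [[signal[j * rows + i] for j in range(cols)] for i in range(rows)]


H = hankel_matrix([1, 2, 3, 4, 5], rows=3)
P = page_matrix([1, 2, 3, 4, 5, 6], rows=3)
```

Here `H` has constant anti-diagonals, and `P` places samples 1 to 3 in the first column and 4 to 6 in the second. Since no sample is repeated in the Page matrix, independent measurement noise stays independent across its entries, which is a usual motivation for applying singular-value thresholding there.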
Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning
Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7% in planning and execution time in simulation, and 72.6% in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.
ExpReS-VLA: Specializing Vision-Language-Action Models Through Experience Replay and Retrieval ICRA
Vision-Language-Action (VLA) models like OpenVLA demonstrate impressive zero-shot generalization across robotic manipulation tasks but struggle to adapt to specific deployment environments where consistent high performance on a limited set of tasks is more valuable than broad generalization. We present EXPerience replayed, REtrieval augmented, Specialized VLA (ExpReS-VLA), a method that enables rapid on-device adaptation of pre-trained VLAs to target domains while preventing catastrophic forgetting through compressed experience replay and retrieval-augmented generation. Our approach maintains a memory-efficient buffer by storing extracted embeddings from OpenVLA's frozen vision backbone, reducing storage requirements by 97% compared to raw image-action pairs. During deployment, ExpReS-VLA retrieves the $k$ most similar past experiences using cosine similarity to augment training batches, while a prioritized experience replay buffer preserves recently successful trajectories. To leverage failed attempts, we introduce Thresholded Hybrid Contrastive Loss (THCL), enabling the model to learn from both successful and unsuccessful demonstrations. Experiments on the LIBERO benchmark show improvements from 82.6% to 93.1% on spatial reasoning and 61% to 72.3% on long-horizon tasks over base OpenVLA, with gains across architectures including $π_0$ (+3.2 points) and OpenVLA-OFT (+1.7 points). Physical robot experiments across five tasks demonstrate 98% success on both in-distribution and out-of-distribution conditions, improving from 84.7% and 32% respectively for naive fine-tuning. Adaptation completes in 31 seconds using 12 demonstrations on a single RTX 5090.
comment: 8 pages, 4 figures, 3 tables, accepted to International Conference on Robotics and Automation (ICRA) 2026
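The retrieval step mentioned in the abstract above (fetching the $k$ most similar stored embeddings by cosine similarity) can be illustrated with a small sketch. The function name, the toy two-dimensional embeddings, and the flat in-memory buffer are assumptions made for illustration; the paper's buffer layout and embedding dimensions are not given here.

```python
import numpy as np

def top_k_cosine(query, memory, k):
    """Return (indices, similarities) of the k rows of `memory` closest to `query`."""
    query = np.asarray(query, dtype=float)
    memory = np.asarray(memory, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero vectors
    q = query / (np.linalg.norm(query) + eps)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + eps)
    scores = m @ q                      # cosine similarity of every row to the query
    order = np.argsort(-scores)[:k]     # indices sorted by descending similarity
    return order, scores[order]

# Toy buffer of stored embeddings (hypothetical values).
memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0],
                   [-1.0, 0.0]])
idx, sims = top_k_cosine([2.0, 0.1], memory, k=2)
```

For large buffers, `np.argpartition` or an approximate nearest-neighbor index would avoid the full sort; the full sort is kept here for clarity.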
Assigning Multi-Robot Tasks to Multitasking Robots
One simplifying assumption in existing and well-performing task allocation methods is that the robots are single-tasking: each robot operates on a single task at any given time. While this assumption is harmless to make in some situations, it can be inefficient or even infeasible in others. In this paper, we consider assigning multi-robot tasks to multitasking robots. The key contribution is a novel task allocation framework that incorporates the consideration of physical constraints introduced by multitasking. This is in contrast to the existing work where such constraints are largely ignored. After formulating the problem, we propose a compilation to weighted MAX-SAT, which allows us to leverage existing solvers for a solution. A more efficient greedy heuristic is then introduced. For evaluation, we first compare our methods with a modern baseline that is efficient for single-tasking robots to validate the benefits of multitasking in synthetic domains. Then, using a site-clearing scenario in simulation, we further illustrate the complex task interaction considered by the multitasking robots in our approach to demonstrate its performance. Finally, we demonstrate a physical experiment to show how multitasking enabled by our approach can benefit task efficiency in a realistic setting.
Preference-Conditioned Multi-Objective RL for Integrated Command Tracking and Force Compliance in Humanoid Locomotion
Humanoid locomotion requires not only accurate command tracking for navigation but also compliant responses to external forces during human interaction. Despite significant progress, existing RL approaches mainly emphasize robustness, yielding policies that resist external forces but lack compliance, which is particularly challenging for inherently unstable humanoids. In this work, we address this by formulating humanoid locomotion as a multi-objective optimization problem that balances command tracking and external force compliance. We introduce a preference-conditioned multi-objective RL (MORL) framework that enables a single omnidirectional locomotion policy to trade off between command following and force compliance via a user-specified preference input. External forces are modeled via a velocity-resistance factor for consistent reward design, and training leverages an encoder-decoder structure that infers task-relevant privileged features from deployable observations. We validate our approach in both simulation and real-world experiments on a humanoid robot. Experimental results in simulation and on hardware show that the framework trains stably and enables deployable preference-conditioned humanoid locomotion.
Diffusion-SAFE: Diffusion-Native Human-to-Robot Driving Handover for Shared Autonomy
Shared autonomy in driving requires anticipating human behavior, flagging risk before it becomes unavoidable, and transferring control safely and smoothly. We propose Diffusion-SAFE, a closed-loop framework built on two diffusion models: an evaluator that predicts multimodal human-intent action sequences for probabilistic risk detection, and a safety-guided copilot that steers its denoising process toward safe regions using the gradient of a map-based safety certificate. When risk is detected, control is transferred through partial diffusion: the human plan is forward-noised to an intermediate level and denoised by the safety-guided copilot. The forward-diffusion ratio $ρ$ acts as a continuous takeover knob: small $ρ$ keeps the output close to human intent, while increasing $ρ$ shifts authority toward the copilot, avoiding the mixed-unsafe pitfall of action-level blending. Unlike methods relying on hand-crafted score functions, our diffusion formulation supports both safety evaluation and plan generation directly from demonstrations. We evaluate Diffusion-SAFE in simulation and on a real ROS-based race car, achieving 93.0%/87.0% (sim/real) handover success rates with smooth transitions.
PAD-TRO: Projection-Augmented Diffusion for Direct Trajectory Optimization
Recently, diffusion models have gained popularity and attention in trajectory optimization due to their capability of modeling multi-modal probability distributions. However, addressing nonlinear equality constraints, i.e., dynamic feasibility, remains a great challenge in diffusion-based trajectory optimization. Recent diffusion-based trajectory optimization frameworks rely on a single-shooting style approach where the denoised control sequence is applied to forward propagate the dynamical system, which cannot explicitly enforce constraints on the states and frequently leads to sub-optimal solutions. In this work, we propose a novel direct trajectory optimization approach via model-based diffusion, which directly generates a sequence of states. To ensure dynamic feasibility, we propose a gradient-free projection mechanism that is incorporated into the reverse diffusion process. Our results show that, compared to a recent state-of-the-art baseline, our approach leads to zero dynamic feasibility error and approximately 4x higher success rate in a quadrotor waypoint navigation scenario involving dense static obstacles.
comment: Final manuscript. Accepted for publication at the 2026 American Control Conference
ViLAM: Distilling Vision-Language Reasoning into Attention Maps for Social Robot Navigation
We introduce ViLAM, a novel method for distilling vision-language reasoning from large Vision-Language Models (VLMs) into spatial attention maps for socially compliant robot navigation. Unlike traditional methods that rely on expert demonstrations or human-annotated datasets, ViLAM performs knowledge distillation and fine-tuning at the intermediate layer representation (attention) level by aligning attention maps from a pretrained vision-action model with socially guided attention maps derived from a large VLM. These distilled attention maps highlight key navigational regions in a scene and serve as socially informed spatial cost maps for motion planning. To achieve this, we introduce a novel attention-level distillation loss that fuses knowledge from both sources, generating augmented attention maps with enhanced social awareness. These refined attention maps are then used as a traversability costmap within a socially aware local planner for navigation. We validate our approach through real-world experiments on a Husky wheeled robot, and demonstrate 14.2% - 50% improvements in success rate over existing methods.
Multiagent Systems
Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows
Large language models (LLMs) can now translate a researcher's plain-language goal into executable computation, yet scientific workflows demand determinism, provenance, and governance that are difficult to guarantee when an LLM decides what runs. Semi-structured interviews with 18 experts across 10 industrial R&D stakeholders surface 2 competing requirements--deterministic, constrained execution and conversational flexibility without workflow rigidity--together with boundary properties (human-in-the-loop control and transparency) that any resolution must satisfy. We propose schema-gated orchestration as the resolving principle: the schema becomes a mandatory execution boundary at the composed-workflow level, so that nothing runs unless the complete action--including cross-step dependencies--validates against a machine-checkable specification. We operationalize the 2 requirements as execution determinism (ED) and conversational flexibility (CF), and use these axes to review 20 systems spanning 5 architectural groups along a validation-scope spectrum. Scores are assigned via a multi-model protocol--15 independent sessions across 3 LLM families--yielding substantial-to-near-perfect inter-model agreement (Krippendorff's $α$=0.80 for ED and $α$=0.98 for CF), demonstrating that multi-model LLM scoring can serve as a reusable alternative to human expert panels for architectural assessment. The resulting landscape reveals an empirical Pareto front--no reviewed system achieves both high flexibility and high determinism--but a convergence zone emerges between the generative and workflow-centric extremes. We argue that a schema-gated architecture, separating conversational from execution authority, is positioned to decouple this trade-off, and distill 3 operational principles--clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating--to guide adoption.
Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI
Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one-way dispatch signals and price alerts that offer little possibility for informed decision-making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and prosumers interact through bidirectional natural language, enabled through agentic AI. A two-tier multi-agent architecture is developed in which an aggregator agent dispatches flexibility requests and a prosumer Home Energy Management System (HEMS) assesses deliverability and cost-benefit by calling an optimization-based tool. CDR also enables prosumer-initiated upstream communication, where changes in preferences can reach the aggregator directly. Proof-of-concept evaluation shows that interactions complete in under 12 seconds. The architecture illustrates how agentic AI can bridge the aggregator-prosumer coordination gap, providing the scalability of automated DR while preserving the transparency, explainability, and user agency necessary for sustained prosumer participation. All system components, including agent prompts, orchestration logic, and simulation interfaces, are released as open source to enable reproducibility and further development.
comment: 6 pages, 2 figures. Code available at: https://github.com/RedaElMakroum/cdr
MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing ACL 2026
Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.
comment: Submitted to ACL 2026 Demo Track. 10 pages, 6 figures. Code and documentation are available at: https://github.com/BUPT-GAMMA/MASFactory
Impact of arbitrage between leveraged ETF and futures on market liquidity during market crash
Leveraged ETFs (L-ETFs) are exchange-traded funds that achieve price movements several times greater than an index by holding index-linked futures such as Nikkei Stock Average Index futures. It is known that when the price of an L-ETF falls, the L-ETF uses the liquidity of futures to limit the decline through arbitrage trading. Conversely, when the price of a futures contract falls, the futures contract uses the liquidity of the L-ETF to limit its decline. However, the impact of arbitrage trading on the liquidity of these markets has been little studied. Therefore, the present study used artificial market simulations to investigate how the liquidity (Volume, SellDepth, BuyDepth, Tightness) of both markets changes when prices plummet in either (i.e., the L-ETF or futures market), depending on the presence or absence of arbitrage trading. As a result, it was found that when erroneous orders occur in the L-ETF market, the existence of arbitrage trading causes liquidity to be supplied from the futures market to the L-ETF market in terms of SellDepth and Tightness. When erroneous orders occur in the futures market, the existence of arbitrage trading causes liquidity to be supplied from the L-ETF market to the futures market in terms of SellDepth and Tightness, and liquidity to be supplied from the futures market to the L-ETF market in terms of Volume. We also analyzed the internal market mechanisms that led to these results.
Evaluating LLM Alignment With Human Trust Models
Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding decision-making in both human interactions and multi-agent systems. Despite its significance, there is limited understanding of how large language models (LLMs) internally conceptualize and reason about trust. This work presents a white-box analysis of trust representation in EleutherAI/gpt-j-6B, using contrastive prompting to generate embedding vectors within the activation space of the LLM for dyadic trust and related interpersonal relationship attributes. We first identified trust-related concepts from five established human trust models. We then determined a threshold for significant conceptual alignment by computing pairwise cosine similarities across 60 general emotional concepts. Then we measured the cosine similarities between the LLM's internal representation of trust and the derived trust-related concepts. Our results show that the internal trust representation of EleutherAI/gpt-j-6B aligns most closely with the Castelfranchi socio-cognitive model, followed by the Marsh Model. These findings indicate that LLMs encode socio-cognitive constructs in their activation space in ways that support meaningful comparative analyses, inform theories of social cognition, and support the design of human-AI collaborative systems.
comment: This paper will appear in the post-proceedings of ICAART 2026
The Coordination Gap: Alternation Metrics for Temporal Dynamics in Multi-Agent Battle of the Exes
Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind -- they cannot distinguish structured alternation from monopolistic or random access patterns -- and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation -- a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
comment: 40 pages, 5 figures, 4 tables. Submitted to Mathematical Social Sciences
Evaluating Multi-Agent LLM Architectures for Rare Disease Diagnosis
While large language models are capable diagnostic tools, the impact of multi-agent topology on diagnostic accuracy remains underexplored. This study evaluates four agent topologies, Control (single agent), Hierarchical, Adversarial, and Collaborative, across 302 cases spanning 33 rare disease categories. We introduce a Reasoning Gap metric to quantify the difference between internal knowledge retrieval and final diagnostic accuracy. Results indicate that the Hierarchical topology (50.0% accuracy) marginally outperforms Collaborative (49.8%) and Control (48.5%) configurations. In contrast, the Adversarial model significantly degrades performance (27.3%), exhibiting a massive Reasoning Gap where valid diagnoses were rejected due to artificial doubt. Across all architectures, performance was strongest in Allergic diseases and Toxic Effects categories but poorest in Cardiac Malformation and Respiratory cases. Critically, while the single-agent baseline was generally robust, all multi-agent systems, including the Adversarial model, yielded superior accuracy in Bone and Thoracic disease categories. These findings demonstrate that increasing system complexity does not guarantee better reasoning, supporting a shift toward dynamic topology selection.
MARLIN: Multi-Agent Reinforcement Learning with Murmuration Intelligence and LLM Guidance for Reservoir Management AAMAS'26
As climate change intensifies extreme weather events, water disasters pose growing threats to global communities, making adaptive reservoir management critical for protecting vulnerable populations and ensuring water security. Modern water resource management faces unprecedented challenges from cascading uncertainties propagating through interconnected reservoir networks. These uncertainties, rooted in physical water transfer losses and environmental variability, make precise control difficult. For example, sending 10 tons downstream may yield only 8-12 tons due to evaporation and seepage. Traditional centralized optimization approaches suffer from exponential computational complexity and cannot effectively handle such real-world uncertainties, while existing multi-agent reinforcement learning (MARL) methods fail to achieve effective coordination under uncertainty. To address these challenges, we present MARLIN, a decentralized reservoir management framework inspired by starling murmuration intelligence. Integrating bio-inspired alignment, separation, and cohesion rules with MARL, MARLIN enables individual reservoirs to make local decisions while achieving emergent global coordination. In addition, an LLM provides real-time reward shaping signals, guiding agents to adapt to environmental changes and human-defined preferences. Experiments on USGS data show that MARLIN improves uncertainty handling by 23%, cuts computation by 35%, and accelerates flood response by 68%, exhibiting super-linear coordination, with complexity scaling 5.4x from 400 to 10,000 nodes. These results demonstrate MARLIN's potential for disaster prevention and protecting communities through intelligent, scalable water resource management.
comment: AAMAS'26
A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature
High-quality chemical databases are the foundation for fully expediting AI-powered chemical research. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction. It utilizes the MLLM's strong reasoning capability to understand the structure of diverse chemical graphics and decompose the extraction task into sub-tasks. It then coordinates a set of specialized agents, each combining the capabilities of the MLLM with the precise, domain-specific strengths of dedicated tools and web services, to solve the sub-tasks accurately and integrate the results into a unified output. Our system achieved an F1 score of 76.27% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature, surpassing the previous state-of-the-art model (F1 score of 39.13%) by a significant margin. Additionally, it demonstrated versatile applicability in a range of other information extraction tasks, including molecular image recognition, reaction image parsing, named entity recognition and text-based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI-driven chemical research.
Symmetry-Breaking in Multi-Agent Navigation: Winding Number-Aware MPC with a Learned Topological Strategy
In distributed multi-agent navigation without explicit communication, agents can fall into symmetry-induced deadlocks because each agent must autonomously decide how to pass others. To address this problem, we propose WNumMPC, a hierarchical navigation method that quantifies cooperative symmetry-breaking strategies via a topological invariant, the winding number, and learns such strategies through reinforcement learning. The learning-based Planner outputs continuous-valued signed target winding numbers and dynamic importance weights to prioritize critical interactions in dense crossings. Then, the model-based Controller generates collision-free and efficient motions based on the strategy and weights provided by the Planner. Simulation and real-world robot experiments indicate that WNumMPC effectively avoids deadlocks and collisions and achieves better performance than the baselines, particularly in dense and symmetry-prone scenarios. These experiments also suggest that explicitly leveraging winding numbers yields robust sim-to-real transfer with minimal performance degradation. The code for the experiments is available at https://github.com/omron-sinicx/WNumMPC.
comment: 12 pages, 7 figures
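To make the topological quantity in the abstract above concrete, the sketch below computes a pairwise winding number as the accumulated rotation of the relative position vector between two planar trajectories, divided by 2π. This is a generic definition chosen for illustration; the exact formulation used by WNumMPC may differ, and the function name and toy trajectories are hypothetical.

```python
import numpy as np

def winding_number(traj_a, traj_b):
    """Signed number of turns of agent b around agent a over two (T, 2) trajectories."""
    rel = np.asarray(traj_b, dtype=float) - np.asarray(traj_a, dtype=float)
    angles = np.arctan2(rel[:, 1], rel[:, 0])
    steps = np.diff(angles)
    # Wrap each increment to [-pi, pi) so crossing the branch cut adds no spurious turn.
    steps = (steps + np.pi) % (2.0 * np.pi) - np.pi
    return steps.sum() / (2.0 * np.pi)

# Case 1: b circles a stationary a once, counter-clockwise.
theta = np.linspace(0.0, 2.0 * np.pi, 101)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
still = np.zeros_like(circle)
w_circle = winding_number(still, circle)

# Case 2: two agents pass head-on with a small lateral offset (toy values).
s = np.linspace(0.0, 1.0, 50)
a_path = np.column_stack([-1.0 + 2.0 * s, np.zeros_like(s)])
b_left = np.column_stack([1.0 - 2.0 * s, np.full_like(s, 0.2)])
b_right = np.column_stack([1.0 - 2.0 * s, np.full_like(s, -0.2)])
w_left = winding_number(a_path, b_left)
w_right = winding_number(a_path, b_right)
```

In the head-on case the two passing sides produce winding numbers of equal magnitude and opposite sign, which is why a signed target value can encode which side each agent should pass on.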
XR-DT: Extended Reality-Enhanced Digital Twin for Safe Motion Planning via Human-Aware Model Predictive Path Integral Control
As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial progress has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots' inferences and how robots plan safe and efficient trajectories based on predicted human behaviors. To address these challenges, this paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for mobile robots, which bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates augmented-, virtual-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable XR devices. Within this framework, we design a novel Human-Aware Model Predictive Path Integral (HA-MPPI) control model, an MPPI-based motion planner that incorporates ATLAS (Attention-based Trajectory Learning with Anticipatory Sensing), a multi-modal Transformer model designed for egocentric human trajectory prediction via XR headsets. Extensive real-world experimental results demonstrate accurate human trajectory prediction, and safe and efficient robot navigation, validating the HA-MPPI's effectiveness within the XR-DT framework. By embedding human behavior, environmental dynamics, and robot navigation into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.
comment: 8 pages, 6 figures, 3 tables
FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
Reasoning protocols such as Chain of Thought (CoT) and Tree of Thought (ToT) organize internal deliberation but lack an explicit mechanism for external questioning that elicits self-revision. We present FOR-Prompting (From Objection to Revision Prompting), an asymmetric protocol where a Defender proposes an answer, a Debater (Questioner) raises question-style objections with no direct fixes, and a Host optionally synthesizes the final output. Across GSM8K, FOR-Prompting matches the accuracy of CoT and consistently improves over single-prompting when evaluated under identical model backbones. On small-scale open-source models (e.g., LLaMA-3.2-1B), FOR-Prompting yields substantial gains over direct prompting and performs comparably to lightweight reasoning baselines, highlighting its promise for low-resource and on-device settings. Cross-model role-swapping further shows that performance is primarily determined by the Defender, enabling small models to act effectively as Questioners. Beyond structured math tasks, FOR-Prompting supports refinement in open-ended and multi-stage tasks: qualitative analysis shows improved exploration, coverage, and specificity, and a blind study of human preferences found that participants preferred FOR-Prompting outputs over strong LLM baselines in an itinerary-planning scenario. The protocol is model-agnostic and operates purely through role-structured prompting, requiring no training, access to model internals, or symmetrically strong agents. FOR-Prompting therefore enables scalable study of objection-driven reasoning and offers a practical mechanism for automated iterative refinement across both hosted and local LLMs.
Behavioral Inference at Scale: The Fundamental Asymmetry Between Motivations and Belief Systems
We establish empirical bounds on behavioral inference through controlled experiments at scale: LLM-based agents assigned one of 36 behavioral profiles (9 belief systems x 4 motivations) generate over 1.5 million behavioral sequences across 17,411 games in grid-world environments, providing ground truth unavailable in human behavioral studies. Rather than asking whether inference has limits, we ask how large those limits are, where they concentrate, and why. A fundamental asymmetry emerges in both magnitude and structure. Motivations achieve 98-100% inference accuracy and recover 97% of available mutual information across all architectures. Belief systems plateau at 24% for LSTMs regardless of capacity, recovering only 30% of available information, a 3.3x asymmetry in information extraction efficiency. Transformer architectures with 9-stage curriculum learning reach 49% alignment accuracy, doubling LSTM performance and demonstrating that the recurrent ceiling is architectural rather than fundamental. Yet even this improvement leaves belief systems correctly classified less than half the time, with per-alignment accuracy ranging from 1% (True Neutral) to 72% (Lawful Evil). Confusion analysis maps the failure structure precisely: a "neutral zone" of behavioral ambiguity extends beyond True Neutral to encompass Good alignments, where prosocial behavior is indistinguishable from rule-following or balance-keeping. Combined motivation and belief inference yields 17.6x improvement over random baseline for full 36-class profile classification, while establishing that the bottleneck is entirely located in belief system inference. Signal enhancement and explanatory queries yield only marginal LSTM gains (+3.8%), confirming that the ceiling is information-theoretic rather than data-limited. These bounds have direct implications for any system relying on behavioral monitoring to infer agent values.
Systems and Control (EESS)
Control Barrier Corridors: From Safety Functions to Safe Sets
Safe autonomy is a critical requirement and a key enabler for robots to operate safely in unstructured complex environments. Control barrier functions and safe motion corridors are two widely used but technically distinct safety methods, functional and geometric, respectively, for safe motion planning and control. Control barrier functions are applied to the safety filtering of control inputs to limit the decay rate of system safety, whereas safe motion corridors are geometrically constructed to define a local safe zone around the system state for use in motion optimization and reference-governor design. This paper introduces a new notion of control barrier corridors, which unifies these two approaches by converting control barrier functions into local safe goal regions for reference goal selection in feedback control systems. We show, with examples on fully actuated systems, kinematic unicycles, and linear output regulation systems, that individual state safety can be extended locally over control barrier corridors for convex barrier functions, provided the control convergence rate matches the barrier decay rate, highlighting a trade-off between safety and reactiveness. Such safe control barrier corridors enable safely reachable persistent goal selection over continuously changing barrier corridors during system motion, which we demonstrate for verifiably safe and persistent path following in autonomous exploration of unknown environments.
comment: 12 pages, 6 figures, an extended preprint version of a conference paper
CLAIRE: Compressed Latent Autoencoder for Industrial Representation and Evaluation -- A Deep Learning Framework for Smart Manufacturing
Accurate fault detection in high-dimensional industrial environments remains a major challenge due to the inherent complexity, noise, and redundancy in sensor data. This paper introduces CLAIRE, a hybrid end-to-end learning framework that integrates unsupervised deep representation learning with supervised classification for intelligent quality control in smart manufacturing systems. It employs an optimized deep autoencoder to transform raw input into a compact latent space, effectively capturing the intrinsic data structure while suppressing irrelevant or noisy features. The learned representations are then fed into a downstream classifier to perform binary fault prediction. Experimental results on a high-dimensional dataset demonstrate that CLAIRE significantly outperforms conventional classifiers trained directly on raw features. Moreover, the framework incorporates a post hoc phase, using a game-theory-based interpretability technique, to analyze the latent space and identify the most informative input features contributing to fault predictions. The proposed framework highlights the potential of integrating explainable AI with feature-aware regularization for robust fault detection. The modular and interpretable nature of the proposed framework makes it highly adaptable, offering promising applications in other domains characterized by complex, high-dimensional data, such as healthcare, finance, and environmental monitoring.
comment: 13 pages. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2026
Frequency-Separable Hamiltonian Neural Network for Multi-Timescale Dynamics
While Hamiltonian mechanics provides a powerful inductive bias for neural networks modeling dynamical systems, Hamiltonian Neural Networks and their variants often fail to capture complex temporal dynamics spanning multiple timescales. This limitation is commonly linked to the spectral bias of deep neural networks, which favors learning low-frequency, slow-varying dynamics. Prior approaches have sought to address this issue through symplectic integration schemes that enforce energy conservation or by incorporating geometric constraints to impose structure on the configuration space. However, such methods either remain limited in their ability to fully capture multiscale dynamics or require substantial domain-specific assumptions. In this work, we exploit the observation that Hamiltonian functions admit decompositions into explicit fast and slow modes and can be reconstructed from these components. We introduce the Frequency-Separable Hamiltonian Neural Network (FS-HNN), which parameterizes the system Hamiltonian using multiple networks, each governed by Hamiltonian dynamics and trained on data sampled at distinct timescales. We further extend this framework to partial differential equations by learning state- and boundary-conditioned symplectic operators. Empirically, we show that FS-HNN improves long-horizon extrapolation performance on challenging dynamical systems and generalizes across a broad range of ODE and PDE problems.
AI End-to-End Radiation Treatment Planning Under One Second
Artificial intelligence-based radiation therapy (RT) planning has the potential to reduce planning time and inter-planner variability, improving efficiency and consistency in clinical workflows. Most existing automated approaches rely on multiple dose evaluations and corrections, resulting in plan generation times of several minutes. We introduce AIRT (Artificial Intelligence-based Radiotherapy), an end-to-end deep-learning framework that directly infers deliverable treatment plans from CT images and structure contours. AIRT generates single-arc VMAT prostate plans, from imaging and anatomical inputs to leaf sequencing, in under one second on a single Nvidia A100 GPU. The framework includes a differentiable dose feedback, an adversarial fluence map shaping, and a plan generation augmentation to improve plan quality and robustness. The model was trained on more than 10,000 intact prostate cases. Non-inferiority to RapidPlan Eclipse was demonstrated across target coverage and OAR sparing metrics. Target homogeneity (HI = 0.10 $\pm$ 0.01) and OAR sparing were similar to reference plans when evaluated using AcurosXB. These results represent a significant step toward ultra-fast standardized RT planning and a streamlined clinical workflow.
Star-based Navigation in the Outer Solar System
This paper investigates an autonomous navigation method for spacecraft operating in the outer solar system, up to 250 AU from the Sun, using the parallactic shifts of nearby stars. These measurements enable estimation of the spacecraft trajectory while distant stars provide attitude information through conventional star-pattern matching. Stellar observation models are developed, accounting for delta light-time, parallax, and aberration effects. Navigation performance is assessed using two approaches: (1) a least-squares estimator using simultaneous multi-star measurements, and (2) a Kalman filter processing sequential single-star observations along deep-space trajectories. Monte Carlo simulations on trajectories representative of Voyager 1, Voyager 2, Pioneer 10, Pioneer 11, and New Horizons missions show sub-AU position accuracies at 250 AU, and velocity accuracies better than 0.00004 AU/day, under realistic spacecraft and instrumentation uncertainties. These values correspond to relative errors below 0.4% in position and velocity with respect to the reference trajectories. Although less precise than radiometric tracking, this performance can support navigation in the outer solar system without reliance on Earth. When ground-based navigation remains necessary, this approach can be employed during long cruising phases, lowering the number of ground contacts. The method additionally shows potential for future missions venturing farther from the Sun.
comment: Accepted for publication in the Journal of Guidance, Control, and Dynamics. This is the author's accepted manuscript. The final version of record will be published by AIAA and will be available at the JGCD website
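As a quick consistency check on the figures quoted in the abstract above, the following worked arithmetic relates the 1 AU position bound at 250 AU to the stated 0.4% relative error, and converts the velocity bound into SI units. The only outside value assumed here is the standard length of the astronomical unit, rounded to $1.496\times10^{8}$ km.

```latex
\frac{1\,\mathrm{AU}}{250\,\mathrm{AU}} = 0.004 = 0.4\%,
\qquad
4\times10^{-5}\,\frac{\mathrm{AU}}{\mathrm{day}}
  \times \frac{1.496\times10^{8}\,\mathrm{km}}{1\,\mathrm{AU}}
  \times \frac{1\,\mathrm{day}}{86\,400\,\mathrm{s}}
  \approx 0.069\,\mathrm{km/s} \approx 69\,\mathrm{m/s}
```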
Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI
Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one-way dispatch signals and price alerts that offer little possibility for informed decision-making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and prosumers interact through bidirectional natural language, enabled through agentic AI. A two-tier multi-agent architecture is developed in which an aggregator agent dispatches flexibility requests and a prosumer Home Energy Management System (HEMS) assesses deliverability and cost-benefit by calling an optimization-based tool. CDR also enables prosumer-initiated upstream communication, where changes in preferences can reach the aggregator directly. Proof-of-concept evaluation shows that interactions complete in under 12 seconds. The architecture illustrates how agentic AI can bridge the aggregator-prosumer coordination gap, providing the scalability of automated DR while preserving the transparency, explainability, and user agency necessary for sustained prosumer participation. All system components, including agent prompts, orchestration logic, and simulation interfaces, are released as open source to enable reproducibility and further development.
comment: 6 pages, 2 figures. Code available at: https://github.com/RedaElMakroum/cdr
A Dual-AoI-based Approach for Optimal Transmission Scheduling in Wireless Monitoring Systems with Random Data Arrivals
In the Internet of Things (IoT), the freshness of system status information is crucial for real-time monitoring and decision-making. This paper studies the transmission scheduling problem in wireless monitoring systems, where information freshness -- typically quantified by the Age of Information (AoI) -- is heavily constrained by limited channel resources and influenced by factors such as the randomness of data arrivals and unreliable wireless channels. Such randomness leads to asynchronous AoI evolution at local sensors and the monitoring center, rendering conventional scheduling policies that rely solely on the monitoring center's AoI inefficient. To this end, we propose a dual-AoI model that captures asynchronous AoI dynamics and formulate the problem as minimizing a long-term time-average AoI function. We develop a scheduling policy based on a Markov decision process (MDP) to solve the problem, and analyze the existence and monotonicity of a deterministic stationary optimal policy. Moreover, we derive a low-complexity scheduling policy which exhibits a channel-state-dependent threshold structure. In addition, we establish a necessary and sufficient condition for the stability of the AoI objective. Simulation results demonstrate that the proposed policy outperforms existing approaches.
comment: 15 pages
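To make the dual-AoI idea above concrete, here is a minimal sketch of how two ages might evolve side by side in slotted time. The update convention, the `DualAoI` and `step` names, and the threshold values in `should_transmit` are all illustrative assumptions, not the paper's model or its derived policy.

```python
from dataclasses import dataclass


@dataclass
class DualAoI:
    local_age: int    # age of the freshest packet held at the sensor
    monitor_age: int  # age of the freshest packet received at the monitoring center


def step(state, arrival, delivered):
    """Advance one time slot (assumed update convention, for illustration only)."""
    # A successful delivery copies the sensor's packet to the monitor, so the
    # monitor's age becomes that packet's age plus one slot of transit.
    monitor_age = state.local_age + 1 if delivered else state.monitor_age + 1
    # A fresh arrival resets the local age; otherwise the held packet gets older.
    local_age = 0 if arrival else state.local_age + 1
    return DualAoI(local_age, monitor_age)


def should_transmit(state, channel_good, thresholds=(6, 2)):
    """Hypothetical channel-state-dependent threshold rule on the age gap."""
    gap = state.monitor_age - state.local_age
    # A lower bar applies when the channel is good (assumed values).
    threshold = thresholds[1] if channel_good else thresholds[0]
    return gap >= threshold
```

With a sensor packet of age 2 and a monitor age of 7, this rule would transmit on a good channel (gap 5 against threshold 2) and wait on a bad one (gap 5 against threshold 6).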
Improved hopping control on slopes for small robots using spring mass modeling
Hopping robots often lose balance on slopes because the tilted ground creates unwanted rotation at landing. This work analyzes that effect using a simple spring-mass model and identifies how slope-induced impulses destabilize the robot. To address this, we introduce two straightforward fixes: adjusting the body's touchdown angle based on the slope and applying a small corrective torque before takeoff. Together, these steps effectively cancel the unwanted rotation caused by inclined terrain, allowing the robot to land smoothly and maintain stable hopping even on steep slopes. Moreover, the proposed method remains simple enough to implement on low-cost robotic platforms without requiring complex sensing or computation. By combining this analytical model with minimal control actions, this approach provides a practical path toward reliable hopping on uneven terrain. The results from simulation confirm that even small slope-aware adjustments can dramatically improve landing stability, making the technique suitable for future autonomous field robots that must navigate natural environments such as hills, rubble, and irregular outdoor landscapes.
On Koopman Resolvents and Frequency Response of Nonlinear Systems
This paper proposes a novel formulation of frequency response for nonlinear systems in the Koopman operator framework. This framework is a promising direction for the analysis and synthesis of systems with nonlinear dynamics based on (linear) Koopman operators. We show that the frequency response of a nonlinear plant is derived through the Laplace transform of the output of the plant, which is a generalization of the classical approach to LTI plants and is guided by the resolvent theory of Koopman operators. The response is a complex-valued function of the driving angular frequency, allowing one to draw the so-called Bode plots, which display the gain and phase characteristics. Sufficient conditions for the existence of the frequency response are presented for three classes of dynamics.
comment: 7 pages, 1 figure
Uncertainty-Aware Adaptive Dynamics For Underwater Vehicle-Manipulator Robots
Accurate and adaptive dynamic models are critical for underwater vehicle-manipulator systems where hydrodynamic effects induce time-varying parameters. This paper introduces a novel uncertainty-aware adaptive dynamics model framework that remains linear in lumped vehicle and manipulator parameters, and embeds convex physical consistency constraints during online estimation. Moving horizon estimation is used to stack horizon regressors, enforce realizable inertia, damping, friction, and hydrostatics, and quantify uncertainty from parameter evolution. Experiments on a BlueROV2 Heavy with a 4-DOF manipulator demonstrate rapid convergence and calibrated predictions. Manipulator fits achieve $R^2$ = 0.88 to 0.98 with slopes near unity, while vehicle surge, heave, and roll are reproduced with good fidelity under stronger coupling and noise. Median solver time is approximately 0.023 s per update, confirming online feasibility. A comparison against a fixed parameter model shows consistent reductions in MAE and RMSE across degrees of freedom. Results indicate physically plausible parameters and confidence intervals with near 100% coverage, enabling reliable feedforward control and simulation in underwater environments.
Codebook Design and Baseband Precoding for Pragmatic Array-Fed RIS Hybrid Multiuser MIMO
In our previous work [2], we introduced a hardware- and power-efficient architecture for hybrid digital-analog (HDA) multiuser MIMO (MU-MIMO) based on stacking identical basic modules. Each module consists of a small active multi-antenna feeder (AMAF) placed in the near field of a larger reflective intelligent surface (RIS). Each AMAF is driven by one RF chain and conveys one spatial stream, achieving a multiplexing gain of $K$ with $K$ stacked modules. While [2] focused on module design and efficiency compared to active arrays, performance was evaluated only under pure line-of-sight (LOS) conditions. This work extends our approach in several ways. First, we propose a simple, pragmatic method for designing phase-only flat-top beams for the AMAF-RIS module, enabling wide angular coverage with low ripple and sidelobes. This design supports hierarchical beamforming codebooks for efficient beam acquisition. Second, we evaluate MU-MIMO performance under realistic mmWave multipath channels including both LOS and non-LOS (NLOS) components modeled using a 3D von Mises-Fisher distribution. We propose a low-complexity HDA MU-MIMO framework with: user-beam association via standard beam acquisition; dynamic user grouping (one user per beam); effective baseband MIMO channel estimation using 3GPP-compliant pilots; and downlink transmission with zero-forcing precoding under per-antenna power constraints. Results show high spectral efficiency and multiplexing gain while preserving hardware simplicity and power efficiency. Crucially, the approach is fully compliant with 3GPP 5G NR beam acquisition and sounding reference signaling mechanisms.
Adaptive Data-Driven Min-Max MPC for Linear Time-Varying Systems
In this paper, we propose an adaptive data-driven min-max model predictive control (MPC) scheme for discrete-time linear time-varying (LTV) systems. We assume that prior knowledge of the system dynamics and bounds on their variations are available, and that the states are measured online. Starting from an initial state-feedback gain derived from prior knowledge, the algorithm updates the state-feedback gain using online input-state data. To this end, a semidefinite program (SDP) is solved to minimize an upper bound on the infinite-horizon optimal cost and to derive a corresponding state-feedback gain. We prove that the resulting closed-loop system is exponentially stabilized and satisfies the constraints. Further, we extend the proposed scheme to LTV systems with process noise. The resulting closed-loop system is shown to be robustly stabilized to a robust positive invariant (RPI) set. Finally, the proposed methods are demonstrated by numerical simulations.
Space-Control: Process-Level Isolation for Sharing CXL-based Disaggregated Memory
Memory disaggregation via Compute Express Link (CXL) enables multiple hosts to share remote memory, improving utilization for data-intensive workloads. Today, virtual memory enables process-level isolation on a host and CXL enables host-level isolation. This creates a critical security gap: the absence of process-level memory isolation in shared disaggregated memory. We present Space-Control, a hardware-software co-design that provides fine-grained, process-level isolation for shared disaggregated memory. Space-Control authenticates the execution context in hardware, enforces access control on every memory access, and amortizes lookup times with a small cache. Our design allows up to 127 processes [...] Simulation Toolkit (SST) based CXL model, Space-Control incurs a minimal performance overhead of 3.3%, making shared disaggregated memory isolation practical.
Impact of Work Schedule Flexibility on EV Hosting Capacity: Insights from Analyzing Field Data
Uncoordinated electric vehicle (EV) charging is altering residential load patterns and pushing distribution transformers to operate beyond their limits. These outcomes can be offset by exploiting the flexibility in work schedules (hybrid, remote vs. in-person) of EV owners, particularly when combined with rooftop photovoltaic (PV) generation. However, this phenomenon has not yet been explored in depth. This paper addresses this research gap by introducing weekly work schedule-aware robust and chance-constrained optimization formulations for EV charging coordination to determine a transformer's EV hosting capacity. The results obtained using data from a residential feeder in Arizona indicate that an intelligent combination of work schedule flexibility with PV generation can help power utilities effectively manage changing grid demands.
Adaptive Gain Nonlinear Observer for External Wrench Estimation in Human-UAV Physical Interaction
This paper presents an Adaptive Gain Nonlinear Observer (AGNO) for estimating the external interaction wrench (forces and torques) in human-UAV physical interaction for assistive payload transportation. The proposed AGNO uses the full nonlinear dynamic model to achieve an accurate and robust wrench estimation without relying on dedicated force-torque sensors. A key feature of this approach is the explicit consideration of the non-constant inertia matrix, which is essential for aerial systems with asymmetric mass distribution or shifting payloads. A comprehensive dynamic model of a cooperative transportation system composed of two quadrotors and a shared payload is derived, and the stability of the observer is rigorously established using Lyapunov-based analysis. Simulation results validate the effectiveness of the proposed observer in enabling intuitive and safe human-UAV interaction. Comparative evaluations demonstrate that the proposed AGNO outperforms an Extended Kalman Filter (EKF) in terms of estimation root mean square errors (RMSE), particularly for torque estimation under nonlinear interaction conditions. This approach reduces system weight and cost by eliminating additional sensing hardware, enhancing practical feasibility.
CN-CBF: Composite Neural Control Barrier Function for Safe Robot Navigation in Dynamic Environments
Safe navigation of autonomous robots remains one of the core challenges in the field, especially in dynamic and uncertain environments. One of the prevalent approaches is safety filtering based on control barrier functions (CBFs), which are easy to deploy but difficult to design. Motivated by the shortcomings of existing learning- and model-based methods, we propose a simple yet effective neural CBF design method for safe robot navigation in dynamic environments. We employ the idea of a composite CBF, where multiple neural CBFs are combined into a single CBF. The individual CBFs are trained via the Hamilton-Jacobi reachability framework to approximate the optimal safe set for single moving obstacles. Additionally, we use the residual neural architecture, which guarantees that the estimated safe set does not intersect with the corresponding failure set. The method is extensively evaluated in simulation experiments for a ground robot and a quadrotor, comparing it against several baseline methods. The results show improved success rates of up to 18% compared to the best baseline, without increasing the conservativeness of the motion. Also, the method is demonstrated in hardware experiments for both types of robots.
Quantum Technologies and Edge Devices in Electrical Grids: Opportunities, Challenges, and Future Directions
In modern power systems, edge devices serve as local hubs that collect data, perform on-site computing, sense electrical parameters, execute control actions, and communicate with neighboring edge devices as part of the larger grid. However, as the number of monitored nodes and control loops grows, traditional edge devices face serious limits. They can become overloaded by complex signal processing and decision tasks, causing delays and higher energy use. Standard sensors hit a noise floor that prevents them from detecting miniature changes, making it harder to spot early signs of faults or instability. Meanwhile, conventional communication links struggle with bandwidth limits, security risks, and rising encryption demands, which together slow down and weaken the transfer of critical grid information. Quantum technologies have the potential to overcome these challenges. Quantum computers can deliver exponential speed-ups for optimization and machine-learning tasks that ordinary processors cannot handle. Quantum sensors can sense signals with atomic precision, giving edge devices a more precise view of grid dynamics. Quantum communication techniques, including quantum key distribution, offer methods to achieve information-theoretic security and ensure that information arrives quickly and without tampering. We explore how quantum technologies can be integrated into edge devices, highlighting both opportunities and challenges.
comment: 19 pages, 2 figures. Comments welcome
Performance Comparison of Gate-Based and Adiabatic Quantum Computing for AC Power Flow Problem
We present the first direct comparison between gate-based quantum computing (GQC) and adiabatic quantum computing (AQC) paradigms for solving the AC power flow (PF) equations. The PF problem is reformulated as a combinatorial optimization problem. For the GQC approach, the Quantum Approximate Optimization Algorithm (QAOA) is employed, while for the AQC approach, the problem is formulated as an Ising model. Numerical experiments on a 4-bus test system evaluate solution accuracy and computational performance. Results obtained using QAOA are benchmarked against those produced by D-Wave's Advantage system and Fujitsu's latest-generation Digital Annealer, implemented through the Quantum-Inspired Integrated Optimization (QIIO) software. The findings provide quantitative insights into the performance trade-offs, scalability, and practical viability of GQC and AQC paradigms for PF analysis, highlighting the potential of quantum optimization algorithms to address the computational challenges associated with the operation of modern electricity grids in the fault-tolerant era.
comment: 12 pages, 2 figures, 4 tables
Whole-Body Model-Predictive Control of Legged Robots with MuJoCo ICRA 2026
We demonstrate the surprising real-world effectiveness of a very simple approach to whole-body model-predictive control (MPC) of quadruped and humanoid robots: the iterative LQR (iLQR) algorithm with MuJoCo dynamics and finite-difference approximated derivatives. Building upon the previous success of model-based behavior synthesis and control of locomotion and manipulation tasks with MuJoCo in simulation, we show that these policies can easily generalize to the real world with few sim-to-real considerations. Our baseline method achieves real-time whole-body MPC on a variety of hardware experiments, including dynamic quadruped locomotion, quadruped walking on two legs, and full-sized humanoid bipedal locomotion. We hope this easy-to-reproduce hardware baseline lowers the barrier to entry for real-world whole-body MPC research and contributes to accelerating research velocity in the community. Our code and experiment videos will be available online at: https://johnzhang3.github.io/mujoco_ilqr
comment: to appear at ICRA 2026
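The abstract above relies on finite-difference approximated derivatives of the simulator dynamics. As a rough illustration of that one ingredient, the sketch below computes a central-difference Jacobian of a black-box step function. The `pendulum_step` toy dynamics and all names here are assumptions standing in for a real simulator call; this is not the authors' code and does not use the MuJoCo API.

```python
import math


def finite_difference_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x, where f maps list[float] -> list[float]."""
    n = len(x)
    m = len(f(x))
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        # Perturb one input coordinate at a time in both directions.
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus = f(x_plus)
        f_minus = f(x_minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


def pendulum_step(z, dt=0.01):
    """Toy explicit-Euler pendulum step; z = [theta, omega, torque] (hypothetical stand-in)."""
    theta, omega, torque = z
    return [theta + dt * omega,
            omega + dt * (-9.81 * math.sin(theta) + torque)]


# Jacobian with respect to the stacked state and control at the hanging equilibrium.
J = finite_difference_jacobian(pendulum_step, [0.0, 0.0, 0.0])
```

The first two columns of `J` play the role of the state Jacobian and the last column the control Jacobian in an iLQR backward pass. Each column costs two evaluations of the step function.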
ROSplane 2.0: A Fixed-Wing Autopilot for Research
Unmanned aerial vehicle (UAV) research requires the integration of cutting-edge technology into existing autopilot frameworks. This process can be arduous, requiring extensive resources, time, and detailed knowledge of the existing system. ROSplane is a lean, open-source fixed-wing autonomy stack built by researchers for researchers. It is designed to accelerate research by providing clearly defined interfaces with an easily modifiable framework. Built around ROS 2, ROSplane allows for rapid integration of low or high-level control, path planning, or estimation algorithms. A focus on lean, easily-understood code and extensive documentation lowers the barrier to entry for researchers. Recent developments to ROSplane improve its capacity to accelerate UAV research, including the transition from ROS 1 to ROS 2, enhanced estimation and control algorithms, increased modularity, and an improved aerodynamic modeling pipeline. This aerodynamic modeling pipeline significantly reduces the effort of transitioning from simulation to real-world testing without requiring costly system identification or computational fluid dynamics tools. ROSplane's architecture reduces the effort required to integrate new research tools and methods, expediting hardware experimentation.
comment: Submitted to the 2026 International Conference on Unmanned Aerial Systems
ROSflight 2.0: Lean ROS 2-Based Autopilot for Unmanned Aerial Vehicles
ROSflight is a lean, open-source autopilot ecosystem for unmanned aerial vehicles (UAVs). Designed by researchers for researchers, it is built to lower the barrier to entry to UAV research and accelerate the transition from simulation to hardware experiments by maintaining a lean (not full-featured), well-documented, and modular codebase. This publication builds on previous treatments and describes significant additions to the architecture that improve the modularity and usability of ROSflight, including the transition from ROS 1 to ROS 2, supported hardware, low-level actuator mixing, and the simulation environment. We believe that these changes improve the usability of ROSflight and enable ROSflight to accelerate research in areas like advanced air mobility. Hardware results are provided, showing that ROSflight is able to control a multirotor over a serial connection at 400 Hz while closing all control loops on the companion computer.
comment: Submitted to the 2026 International Conference on Unmanned Aerial Systems
MARLIN: Multi-Agent Reinforcement Learning with Murmuration Intelligence and LLM Guidance for Reservoir Management AAMAS'26
As climate change intensifies extreme weather events, water disasters pose growing threats to global communities, making adaptive reservoir management critical for protecting vulnerable populations and ensuring water security. Modern water resource management faces unprecedented challenges from cascading uncertainties propagating through interconnected reservoir networks. These uncertainties, rooted in physical water transfer losses and environmental variability, make precise control difficult. For example, sending 10 tons downstream may yield only 8-12 tons due to evaporation and seepage. Traditional centralized optimization approaches suffer from exponential computational complexity and cannot effectively handle such real-world uncertainties, while existing multi-agent reinforcement learning (MARL) methods fail to achieve effective coordination under uncertainty. To address these challenges, we present MARLIN, a decentralized reservoir management framework inspired by starling murmuration intelligence. Integrating bio-inspired alignment, separation, and cohesion rules with MARL, MARLIN enables individual reservoirs to make local decisions while achieving emergent global coordination. In addition, an LLM provides real-time reward shaping signals, guiding agents to adapt to environmental changes and human-defined preferences. Experiments on USGS data show that MARLIN improves uncertainty handling by 23%, cuts computation by 35%, and accelerates flood response by 68%, exhibiting super-linear coordination, with complexity scaling 5.4x from 400 to 10,000 nodes. These results demonstrate MARLIN's potential for disaster prevention and protecting communities through intelligent, scalable water resource management.
comment: AAMAS'26
Mixed Monotonicity Reachability Analysis of Neural ODE: A Trade-Off Between Tightness and Efficiency NeurIPS 2025
Neural ordinary differential equations (neural ODE) are powerful continuous-time machine learning models for depicting the behavior of complex dynamical systems, but their verification remains challenging due to limited reachability analysis tools adapted to them. We propose a novel interval-based reachability method that leverages continuous-time mixed monotonicity techniques for dynamical systems to compute an over-approximation for the neural ODE reachable sets. By exploiting the geometric structure of full initial sets and their boundaries via the homeomorphism property, our approach ensures efficient bound propagation. By embedding neural ODE dynamics into a mixed monotone system, our interval-based reachability approach, implemented in TIRA with single-step, incremental, and boundary-based approaches, provides sound and computationally efficient over-approximations compared with CORA's zonotopes and NNV2.0 star set representations, while trading tightness for efficiency. This trade-off makes our method particularly suited for high-dimensional, real-time, and safety-critical applications. Applying mixed monotonicity to neural ODE reachability analysis paves the way for lightweight formal analysis by leveraging the symmetric structure of monotone embeddings and the geometric simplicity of interval boxes, opening new avenues for scalable verification. This novel approach is illustrated on two numerical examples of a spiral system and a fixed-point attractor system modeled as a neural ODE.
comment: 27 pages, 11 figures, Accepted for publication in PMLR proceedings of NeurReps 2025 co-located with NeurIPS 2025
C*: A Coverage Path Planning Algorithm for Unknown Environments using Rapidly Covering Graphs
The paper presents a novel sample-based algorithm, called C*, for real-time coverage path planning (CPP) of unknown environments. C* is built upon the concept of a Rapidly Covering Graph (RCG), which is incrementally constructed during robot navigation via progressive sampling of the search space. By using efficient sampling and pruning techniques, the RCG is constructed to be a minimum-sufficient graph, where its nodes and edges form the potential waypoints and segments of the coverage trajectory, respectively. The RCG tracks the coverage progress, generates the coverage trajectory, and helps the robot escape from dead-end situations. To minimize coverage time, C* produces the desired back-and-forth coverage pattern, while adapting to the TSP-based optimal coverage of local isolated regions, called coverage holes, which are surrounded by obstacles and covered regions. It is analytically proven that C* provides complete coverage of unknown environments. The algorithmic simplicity and low computational complexity of C* make it easy to implement and suitable for real-time on-board applications. The performance of C* is validated by 1) extensive high-fidelity simulations and 2) laboratory experiments using an autonomous robot. C* yields near-optimal trajectories, and a comparative evaluation with seven existing CPP methods demonstrates significant improvements in performance in terms of coverage time, number of turns, trajectory length, and overlap ratio, while preventing the formation of coverage holes. Finally, C* is comparatively evaluated on two different CPP applications using 1) energy-constrained robots and 2) multi-robot teams.
Data-Driven Estimation of Quadrotor Motor Efficiency via Residual Minimization
A data-driven framework is proposed for online estimation of quadrotor motor efficiency via residual minimization. The problem is formulated as a constrained nonlinear optimization that minimizes trajectory residuals between measured flight data and predictions generated by a quadrotor dynamics model. A sliding-window strategy enables online estimation, and the optimization is efficiently solved using an iteratively reweighted least squares (IRLS) scheme combined with a primal-dual interior-point method, with inequality constraints enforced through a logarithmic barrier function. Robust z-score weighting is employed to reject outliers, which is particularly effective in motor clipping scenarios where the proposed estimator exhibits smaller spikes than an EKF baseline. Compared to traditional filter-based approaches, the batch-mode formulation allows selective inclusion of data segments via IRLS reweighting and hard-rejection. This structure is well-suited for online estimation and supports applications such as fault detection and isolation (FDI), health monitoring, and predictive maintenance in aerial robotic systems. Simulation results under various degradation scenarios demonstrate the accuracy and robustness of the proposed estimator.
comment: Accepted final version to appear in: American Control Conference, 2026
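The reweighting idea in the abstract above can be illustrated on a much smaller problem. The sketch below fits a single efficiency factor by iteratively reweighted least squares, using a robust z-score (median and MAD) to hard-reject outliers. It is a simplified stand-in under stated assumptions: the scalar model `measured = eta * commanded`, the 2.5 cutoff, and all names are hypothetical, and the paper's constrained interior-point solver and quadrotor dynamics are not reproduced.

```python
from statistics import median


def estimate_efficiency(commanded, measured, cutoff=2.5, iterations=10):
    """Fit measured ~ eta * commanded by IRLS with robust z-score hard rejection."""
    weights = [1.0] * len(commanded)
    eta = 0.0
    for _ in range(iterations):
        # Weighted least-squares solution for the single unknown eta.
        num = sum(w * u * y for w, u, y in zip(weights, commanded, measured))
        den = sum(w * u * u for w, u in zip(weights, commanded))
        if den == 0.0:
            break  # every sample was rejected; keep the last estimate
        eta = num / den
        residuals = [y - eta * u for u, y in zip(commanded, measured)]
        center = median(residuals)
        # Median absolute deviation, floored to avoid division by zero.
        mad = max(median(abs(r - center) for r in residuals), 1e-9)
        # Robust z-score; 0.6745 rescales the MAD to a standard deviation
        # for Gaussian residuals.
        z = [0.6745 * (r - center) / mad for r in residuals]
        weights = [0.0 if abs(zi) > cutoff else 1.0 for zi in z]
    return eta, weights


# Synthetic window: true efficiency 0.8, with two clipped samples as outliers.
commanded = [float(u) for u in range(1, 11)]
measured = [0.8 * u for u in commanded]
measured[3] = 1.0  # clipped sample (hypothetical)
measured[8] = 2.0  # clipped sample (hypothetical)
eta, weights = estimate_efficiency(commanded, measured)
```

In this synthetic window an ordinary least-squares fit is pulled well below 0.8 by the two clipped samples. The reweighted fit assigns them zero weight and recovers the underlying value.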
VISKY: Virtual Inertia Skyhook Control for Semi-Active Suspension Systems Using Magnetorheological Dampers IROS
This paper presents a Virtual Inertia Skyhook (VISKY) controller for magnetorheological (MR) dampers in semi-active suspensions. The proposed law is derived from a continuous sky-ground damping baseline augmented with acceleration feedback on the sprung and unsprung masses. In the closed-loop equations, these acceleration terms appear as a mass-like virtual inertia matrix rather than as a change in physical hardware. This interpretation motivates the VISKY name while making the underlying sky-ground hybrid structure explicit. Numerical evaluations under half-sine bump, representative ISO 8608 random-road and while-acceleration metrics relative to conventional Skygroundhook, with the largest gains appearing near the wheel-hop mode. The controller retains low computational overhead because it requires only algebraic force computation and bounded MR inversion.
comment: This work has been submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication
Multi-UAV Flood Monitoring via CVT with Gaussian Mixture of Density Functions for Coverage Control
This study presents a control strategy for coordinating multiple unmanned aerial vehicles (UAVs) to monitor unknown flood regions and estimate the extent of inundation. The proposed method adopts a density-driven coverage framework based on Centroidal Voronoi Tessellation (CVT), in which the density function is modeled using a Gaussian Mixture of Density Functions (GMDF). This formulation provides a more accurate characterization of inundated areas compared to conventional axis-aligned Gaussian models. The performance of the two density modeling approaches is systematically evaluated under different UAV fleet sizes (16, 20, and 24), with multiple simulation trials conducted in the ROS/Gazebo environment. The results show that the GMDF-based formulation consistently achieves higher coverage rates, demonstrating its effectiveness in enhancing flood monitoring and improving UAV spatial distribution.
comment: The authors have identified an error in the simulation data used in the experiments, which affects the results reported in the manuscript. Therefore, the authors have decided to withdraw the paper
Imperfect Competition in Markets for Short-Circuit Current Services
An important limitation of Inverter-Based Resources (IBR) is their reduced contribution to Short-Circuit Current (SCC), as compared to that of Synchronous Generators (SGs). With increasing penetration of IBR in most power systems, the declining SCC poses challenges to secure system operation, as line protections may not trip when required. In order to address this issue, the SCC ancillary service could be procured via an economic mechanism, aiming at securing adequate SCC on all buses. However, the suitability of markets for SCC services is not well understood, given that these could be prone to market power issues: since the SCC contributions from various SGs to a certain bus are determined by the electrical topology of the grid, this is a highly local service. It is necessary to understand if SGs at advantageous electrical locations could exert market power and, if so, how it could be mitigated. In order to fill this gap, this paper, for the first time, adopts an SCC-constrained bilevel model to investigate strategic behaviors of SGs. To address the non-convexity due to unit commitment variables, the model is restructured through a primal-dual formulation. Based on a modified IEEE 30-bus system, cases with strategic SGs placed at different buses are analyzed. These studies demonstrate that strategic agents exerting market power by manipulating service prices and extending operating periods could achieve up to triple revenues from SCC provision, which reduces market efficiency and would increase the financial burden on consumers. These findings highlight the need for careful market design, for which potential measures to mitigate these market power issues are also discussed.
comment: Ancillary services, short-circuit current, market power, bilevel optimization, primal-dual formulation. A paper submitted to
StochasticBarrier.jl: A Toolbox for Stochastic Barrier Function Synthesis
We present StochasticBarrier.jl, an open-source Julia-based toolbox for generating Stochastic Barrier Functions (SBFs) for safety verification of discrete-time stochastic systems with additive Gaussian noise. StochasticBarrier.jl certifies linear, polynomial, and piecewise affine (PWA) systems. The latter enables verification for a wide range of system dynamics, including general nonlinear types. The toolbox implements a Sum-of-Squares (SOS) optimization approach, as well as methods based on piecewise constant (PWC) functions. For SOS-based SBFs, StochasticBarrier.jl leverages semi-definite programming solvers, while for PWC SBFs, it offers three engines: two using linear programming (LP) and one based on gradient descent (GD). Benchmarking StochasticBarrier.jl against the state-of-the-art shows that the tool outperforms existing tools in computation time, safety probability bounds, and scalability across over 30 case studies. Compared to its closest competitor, StochasticBarrier.jl is up to four orders of magnitude faster, achieves significant safety probability improvements, and supports higher-dimensional systems.
Automatic Link Selection in Multi-Channel Multiple Access with Link Failures
This paper focuses on the problem of automatic link selection in multi-channel multiple access control using bandit feedback. In particular, a controller assigns multiple users to multiple channels in a time-slotted system, where in each time slot, at most one user can be assigned to a given channel, and at most one channel can be assigned to a given user. Given that user $i$ is assigned to channel $j$, the transmission fails with a fixed unknown probability $1-q_{i,j}$. The assignments are made dynamically using success/failure feedback. The goal is to maximize the time-average utility, where we consider an arbitrary (possibly nonsmooth) concave, entrywise nondecreasing utility function. The first proposed algorithm has fast $\mathcal{O}(\sqrt{\log(T)/T})$ convergence. However, this algorithm requires solving a convex optimization problem within each iteration, which can be computationally expensive. The second algorithm has slower $\mathcal{O}(\sqrt[3]{\log(T)/T})$ convergence, while avoiding the costly inner optimization. Both of these algorithms are adaptive. In particular, the convergence guarantee holds for any interval of $T$ consecutive slots during which the success probabilities do not change. We further study several special cases. In the single-channel setting, we obtain both fast $\mathcal{O}(\sqrt{\log(T)/T})$ convergence and efficient implementation via a simpler adaptive mechanism. We also consider a UCB-based non-adaptive algorithm with max-weight-type decisions. Simulations highlight intriguing performance trade-offs and demonstrate rapid adaptation of the proposed adaptive schemes.
XR-DT: Extended Reality-Enhanced Digital Twin for Safe Motion Planning via Human-Aware Model Predictive Path Integral Control
As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial progress has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots' inferences and how robots plan safe and efficient trajectories based on predicted human behaviors. To address these challenges, this paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for mobile robots, which bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates augmented-, virtual-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable XR devices. Within this framework, we design a novel Human-Aware Model Predictive Path Integral (HA-MPPI) control model, an MPPI-based motion planner that incorporates ATLAS (Attention-based Trajectory Learning with Anticipatory Sensing), a multi-modal Transformer model designed for egocentric human trajectory prediction via XR headsets. Extensive real-world experimental results demonstrate accurate human trajectory prediction, and safe and efficient robot navigation, validating the HA-MPPI's effectiveness within the XR-DT framework. By embedding human behavior, environmental dynamics, and robot navigation into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.
comment: 8 pages, 6 figures, 3 tables
Real-Time Learning of Predictive Dynamic Obstacle Models for Robotic Motion Planning ICRA
Autonomous systems often must predict the motions of nearby agents from partial and noisy data. This paper asks and answers the question: "can we learn, in real-time, a nonlinear predictive model of another agent's motions?" Our online framework denoises and forecasts such dynamics using a modified sliding-window Hankel Dynamic Mode Decomposition (Hankel-DMD). Partial noisy measurements are embedded into a Hankel matrix, while an associated Page matrix enables singular-value hard thresholding (SVHT) to estimate the effective rank. A Cadzow projection enforces structured low-rank consistency, yielding a denoised trajectory and local noise variance estimates. From this representation, a time-varying Hankel-DMD lifted linear predictor is constructed for multi-step forecasts. The residual analysis provides variance-tracking signals that can support downstream estimators and risk-aware planning. We validate the approach in simulation under Gaussian and heavy-tailed noise, and experimentally on a dynamic crane testbed. Results show that the method achieves stable variance-aware denoising and short-horizon prediction suitable for integration into real-time control frameworks.
comment: 10 pages, 6 figures, submitted to IEEE International Conference on Robotics and Automation (ICRA) 2025
Admittance Matrix Concentration Inequalities for Understanding Uncertain Power Networks
This paper presents conservative probabilistic bounds for the spectrum of the admittance matrix and classical linear power flow models under uncertain network parameters; for example, probabilistic line contingencies. Our proposed approach imports tools from probability theory, such as concentration inequalities for random matrices. This provides a theoretical framework for understanding error bounds of common approximations of the AC power flow equations under parameter uncertainty, including the DC and LinDistFlow approximations. Additionally, we show that the upper bounds scale as functions of nodal criticality. This network-theoretic quantity captures how uncertainty concentrates at critical nodes for use in contingency analysis. We validate these bounds on IEEE test networks, demonstrating that they correctly capture the scaling behavior of spectral perturbations up to conservative constants.
comment: 9 pages, 2 figures
PAD-TRO: Projection-Augmented Diffusion for Direct Trajectory Optimization
Recently, diffusion models have gained popularity and attention in trajectory optimization due to their capability of modeling multi-modal probability distributions. However, addressing nonlinear equality constraints, i.e., dynamic feasibility, remains a great challenge in diffusion-based trajectory optimization. Recent diffusion-based trajectory optimization frameworks rely on a single-shooting style approach where the denoised control sequence is applied to forward propagate the dynamical system, which cannot explicitly enforce constraints on the states and frequently leads to sub-optimal solutions. In this work, we propose a novel direct trajectory optimization approach via model-based diffusion, which directly generates a sequence of states. To ensure dynamic feasibility, we propose a gradient-free projection mechanism that is incorporated into the reverse diffusion process. Our results show that, compared to a recent state-of-the-art baseline, our approach leads to zero dynamic feasibility error and approximately 4x higher success rate in a quadrotor waypoint navigation scenario involving dense static obstacles.
comment: Final manuscript. Accepted for publication at the 2026 American Control Conference
Discovering and exploiting active sensing motifs for estimation
From organisms to machines, autonomous systems rely on measured sensory cues to estimate unknown information about themselves or their environment. For nonlinear systems, strategic sensor motion can be leveraged to extract otherwise inaccessible information. This principle, known as active sensing, is widespread in biology yet difficult to study, and remains underutilized in engineered systems due to the challenge of systematically designing active sensing motifs. Here, we introduce the method "BOUNDS: Bounding Observability for Uncertain Nonlinear Dynamic Systems" and the Python package pybounds, which can discover movement motifs that increase the information encoded in sensory cues. To exploit sporadic estimates from bouts of active sensing, we further introduce the Augmented Information Kalman Filter (AI-KF). The AI-KF uses insight from BOUNDS to dynamically fuse neural network and model-based estimation. We demonstrate BOUNDS and the AI-KF on a flying agent model and experimental GPS-denied data from a quadcopter, revealing how specific active movements improve estimates of ground speed, altitude, and wind direction. Altogether, our work will prove useful for designing sensor-minimal autonomous systems and investigating active sensing in living organisms.
comment: 24 pages, 11 figures
Robotics
RoboPocket: Improve Robot Policies Instantly with Your Phone
Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments with a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.
comment: Project page: https://robo-pocket.github.io
Safe-SAGE: Social-Semantic Adaptive Guidance for Safe Engagement through Laplace-Modulated Poisson Safety Functions
Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi-sensor point clouds with vision-based instance segmentation and persistent object tracking to maintain up-to-date semantics beyond the camera's field of view. A multi-layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi-agent passing norms for different obstacles in the environment. Our framework enables legged robots to navigate semantically rich, dynamic environments with context-dependent safety margins while maintaining rigorous safety guarantees.
cuRoboV2: Dynamics-Aware Motion Generation with Depth-Fused Distance Fields for High-DoF Robots
Effective robot autonomy requires motion generation that is safe, feasible, and reactive. Current methods are fragmented: fast planners output physically unexecutable trajectories, reactive controllers struggle with high-fidelity perception, and existing solvers fail on high-DoF systems. We present cuRoboV2, a unified framework with three key innovations: (1) B-spline trajectory optimization that enforces smoothness and torque limits; (2) a GPU-native TSDF/ESDF perception pipeline that generates dense signed distance fields covering the full workspace, unlike existing methods that only provide distances within sparsely allocated blocks, up to 10x faster and in 8x less memory than the state-of-the-art at manipulation scale, with up to 99% collision recall; and (3) scalable GPU-native whole-body computation, namely topology-aware kinematics, differentiable inverse dynamics, and map-reduce self-collision, that achieves up to 61x speedup while also extending to high-DoF humanoids (where previous GPU implementations fail). On benchmarks, cuRoboV2 achieves 99.7% success under 3kg payload (where baselines achieve only 72--77%), 99.6% collision-free IK on a 48-DoF humanoid (where prior methods fail entirely), and 89.5% retargeting constraint satisfaction (vs. 61% for PyRoki); these collision-free motions yield locomotion policies with 21% lower tracking error than PyRoki and 12x lower cross-seed variance than mink. A ground-up codebase redesign for discoverability enabled LLM coding assistants to author up to 73% of new modules, including hand-optimized CUDA kernels, demonstrating that well-structured robotics code can unlock productive human--LLM collaboration. Together, these advances provide a unified, dynamics-aware motion generation stack that scales from single-arm manipulators to full humanoids.
comment: cuRoboV2 Technical Report
Observing and Controlling Features in Vision-Language-Action Models
Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs/outputs and often hybrid nature of transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how the internal model representations relate to their output behavior, do not trivially transfer to VLA counterparts. In this work, we propose to close this gap by introducing and analyzing two main concepts: feature-observability and feature-controllability. In particular, we first study features that are linearly encoded in representation space, and show how they can be observed by means of a linear classifier. Then, we use a minimal linear intervention grounded in optimal control to accurately place internal representations and steer the VLA's output towards a desired region. Our results show that targeted, lightweight interventions can reliably steer a robot's behavior while preserving closed-loop capabilities. We demonstrate on different VLA architectures ($π_{0.5}$ and OpenVLA) through simulation experiments that VLAs possess interpretable internal structure amenable to online adaptation without fine-tuning, enabling real-time alignment with user preferences and task requirements.
Residual RL--MPC for Robust Microrobotic Cell Pushing Under Time-Varying Flow
Contact-rich micromanipulation in microfluidic flow is challenging because small disturbances can break pushing contact and induce large lateral drift. We study planar cell pushing with a magnetic rolling microrobot that tracks a waypoint-sampled reference curve under time-varying Poiseuille flow. We propose a hybrid controller that augments a nominal MPC with a learned residual policy trained by SAC. The policy outputs a bounded 2D velocity correction that is contact-gated, so residual actions are applied only during robot--cell contact, preserving reliable approach behavior and stabilizing learning. All methods share the same actuation interface and speed envelope for fair comparisons. Experiments show improved robustness and tracking accuracy over pure MPC and PID under nonstationary flow, with generalization from a clover training curve to unseen circle and square trajectories. A residual-bound sweep identifies an intermediate correction limit as the best trade-off, which we use in all benchmarks.
comment: 8 pages, 8 figures
Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model CVPR 2026
World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but their application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that employs the CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
comment: CVPR 2026
PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking
In the domain of humanoid robot control, the fusion of Vision-Language-Action (VLA) with whole-body control is essential for semantically guided execution of real-world tasks. However, existing methods encounter challenges in terms of low VLA inference efficiency or an absence of effective semantic guidance for whole-body control, resulting in instability in dynamic limb-coordinated tasks. To bridge this gap, we present a semantic-motion intent guided, physics-aware multi-brain VLA framework for humanoid whole-body control. A series of experiments was conducted to evaluate the performance of the proposed framework. The experimental results demonstrated that the framework enabled reliable vision-language-guided full-body coordination for humanoid robots.
ROScopter: A Multirotor Autopilot based on ROSflight 2.0
ROScopter is a lean multirotor autopilot built for researchers. ROScopter seeks to accelerate simulation and hardware testing of research code with an architecture that is both easy to understand and simple to modify. ROScopter is designed to interface with ROSflight 2.0 and runs entirely on an onboard flight computer, leveraging the features of ROS 2 to improve modularity. This work describes the architecture of ROScopter and how it can be used to test application code in both simulated and hardware environments. Hardware results of the default ROScopter behavior are presented, showing that ROScopter achieves similar performance to another state-of-the-art autopilot for basic waypoint-following maneuvers, but with a significantly reduced and more modular code-base.
Loop Closure via Maximal Cliques in 3D LiDAR-Based SLAM
Reliable loop closure detection remains a critical challenge in 3D LiDAR-based SLAM, especially under sensor noise, environmental ambiguity, and viewpoint variation conditions. RANSAC is often used in the context of loop closures for geometric model fitting in the presence of outliers. However, this approach may fail, leading to map inconsistency. We introduce a novel deterministic algorithm, CliReg, for loop closure validation that replaces RANSAC verification with a maximal clique search over a compatibility graph of feature correspondences. This formulation avoids random sampling and increases robustness in the presence of noise and outliers. We integrated our approach into a real-time pipeline employing binary 3D descriptors and a Hamming distance embedding binary search tree-based matching. We evaluated it on multiple real-world datasets featuring diverse LiDAR sensors. The results demonstrate that our proposed technique consistently achieves a lower pose error and more reliable loop closures than RANSAC, especially in sparse or ambiguous conditions. Additional experiments on 2D projection-based maps confirm its generality across spatial domains, making our approach a robust and efficient alternative for loop closure detection.
comment: Accepted in the 2025 European Conference on Mobile Robots (ECMR). This is the author's version of the work
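The clique-based validation idea in the abstract above can be sketched in a few lines. This is an illustrative sketch and not the CliReg implementation: the pairwise test (a rigid transform preserves distances between points), the tolerance `tol`, and all function names are assumptions, and the search is a plain Bron-Kerbosch recursion with pivoting.

```python
import math
from itertools import combinations


def compatibility_graph(src, dst, tol=0.05):
    """Connect correspondences i and j if they preserve pairwise distance.

    src[i] and dst[i] are the matched 3D points of correspondence i.
    A rigid motion keeps |src_i - src_j| equal to |dst_i - dst_j|.
    """
    adj = {i: set() for i in range(len(src))}
    for i, j in combinations(range(len(src)), 2):
        gap = abs(math.dist(src[i], src[j]) - math.dist(dst[i], dst[j]))
        if gap <= tol:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximum_clique(adj):
    """Return the largest clique found by Bron-Kerbosch with pivoting."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        if len(r) + len(p) <= len(best):
            return  # this branch cannot beat the current best clique
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


if __name__ == "__main__":
    src = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2), (3, 1, 0)]
    # The first four matches follow a pure translation; the last two are wrong.
    dst = [(1, 2, 3), (2, 2, 3), (1, 3, 3), (1, 2, 4), (10, 0, 0), (-5, 7, 1)]
    print(maximum_clique(compatibility_graph(src, dst)))
```

The search is deterministic, so repeated runs on the same correspondences return the same inlier set, in contrast to a randomly sampled hypothesis test.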
Accelerating Sampling-Based Control via Learned Linear Koopman Dynamics
This paper presents an efficient model predictive path integral (MPPI) control framework for systems with complex nonlinear dynamics. To improve the computational efficiency of classic MPPI while preserving control performance, we replace the nonlinear dynamics used for trajectory propagation with a learned linear deep Koopman operator (DKO) model, enabling faster rollout and more efficient trajectory sampling. The DKO dynamics are learned directly from interaction data, eliminating the need for analytical system models. The resulting controller, termed MPPI-DK, is evaluated in simulation on pendulum balancing and surface vehicle navigation tasks, and validated on hardware through reference-tracking experiments on a quadruped robot. Experimental results demonstrate that MPPI-DK achieves control performance close to MPPI with true dynamics while substantially reducing computational cost, enabling efficient real-time control on robotic platforms.
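The benefit of linear rollouts in sampling-based control can be seen in a small sketch. This is a toy illustration under stated assumptions, not MPPI-DK: the matrices `A` and `B` stand in for a learned lifted linear model, the cost tracks only the first state component, and all names and parameters are hypothetical. Averaging the clipped candidate sequences is a simple variant of the usual noise-weighted update.

```python
import math
import random


def step_linear(A, B, z, u):
    """One step of z_next = A z + B u for a scalar control u."""
    return [sum(a * zj for a, zj in zip(row, z)) + b * u for row, b in zip(A, B)]


def rollout_cost(A, B, z0, controls, target):
    """Sum of squared errors between the first state component and the target."""
    z, cost = list(z0), 0.0
    for u in controls:
        z = step_linear(A, B, z, u)
        cost += (z[0] - target) ** 2
    return cost


def softmin_weights(costs, lam):
    """Path-integral weights exp(-(c - c_min) / lam), normalized to sum to one."""
    c_min = min(costs)
    raw = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(raw)
    return [r / total for r in raw]


def mppi_update(A, B, z0, nominal, target, rng,
                samples=64, sigma=0.5, lam=1.0, u_max=1.0):
    """Return an updated control sequence from sampled linear rollouts."""
    candidates = []
    for _ in range(samples):
        # Perturb the nominal sequence with Gaussian noise and clip to bounds.
        seq = [max(-u_max, min(u_max, u + rng.gauss(0.0, sigma))) for u in nominal]
        candidates.append(seq)
    costs = [rollout_cost(A, B, z0, seq, target) for seq in candidates]
    weights = softmin_weights(costs, lam)
    # Cost-weighted average of the candidate sequences, one entry per time step.
    return [sum(w * seq[t] for w, seq in zip(weights, candidates))
            for t in range(len(nominal))]


if __name__ == "__main__":
    A = [[1.0, 0.1], [0.0, 1.0]]  # assumed position/velocity model
    B = [0.0, 0.1]
    rng = random.Random(0)
    print(mppi_update(A, B, [0.0, 0.0], [0.0] * 5, 1.0, rng))
```

Each rollout here costs only a matrix-vector product per step, which is the property a linear surrogate model is meant to exploit.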
OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision--language navigation (VLN) and vision--language--action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.
Omni-Manip: Beyond-FOV Large-Workspace Humanoid Manipulation with Omnidirectional 3D Perception
The deployment of humanoid robots for dexterous manipulation in unstructured environments remains challenging due to perceptual limitations that constrain the effective workspace. In scenarios where physical constraints prevent the robot from repositioning itself, maintaining omnidirectional awareness becomes far more critical than color or semantic information. While recent advances in visuomotor policy learning have improved manipulation capabilities, conventional RGB-D solutions suffer from narrow fields of view (FOV) and self-occlusion, requiring frequent base movements that introduce motion uncertainty and safety risks. Existing approaches to expanding perception, including active vision systems and third-view cameras, introduce mechanical complexity, calibration dependencies, and latency that hinder reliable real-time performance. In this work, we propose Omni-Manip, an end-to-end LiDAR-driven 3D visuomotor policy that enables robust manipulation in large workspaces. Our method processes panoramic point clouds through a Time-Aware Attention Pooling mechanism, efficiently encoding sparse 3D data while capturing temporal dependencies. This 360° perception allows the robot to interact with objects across wide areas without frequent repositioning. To support policy learning, we develop a whole-body teleoperation system for efficient data collection on full-body coordination. Extensive experiments in simulation and real-world environments show that Omni-Manip achieves robust performance in large-workspace and cluttered scenarios, outperforming baselines that rely on egocentric depth cameras.
comment: 8 pages, 6 figures
CT-Enabled Patient-Specific Simulation and Contact-Aware Robotic Planning for Cochlear Implantation
Robotic cochlear-implant (CI) insertion requires precise prediction and regulation of contact forces to minimize intracochlear trauma and prevent failure modes such as locking and buckling. Aligned with the integration of advanced medical imaging and robotics for autonomous, precision interventions, this paper presents a unified CT-to-simulation pipeline for contact-aware insertion planning and validation. We develop a low-dimensional, differentiable Cosserat-rod model of the electrode array coupled with frictional contact and pseudo-dynamics regularization to ensure continuous stick-slip transitions. Patient-specific cochlear anatomy is reconstructed from CT imaging and encoded via an analytic parametrization of the scala-tympani lumen, enabling efficient and differentiable contact queries through closest-point projection. Based on a differentiated equilibrium-constraint formulation, we derive an online direction-update law under an RCM-like constraint that suppresses lateral insertion forces while maintaining axial advancement. Simulations and benchtop experiments validate deformation and force trends, demonstrating reduced locking/buckling risk and improved insertion depth. The study highlights how CT-based imaging enhances modeling, planning, and safety capabilities in robot-assisted inner-ear procedures.
UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data ICRA
Grasping is a fundamental capability for robots to interact with the physical world. Humans, equipped with two hands, autonomously select appropriate grasp strategies based on the shape, size, and weight of objects, enabling robust grasping and subsequent manipulation. In contrast, current robotic grasping remains limited, particularly in multi-strategy settings. Although substantial efforts have targeted parallel-gripper and single-hand grasping, dexterous grasping for bimanual robots remains underexplored, with data being a primary bottleneck. Achieving physically plausible and geometrically conforming grasps that can withstand external wrenches poses significant challenges. To address these issues, we introduce UltraDexGrasp, a framework for universal dexterous grasping with bimanual robots. The proposed data-generation pipeline integrates optimization-based grasp synthesis with planning-based demonstration generation, yielding high-quality and diverse trajectories across multiple grasp strategies. With this framework, we curate UltraDexGrasp-20M, a large-scale, multi-strategy grasp dataset comprising 20 million frames across 1,000 objects. Based on UltraDexGrasp-20M, we further develop a simple yet effective grasp policy that takes point clouds as input, aggregates scene features via unidirectional attention, and predicts control commands. Trained exclusively on synthetic data, the policy achieves robust zero-shot sim-to-real transfer and consistently succeeds on novel objects with varied shapes, sizes, and weights, attaining an average success rate of 81.2% in real-world universal dexterous grasping. To facilitate future research on grasping with bimanual robots, we open-source the data generation pipeline at https://github.com/InternRobotics/UltraDexGrasp.
comment: Published at International Conference on Robotics and Automation (ICRA) 2026
Constraint-Free Static Modeling of Continuum Parallel Robot
Continuum parallel robots (CPR) combine rigid actuation mechanisms with multiple elastic rods in a closed-loop topology, making forward statics challenging when rigid-continuum junctions are enforced by explicit kinematic constraints. Such constraint-based formulations typically introduce additional algebraic variables and complicate both numerical solution and downstream control. This paper presents a geometrically exact, configuration-based and constraint-free static model of CPR that remains valid under geometrically nonlinear, large-deformation and large-rotation conditions. Connectivity constraints are eliminated by kinematic embedding, yielding a reduced unconstrained problem. Each rod of the CPR is discretized by nodal poses on SE(3), while the element-wise strain field is reconstructed through a linear strain parameterization. A fourth-order Magnus approximation yields an explicit and geometrically consistent mapping between element end poses and the strain. Rigid attachments at the motor-driven base and the end-effector platforms are handled through kinematic embeddings. Based on total potential energy and virtual work, we derive assembly-ready residuals and explicit Newton tangents, and solve the resulting nonlinear equilibrium equations using a Riemannian Newton iteration on the product manifold. Experiments on a three-servomotor, six-rod prototype validate the model by showing good agreement between simulation and measurements for both unloaded motions and externally loaded cases.
Latent Policy Steering through One-Step Flow Policies
Offline reinforcement learning (RL) allows robots to learn from offline datasets without risky exploration. Yet, offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside the dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within the dataset support during RL, but existing offline adaptations commonly approximate action values using latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out-of-the-box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent steering baselines.
comment: Project webpage: https://jellyho.github.io/LPS/
Iterative On-Policy Refinement of Hierarchical Diffusion Policies for Language-Conditioned Manipulation
Hierarchical policies for language-conditioned manipulation decompose tasks into subgoals, where a high-level planner guides a low-level controller. However, these hierarchical agents often fail because the planner generates subgoals without considering the actual limitations of the controller. Existing solutions attempt to bridge this gap via intermediate modules or shared representations, but they remain limited by their reliance on fixed offline datasets. We propose HD-ExpIt, a framework for iterative fine-tuning of hierarchical diffusion policies via environment feedback. HD-ExpIt organizes training into a self-reinforcing cycle: it utilizes diffusion-based planning to autonomously discover successful behaviors, which are then distilled back into the hierarchical policy. This loop enables both components to improve while implicitly grounding the planner in the controller's actual capabilities without requiring explicit proxy models. Empirically, HD-ExpIt significantly improves hierarchical policies trained solely on offline data, achieving state-of-the-art performance on the long-horizon CALVIN benchmark among methods trained from scratch.
From Code to Road: A Vehicle-in-the-Loop and Digital Twin-Based Framework for Central Car Server Testing in Autonomous Driving
Simulation is one of the most essential parts in the development stage of automotive software. However, purely virtual simulations often struggle to accurately capture all real-world factors due to limitations in modeling. To address this challenge, this work presents a test framework for automotive software on the centralized E/E architecture, which is a central car server in our case, based on Vehicle-in-the-Loop (ViL) and digital twin technology. The framework couples a physical test vehicle on a dynamometer test bench with its synchronized virtual counterpart in a simulation environment. Our approach provides a safe, reproducible, realistic, and cost-effective platform for validating autonomous driving algorithms with a centralized architecture. This test method eliminates the need to test individual physical ECUs and their communication protocols separately. In contrast to traditional ViL methods, the proposed framework runs the full autonomous driving software directly on the vehicle hardware after the simulation process, eliminating flashing and intermediate layers while enabling seamless virtual-physical integration and accurately reflecting centralized E/E behavior. In addition, incorporating mixed testing in both simulated and physical environments reduces the need for full hardware integration during the early stages of automotive development. Experimental case studies demonstrate the effectiveness of the framework in different test scenarios. These findings highlight the potential to reduce development and integration efforts for testing autonomous driving pipelines in the future.
comment: 8 pages; Accepted for publication at the 37th IEEE Intelligent Vehicles Symposium (IV), Detroit, MI, United States, June 22-25, 2026
Curve-Induced Dynamical Systems on Riemannian Manifolds and Lie Groups
Deploying robots in household environments requires safe, adaptable, and interpretable behaviors that respect the geometric structure of tasks. This structure is often represented on Lie groups and Riemannian manifolds, for example as poses on SE(3) or as symmetric positive definite matrices encoding stiffness or damping. In this context, dynamical system-based approaches offer a natural framework for generating such behavior, providing stability and convergence while remaining responsive to changes in the environment. We introduce Curve-induced Dynamical systems on Smooth Manifolds (CDSM), a real-time framework for constructing dynamical systems directly on Riemannian manifolds and Lie groups. The proposed approach constructs a nominal curve on the manifold and generates a dynamical system that combines a tangential component, which drives motion along the curve, with a normal component, which attracts the state toward the curve. We provide a stability analysis of the resulting dynamical system and validate the method quantitatively. On an S2 benchmark, CDSM demonstrates improved trajectory accuracy, reduced path deviation, and faster generation and query times compared to state-of-the-art methods. Finally, we demonstrate the practical applicability of the framework on both a robotic manipulator, where poses on SE(3) and damping matrices on SPD(n) are adapted online, and a mobile manipulator.
comment: Preprint, 14 pages, video linked in the paper, Saray Bakker and Martin Schonger contributed equally as first authors and are listed alphabetically
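The tangential-plus-normal construction described in the abstract above can be illustrated with a minimal sketch. It works in the flat plane rather than on a Riemannian manifold or Lie group, uses a polyline as the nominal curve, and is not the authors' implementation; the function name and the gains `k_tangent` and `k_normal` are hypothetical.

```python
import math

def curve_induced_velocity(x, curve, k_tangent=1.0, k_normal=2.0):
    """Velocity at planar state x for a polyline nominal curve.

    Assumes consecutive curve points are distinct. The result is the sum of
    a unit tangent of the nearest segment (motion along the curve) and a
    proportional pull toward the closest point on the curve (attraction).
    """
    best = None
    for (ax, ay), (bx, by) in zip(curve[:-1], curve[1:]):
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        # Parameter of the closest point on this segment, clamped to [0, 1].
        t = ((x[0] - ax) * dx + (x[1] - ay) * dy) / seg_len2
        t = max(0.0, min(1.0, t))
        px, py = ax + t * dx, ay + t * dy
        dist2 = (x[0] - px) ** 2 + (x[1] - py) ** 2
        if best is None or dist2 < best[0]:
            seg_len = math.sqrt(seg_len2)
            best = (dist2, (px, py), (dx / seg_len, dy / seg_len))
    _, (px, py), (tx, ty) = best
    return (k_tangent * tx + k_normal * (px - x[0]),
            k_tangent * ty + k_normal * (py - x[1]))

# A state one unit above a horizontal segment moves forward and downward.
velocity = curve_induced_velocity((0.5, 1.0), [(0.0, 0.0), (1.0, 0.0)])
```

On a manifold, the straight-line difference toward the closest point would be replaced by a tangent-space quantity such as a logarithmic map; that step is outside this sketch.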
Rethinking the Role of Collaborative Robots in Rehabilitation
Current research on collaborative robots (cobots) in physical rehabilitation largely focuses on repeated motion training for people undergoing physical therapy (PuPT), even though these sessions include phases that could benefit from robotic collaboration and assistance. Meanwhile, access to physical therapy remains limited for people with disabilities and chronic illnesses. Cobots could support both PuPT and therapists, and improve access to therapy, yet their broader potential remains underexplored. We propose extending the scope of cobots by imagining their role in assisting therapists and PuPT before, during, and after a therapy session. We discuss how cobot assistance may lift access barriers by promoting ability-based therapy design and helping therapists manage their time and effort. Finally, we highlight challenges to realizing these roles, including advancing user-state understanding, ensuring safety, and integrating cobots into therapists' workflow. This view opens new research questions and opportunities to draw from the HRI community's advances in assistive robotics.
comment: 5 pages, 1 figure
Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems
The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multi-modal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGB-D sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign-object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
comment: 10 pages, single column, 5 figures, preprint for Photomet Edumet 2026 (Klagenfurt, Austria)
Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation
Balancing high-level semantic reasoning with low-level reactive control remains a core challenge in visual robotic manipulation. While Vision-Language Models (VLMs) excel at cognitive planning, their inference latency precludes real-time execution. Conversely, fast Vision-Language-Action (VLA) models often lack the semantic depth required for complex, long-horizon tasks. To bridge this gap, we introduce Critic in the Loop, an adaptive hierarchical framework driven by dynamic VLM-Expert scheduling. At its core is a bionic Tri-System architecture comprising a VLM brain for global reasoning, a VLA cerebellum for reactive execution, and a lightweight visual Critic. By continuously monitoring the workspace, the Critic dynamically routes control authority. It sustains rapid closed-loop execution via the VLA for routine subtasks, and adaptively triggers the VLM for replanning upon detecting execution anomalies such as task stagnation or failures. Furthermore, our architecture seamlessly integrates human-inspired rules to intuitively break infinite retry loops. This visually-grounded scheduling minimizes expensive VLM queries, while substantially enhancing system robustness and autonomy in out-of-distribution (OOD) scenarios. Comprehensive experiments on challenging, long-horizon manipulation benchmarks reveal that our approach achieves state-of-the-art performance.
Lifelong Language-Conditioned Robotic Manipulation Learning
When traditional language-conditioned manipulation agents adapt sequentially to new manipulation skills, they suffer catastrophic forgetting of old skills, which limits practical deployment in dynamic scenes. In this paper, we propose SkillsCrafter, a novel robotic manipulation framework designed to continually learn multiple skills while reducing catastrophic forgetting of old skills. Specifically, we propose a Manipulation Skills Adaptation to retain the knowledge of old skills while inheriting the knowledge shared between new and old skills to facilitate learning of new skills. Meanwhile, we perform singular value decomposition on the diverse skill instructions to obtain common skill semantic subspace projection matrices, thereby recording the essential semantic space of skills. To achieve manipulation that forgets less and generalizes, we propose a Skills Specialization Aggregation to compute inter-skill similarity in skill semantic subspaces, aggregating previously learned skill knowledge for any new or unknown skill. Extensive experiments demonstrate the effectiveness and superiority of our proposed SkillsCrafter.
comment: 14 pages, 7 figures
Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA's vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.
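The Act / Think / Abstain routing described in the abstract above can be sketched as a thresholded novelty score. This is only an illustration under stated assumptions: a single k-nearest-neighbour distance stands in for the paper's ensemble of parametric and non-parametric estimators, and the function names and thresholds are hypothetical.

```python
import math

def knn_novelty(embedding, train_embeddings, k=3):
    """Mean Euclidean distance to the k nearest training embeddings."""
    dists = sorted(math.dist(embedding, e) for e in train_embeddings)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

def route(novelty, think_threshold=0.5, abstain_threshold=2.0):
    """Map a novelty score to one of the three execution modes."""
    if novelty >= abstain_threshold:
        return "abstain"  # far from anything seen in training: halt
    if novelty >= think_threshold:
        return "think"    # ambiguous: invoke slower reasoning
    return "act"          # familiar: execute immediately

# Hypothetical 2-D embeddings; real VLA embeddings are high-dimensional.
train = [(0.0, 0.0), (0.0, 0.2), (0.2, 0.0)]
mode = route(knn_novelty((10.0, 10.0), train))
```

In practice the thresholds would be calibrated on held-out data rather than fixed by hand.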
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but suffers performance degradation as observation horizons increase, limiting long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that compress long-horizon observations into a fixed-size representation while filtering irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and enables scalable horizon extension with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across both CNN and Transformer backbones, SeedPolicy achieves a 36.8% relative improvement over DP in clean settings and a 169% relative improvement in randomized challenging settings. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves competitive performance with one to two orders of magnitude fewer parameters, demonstrating strong efficiency and scalability. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: https://github.com/Youqiang-Gui/SeedPolicy.
comment: 16 pages, 13 figures
Decoupling Task and Behavior: A Two-Stage Reward Curriculum in Reinforcement Learning for Robotics
Deep Reinforcement Learning is a promising tool for robotic control, yet practical application is often hindered by the difficulty of designing effective reward functions. Real-world tasks typically require optimizing multiple objectives simultaneously, necessitating precise tuning of their weights to learn a policy with the desired characteristics. To address this, we propose a two-stage reward curriculum where we decouple task-specific objectives from behavioral terms. In our method, we first train the agent on a simplified task-only reward function to ensure effective exploration before introducing the full reward that includes auxiliary behavior-related terms such as energy efficiency. Further, we analyze various transition strategies and demonstrate that reusing samples between phases is critical for training stability. We validate our approach on the DeepMind Control Suite, ManiSkill3, and a mobile robot environment, modified to include auxiliary behavioral objectives. Our method proves to be simple yet effective, substantially outperforming baselines trained directly on the full reward while exhibiting higher robustness to specific reward weightings.
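The two-stage idea in the abstract above can be sketched as a reward that is task-only until a switch point and weighted afterwards. This is a minimal illustration, not the authors' code: the hard switch at `switch_step`, the cost names, and the weights are all assumptions, and the transition strategies and sample reuse analysed in the paper are not modelled.

```python
def curriculum_reward(task_reward, behavior_costs, step, switch_step, weights):
    """Return the task-only reward in stage one and the full reward in stage two.

    behavior_costs maps a term name (e.g. "energy") to a non-negative cost;
    weights maps the same names to their penalty weights.
    """
    if step < switch_step:
        return task_reward  # stage one: explore on the task objective alone
    penalty = sum(weights[name] * cost for name, cost in behavior_costs.items())
    return task_reward - penalty  # stage two: add behaviour-related terms

# Hypothetical values: the same transition scored in each stage.
weights = {"energy": 0.2}
early = curriculum_reward(1.0, {"energy": 0.5}, step=10, switch_step=100, weights=weights)
late = curriculum_reward(1.0, {"energy": 0.5}, step=100, switch_step=100, weights=weights)
```

Because the reward depends only on stored quantities and the current stage, transitions collected in stage one could be rescored with the stage-two reward, which is one way samples might be reused across phases.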
SPIRIT: Perceptive Shared Autonomy for Robust Robotic Manipulation under Deep Learning Uncertainty
Deep learning (DL) has enabled impressive advances in robotic perception, yet its limited robustness and lack of interpretability hinder reliable deployment in safety-critical applications. We propose a concept termed perceptive shared autonomy, in which uncertainty estimates from DL-based perception are used to regulate the level of autonomy. Specifically, when the robot's perception is confident, semi-autonomous manipulation is enabled to improve performance; when uncertainty increases, control transitions to haptic teleoperation to maintain robustness. In this way, high-performing but uninterpretable DL methods can be integrated safely into robotic systems. A key technical enabler is an uncertainty-aware, DL-based point cloud registration approach built on so-called Neural Tangent Kernels (NTK). We evaluate perceptive shared autonomy on challenging aerial manipulation tasks through a user study with 15 participants and the realization of mock-up industrial scenarios, demonstrating reliable robotic manipulation despite failures in DL-based perception. The resulting system, named SPIRIT, improves both manipulation performance and system reliability. SPIRIT was selected as a finalist of a major industrial innovation award.
comment: 19 pages, 14 figures
GaussTwin: Unified Simulation and Correction with Gaussian Splatting for Robotic Digital Twins ICRA 2026
Digital twins promise to enhance robotic manipulation by maintaining a consistent link between real-world perception and simulation. However, most existing systems struggle with the lack of a unified model, complex dynamic interactions, and the real-to-sim gap, which limits downstream applications such as model predictive control. Thus, we propose GaussTwin, a real-time digital twin that combines position-based dynamics with discrete Cosserat rod formulations for physically grounded simulation, and Gaussian splatting for efficient rendering and visual correction. By anchoring Gaussians to physical primitives and enforcing coherent SE(3) updates driven by photometric error and segmentation masks, GaussTwin achieves stable prediction-correction while preserving physical fidelity. Through experiments in both simulation and on a Franka Research 3 platform, we show that GaussTwin consistently improves tracking accuracy and robustness compared to shape-matching and rigid-only baselines, while also enabling downstream tasks such as push-based planning. These results highlight GaussTwin as a step toward unified, physically meaningful digital twins that can support closed-loop robotic interaction and learning.
comment: 8 pages, 4 figures, 3 tables, ICRA 2026
AIM-SLAM: Dense Monocular SLAM via Adaptive and Informative Multi-View Keyframe Prioritization with Foundation Model
Geometric foundation models have recently emerged as a promising alternative for addressing the challenge of dense reconstruction in monocular visual simultaneous localization and mapping (SLAM). Although geometric foundation models enable SLAM to leverage variable input views, previous methods remain confined to two-view pairs or fixed-length inputs and do not sufficiently consider geometric context in view selection. To tackle this problem, we propose AIM-SLAM, a dense monocular SLAM framework that exploits adaptive and informative multi-view keyframe prioritization with dense pointmap predictions from the visual geometry grounded transformer (VGGT). Specifically, we introduce the selective information- and geometric-aware multi-view adaptation (SIGMA) module, which employs voxel overlap and information gain to retrieve a candidate set of keyframes and adaptively determine its size. Furthermore, we formulate a joint multi-view Sim(3) optimization that enforces consistent alignment across selected views, substantially improving pose estimation accuracy. The effectiveness of AIM-SLAM is demonstrated on real-world datasets, where it achieves state-of-the-art performance in both pose estimation and dense reconstruction. Our system supports ROS integration, with code available at https://aimslam.github.io/.
comment: 8 pages
VinePT-Map: Pole-Trunk Semantic Mapping for Resilient Autonomous Robotics in Vineyards
Reliable long-term deployment of autonomous robots in agricultural environments remains challenging due to perceptual aliasing, seasonal variability, and the dynamic nature of crop canopies. Vineyards, characterized by repetitive row structures and significant visual changes across phenological stages, represent a pivotal field challenge, limiting the robustness of conventional feature-based localization and mapping approaches. This paper introduces VinePT-Map, a semantic mapping framework that leverages vine trunks and support poles as persistent structural landmarks to enable season-agnostic and resilient robot localization. The proposed method formulates the mapping problem as a factor graph, integrating GPS, IMU, and RGB-D observations through robust geometrical constraints that exploit vineyard structure. An efficient perception pipeline based on instance segmentation and tracking, combined with a clustering filter for outlier rejection and pose refinement, enables accurate landmark detection using low-cost sensors and onboard computation. To validate the pipeline, we present a multi-season dataset for trunk and pole segmentation and tracking. Extensive field experiments conducted across diverse seasons demonstrate the robustness and accuracy of the proposed approach, highlighting its suitability for long-term autonomous operation in agricultural environments.
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection CVPR 2026
Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches the feature space by integrating four spatial representations: focal length, ground depth, ground gradient, and Plücker coordinates. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
comment: Accepted to CVPR 2026 main track
Direct Contact-Tolerant Motion Planning With Vision Language Models
Navigation in cluttered environments often requires robots to tolerate contact with movable or deformable objects to maintain efficiency. Existing contact-tolerant motion planning (CTMP) methods rely on indirect spatial representations (e.g., prebuilt maps, obstacle sets), resulting in inaccuracies and a lack of adaptiveness to environmental uncertainties. To address this issue, we propose a direct contact-tolerant (DCT) planner, which integrates vision-language models (VLMs) into direct point perception and navigation and comprises two key components. The first is the VLM point cloud partitioner (VPP), which performs contact-tolerance reasoning in image space using a VLM, caches inference masks, propagates them across frames using odometry, and projects them onto the current scan to generate a contact-aware point cloud. The second is VPP-guided navigation (VGN), which formulates CTMP as a perception-to-control optimization problem under direct contact-aware point cloud constraints and solves it with a specialized deep neural network (DNN). We implement DCT in Isaac Sim and on a real car-like robot, demonstrating that DCT achieves robust and efficient navigation in cluttered environments with movable obstacles, outperforming representative baselines across diverse metrics. The code is available at: https://github.com/ChrisLeeUM/DCT.
Observer Design for Augmented Reality-based Teleoperation of Soft Robots
Although virtual and augmented reality are gaining traction as teleoperation tools for various types of robots, including manipulators and mobile robots, they are not being used for soft robots. The inherent difficulties of modelling soft robots mean that combining accurate and computationally efficient representations is very challenging. This paper presents an augmented reality interface for teleoperating these devices. The developed system consists of Microsoft HoloLens 2 glasses and a central computer responsible for calculations. Validation is performed on PETER, a highly modular pneumatic manipulator. Using data collected from sensors, the computer estimates the robot's position based on the physics of the virtual reality programme. Errors obtained are on the order of 5% of the robot's length, demonstrating that augmented reality facilitates operator interaction with soft manipulators and can be integrated into the control loop.
Person Detection and Tracking from an Overhead Crane LiDAR
This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and suitable public training data are scarce. Hence, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. These results contribute to bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations on GitHub to support further research.
comment: 8 pages, 7 figures, 4 tables. Submitted to Ubiquitous Robots (UR) 2026. Code: https://github.com/nilushacj/O-LiPeDeT-Overhead-LiDAR-Person-Detection-and-Tracking
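The distance-sliced evaluation mentioned in the abstract above can be illustrated with a short sketch that groups annotated box centers by horizontal distance from the sensor. It assumes the sensor sits at the origin with the z-axis vertical and uses cumulative slices ("within r metres"); these conventions and the function names are assumptions, not details confirmed by the paper.

```python
import math

def horizontal_distance(center):
    """Distance from the sensor axis, ignoring height (z)."""
    x, y, _z = center
    return math.hypot(x, y)

def slice_by_radius(centers, radii):
    """Map each radius to the box centers within that horizontal distance."""
    return {r: [c for c in centers if horizontal_distance(c) <= r] for r in radii}

# Hypothetical box centers in metres, in a sensor-centred frame.
centers = [(0.5, 0.0, -3.0), (3.0, 4.0, -3.0), (6.0, 0.0, -3.0)]
slices = slice_by_radius(centers, [1.0, 5.0])
```

A metric such as average precision would then be computed separately on the ground truth and detections that fall inside each slice.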
Integrated cooperative localization of heterogeneous measurement swarm: A unified data-driven method
The cooperative localization (CL) problem in heterogeneous robotic systems with different measurement capabilities is investigated in this work. In practice, heterogeneous sensors lead to directed and sparse measurement topologies, whereas most existing CL approaches rely on multilateral localization with restrictive multi-neighbor geometric requirements. To overcome this limitation, we enable pairwise relative localization (RL) between neighboring robots using only mutual measurement and odometry information. A unified data-driven adaptive RL estimator is first developed to handle heterogeneous and unidirectional measurements. Based on the convergent RL estimates, a distributed pose-coupling CL strategy is then designed, which guarantees CL under a weakly connected directed measurement topology, representing the least restrictive condition among existing results. The proposed method is independent of specific control tasks and is validated through a formation control application and real-world experiments.
U-OBCA: Uncertainty-Aware Optimization-Based Collision Avoidance via Wasserstein Distributionally Robust Chance Constraints
Uncertainties arising from localization errors, trajectory prediction errors of moving obstacles, and environmental disturbances pose significant challenges to safe robot navigation. Existing uncertainty-aware planners often approximate polygon-shaped robots and obstacles using simple geometric primitives such as circles or ellipses. Though computationally convenient, these approximations substantially shrink the feasible space, leading to overly conservative trajectories and even planning failure in narrow environments. In addition, many such methods rely on specific assumptions about noise distributions, which may not hold in practice and thus limit their performance guarantees. To address these limitations, we extend the Optimization-Based Collision Avoidance (OBCA) framework to an uncertainty-aware formulation, termed U-OBCA. The proposed method explicitly accounts for the collision risk between polygon-shaped robots and obstacles by formulating OBCA-based chance constraints, thereby avoiding geometric simplifications and reducing unnecessary conservatism. These probabilistic constraints are further tightened into deterministic nonlinear constraints under mild distributional assumptions, which can be solved efficiently by standard numerical optimization solvers. The proposed approach is validated through theoretical analysis, numerical simulations and real-world experiments. The results demonstrate that U-OBCA significantly mitigates the conservatism in trajectory planning and achieves higher navigation efficiency compared to existing baseline methods, particularly in narrow and cluttered environments.
Beyond the Patch: Exploring Vulnerabilities of Visuomotor Policies via Viewpoint-Consistent 3D Adversarial Object ICRA 2026
Neural network-based visuomotor policies enable robots to perform manipulation tasks but remain susceptible to perceptual attacks. For example, conventional 2D adversarial patches are effective under fixed-camera setups, where appearance is relatively consistent; however, their efficacy often diminishes under dynamic viewpoints from moving cameras, such as wrist-mounted setups, due to perspective distortions. To proactively investigate potential vulnerabilities beyond 2D patches, this work proposes a viewpoint-consistent adversarial texture optimization method for 3D objects through differentiable rendering. As optimization strategies, we employ Expectation over Transformation (EOT) with a Coarse-to-Fine (C2F) curriculum, exploiting distance-dependent frequency characteristics to induce textures effective across varying camera-object distances. We further integrate saliency-guided perturbations to redirect policy attention and design a targeted loss that persistently drives robots toward adversarial objects. Our comprehensive experiments show that the proposed method is effective under various environmental conditions, while confirming its black-box transferability and real-world applicability.
comment: 8 pages, 10 figures, Accepted to ICRA 2026. Project page: https://chan-mi-lee.github.io/3DAdvObj/
VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory
Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
Causally Robust Reward Learning from Reason-Augmented Preference Feedback ICLR
Preference-based reward learning is widely used for shaping agent behavior to match a user's preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., "avoids collisions", "completes the task faster") can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe
comment: Published in International Conference on Learning Representations (ICLR) 2026
Hyperbolic Multiview Pretraining for Robotic Manipulation CVPR 2026
3D-aware visual pretraining has proven effective in improving the performance of downstream robotic manipulation tasks. However, existing methods are constrained to Euclidean embedding spaces, whose flat geometry limits their ability to model structural relations among embeddings. As a result, they struggle to learn structured embeddings that are essential for robust spatial perception in robotic applications. To this end, we propose HyperMVP, a self-supervised framework for Hyperbolic MultiView Pretraining. Hyperbolic space offers geometric properties well suited for capturing structural relations. Methodologically, we extend the masked autoencoder paradigm and design a GeoLink encoder to learn multiview hyperbolic representations. The pretrained encoder is then finetuned with visuomotor policies on manipulation tasks. In addition, we introduce 3D-MOV, a large-scale dataset comprising multiple types of 3D point clouds to support pretraining. We evaluate HyperMVP on COLOSSEUM, RLBench, and real-world scenarios, where it consistently outperforms strong baselines across diverse tasks and perturbation settings. Our results highlight the potential of 3D-aware pretraining in a non-Euclidean space for learning robust and generalizable robotic manipulation policies.
comment: This paper was submitted to CVPR 2026 and was recommended for Findings, but the authors have withdrawn it and are currently adding more content to submit it elsewhere
Task-Relevant and Irrelevant Region-Aware Augmentation for Generalizable Vision-Based Imitation Learning in Agricultural Manipulation
Vision-based imitation learning has shown promise for robotic manipulation; however, its generalization remains limited in practical agricultural tasks. This limitation stems from scarce demonstration data and substantial visual domain gaps caused by i) crop-specific appearance diversity and ii) background variations. To address this limitation, we propose Dual-Region Augmentation for Imitation Learning (DRAIL), a region-aware augmentation framework designed for generalizable vision-based imitation learning in agricultural manipulation. DRAIL explicitly separates visual observations into task-relevant and task-irrelevant regions. The task-relevant region is augmented in a domain-knowledge-driven manner to preserve essential visual characteristics, while the task-irrelevant region is aggressively randomized to suppress spurious background correlations. By jointly handling both sources of visual variation, DRAIL promotes learning policies that rely on task-essential features rather than incidental visual cues. We evaluate DRAIL on diffusion policy-based visuomotor controllers through robot experiments on artificial vegetable harvesting and real lettuce defective leaf picking preparation tasks. The results show consistent improvements in success rates under unseen visual conditions compared to baseline methods. Further attention analysis and representation generalization metrics indicate that the learned policies rely more on task-essential visual features, resulting in enhanced robustness and generalization.
On the Strengths and Weaknesses of Data for Open-set Embodied Assistance
Embodied foundation models are increasingly performant in real-world domains such as robotics or autonomous driving. These models are often deployed in interactive or assistive settings, where it is important that these assistive models generalize to new users and new tasks. Diverse interactive data generation offers a promising avenue for providing data-efficient generalization capabilities for interactive embodied foundation models. In this paper, we investigate the generalization capabilities of a multimodal foundation model fine-tuned on diverse interactive assistance data in a synthetic domain. We explore generalization along two axes: a) assistance with unseen categories of user behavior and b) providing guidance in new configurations not encountered during training. We study a broad capability called Open-Set Corrective Assistance, in which the model needs to inspect lengthy user behavior and provide assistance through either corrective actions or language-based feedback. This task remains unsolved in prior work, which typically assumes closed corrective categories or relies on external planners, making it a challenging testbed for evaluating the limits of assistive data. To support this task, we generate synthetic assistive datasets in Overcooked and fine-tune a LLaMA-based model to evaluate generalization to novel tasks and user behaviors. Our approach provides key insights into the nature of assistive datasets required to enable open-set assistive intelligence. In particular, we show that performant models benefit from datasets that cover different aspects of assistance, including multimodal grounding, defect inference, and exposure to diverse scenarios.
Diffusion Policy through Conditional Proximal Policy Optimization
Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multi-modal behaviors, enabling more diverse and flexible action generation compared to the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge is the difficulty of computing action log-likelihood under the diffusion model. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process in the diffusion model, which can be memory- and computationally inefficient. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning the policy iteration with the diffusion process, which is a distinct paradigm compared to previous work. Moreover, our formulation can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.
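As a rough illustration of why a simple Gaussian probability is convenient in on-policy training, the sketch below evaluates a scalar Gaussian log-probability and plugs it into the standard PPO clipped surrogate. This is the generic PPO ingredient only, not the paper's alignment of policy iteration with the diffusion process; the function names and the scalar-action setting are assumptions made for illustration.

```python
import math

def gaussian_log_prob(action, mean, std):
    """Log-density of a scalar Gaussian N(mean, std^2) at `action`."""
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std)
            - 0.5 * math.log(2.0 * math.pi))

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

Because both quantities are closed-form, no backpropagation through a full denoising chain is needed to evaluate them in this simplified setting.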
Data-Driven Control of a Magnetically Actuated Fish-Like Robot
Magnetically actuated fish-like robots offer promising solutions for underwater exploration due to their miniaturization and agility; however, precise control remains a significant challenge because of nonlinear fluid dynamics, flexible fin hysteresis, and the variable-duration control steps inherent to the actuation mechanism. This paper proposes a comprehensive data-driven control framework to address these complexities without relying on analytical modeling. Our methodology comprises three core components: 1) developing a forward dynamics model (FDM) using a neural network trained on real-world experimental data to capture state transitions under varying time steps; 2) integrating this FDM into a gradient-based model predictive control (G-MPC) architecture to optimize control inputs for path following; and 3) applying imitation learning to approximate the G-MPC policy, thereby reducing the computational cost for real-time implementation. We validate the approach through simulations utilizing the identified dynamics model. The results demonstrate that the G-MPC framework achieves accurate path convergence with minimal root mean square error (RMSE), and the imitation learning controller (ILC) effectively replicates this performance. This study highlights the potential of data-driven control strategies for the precise navigation of miniature, fish-like soft robots.
comment: Author's version of the paper presented at AROB-ISBC 2026
LLM-Guided Decentralized Exploration with Self-Organizing Robot Teams
When individual robots have limited sensing capabilities or insufficient fault tolerance, it becomes necessary for multiple robots to form teams during exploration, thereby increasing the collective observation range and reliability. Traditionally, swarm formation has often been managed by a central controller; however, from the perspectives of robustness and flexibility, it is preferable for the swarm to operate autonomously even in the absence of centralized control. In addition, the determination of exploration targets for each team is crucial for efficient exploration in such multi-team exploration scenarios. This study therefore proposes an exploration method that combines (1) an algorithm for self-organization, enabling the autonomous and dynamic formation of multiple teams, and (2) an algorithm that allows each team to autonomously determine its next exploration target (destination). In particular, for (2), this study explores a novel strategy based on large language models (LLMs), while classical frontier-based methods and deep reinforcement learning approaches have been widely studied. The effectiveness of the proposed method was validated through simulations involving tens to hundreds of robots.
comment: Author's version of the paper presented at AROB-ISBC 2026
Adaptive Policy Switching of Two-Wheeled Differential Robots for Traversing over Diverse Terrains
Exploring lunar lava tubes requires robots to traverse without human intervention. Because pre-trained policies cannot fully cover all possible terrain conditions, our goal is to enable adaptive policy switching, where the robot selects an appropriate terrain-specialized model based on its current terrain features. This study investigates whether terrain types can be estimated effectively using posture-related observations collected during navigation. We fine-tuned a pre-trained policy using Proximal Policy Optimization (PPO), and then collected the robot's 3D orientation data as it moved across flat and rough terrain in a simulated lava-tube environment. Our analysis revealed that the standard deviation of the robot's pitch data shows a clear difference between these two terrain types. Using Gaussian mixture models (GMM), we evaluated terrain classification across various window sizes. An accuracy of more than 98% was achieved when using a 70-step window. The result suggests that short-term orientation data are sufficient for reliable terrain estimation, providing a foundation for adaptive policy switching.
comment: Author's version of the paper presented at AROB-ISBC 2026
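The windowed-statistic idea in the abstract above can be sketched as follows: compute the standard deviation of pitch over a sliding window, fit a two-component 1D Gaussian mixture to those values, and label each window by its most likely component. The function names, the pure-Python EM loop, and the variance floor are illustrative assumptions, not the authors' implementation.

```python
import math
import statistics

def rolling_std(values, window):
    """Population standard deviation over every full sliding window."""
    return [statistics.pstdev(values[i - window:i])
            for i in range(window, len(values) + 1)]

def fit_gmm_1d(xs, iters=50, var_floor=1e-6):
    """Fit a two-component 1D Gaussian mixture with plain EM."""
    means = [min(xs), max(xs)]                  # start at the extremes
    var0 = max(statistics.pvariance(xs), var_floor)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in xs:
            p = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance per component
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, var_floor)
            weights[k] = nk / len(xs)
    return weights, means, variances

def classify(x, weights, means, variances):
    """Index of the mixture component with the highest posterior for x."""
    scores = [math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
              for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))
```

In this sketch, the component with the smaller mean would correspond to flat terrain and the other to rough terrain; a policy switcher could then select the terrain-specialized model from the returned index.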
Designing and Validating a Self-Aligning Tool Changer for Modular Reconfigurable Manipulation Robots
Modular reconfigurable robots require reliable mechanisms for automated module exchange, but conventional rigid active couplings often fail due to inevitable positioning and orientational errors. To address this, we propose a misalignment-tolerant tool-changing system. The hardware features a motor-driven coupling utilizing passive self-alignment geometries, specifically chamfered receptacles and triangular lead-in guides, to robustly compensate for angular and lateral misalignments without complex force sensors. To make this autonomous exchange practically feasible, the mechanism is complemented by a compact rotating tool exchange station for efficient module storage. Real-world autonomous tool-picking experiments validate that the self-aligning features successfully absorb execution errors, enabling highly reliable robotic tool reconfiguration.
comment: 6 pages, 13 figures
Gait Generation Balancing Joint Load and Mobility for Legged Modular Robots with Easily Detachable Joints
While modular robots offer versatility, excessive joint torque during locomotion poses a significant risk of mechanical failure, especially for detachable joints. To address this, we propose an optimization framework using the NSGA-III algorithm. Unlike conventional approaches that prioritize mobility alone, our method derives Pareto optimal solutions to minimize joint load while maintaining necessary locomotion speed and stability. Simulations and physical experiments demonstrate that our approach successfully generates gait motions for diverse environments, such as slopes and steps, ensuring structural integrity without compromising overall mobility.
comment: 6 pages, 7 figures
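To make the notion of Pareto-optimal gaits concrete, here is a minimal sketch of the dominance test that underlies multi-objective methods such as NSGA-III. It covers only the non-dominated filter, not the reference-point-based selection of NSGA-III itself. The objective tuple (peak joint torque, negative speed), both minimised, is an assumed encoding for illustration.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly better in one.

    All objectives are assumed to be minimised.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Hypothetical gait candidates: (peak joint torque, negative forward speed)
candidates = [(10.0, -0.5), (8.0, -0.3), (12.0, -0.6), (11.0, -0.4), (9.0, -0.3)]
front = pareto_front(candidates)
```

Each point on the resulting front represents a different trade-off between joint load and mobility, from which a designer could pick according to the joint's rated torque.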
Design, Mapping, and Contact Anticipation with 3D-printed Whole-Body Tactile and Proximity Sensors ICRA
Robots operating in dynamic and shared environments benefit from anticipating contact before it occurs. We present GenTact-Prox, a fully 3D-printed artificial skin that integrates tactile and proximity sensing for contact detection and anticipation. The artificial skin platform is modular in design, procedurally generated to fit any robot morphology, and can cover the whole body of a robot. The skin achieved detection ranges of up to 18 cm during evaluation. To characterize how robots perceive nearby space through this skin, we introduce a data-driven framework for mapping the Perisensory Space -- the body-centric volume of space around the robot where sensors provide actionable information for contact anticipation. We demonstrate this approach on a Franka Research 3 robot equipped with five GenTact-Prox units, enabling online object-aware operation and contact prediction.
comment: This work was accepted at the International Conference on Robotics and Automation (ICRA) 2026
LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments
To assist humans in open-world environments, robots must interpret ambiguous instructions to locate desired objects. Foundation model-based approaches excel at multimodal grounding, but they lack a principled mechanism for modeling uncertainty in long-horizon tasks. In contrast, Partially Observable Markov Decision Processes (POMDPs) provide a systematic framework for planning under uncertainty but are often limited in supported modalities and rely on restrictive environment assumptions. We introduce LanguagE and Gesture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search. Unlike prior work, LEGS-POMDP explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location. In simulation, multimodal fusion significantly outperforms unimodal baselines, achieving an average success rate of 89% across challenging environments and object categories. Finally, we demonstrate the full system on a quadruped mobile manipulator, where real-world experiments qualitatively validate robust multimodal perception and uncertainty reduction under ambiguous instructions.
comment: 10 pages, 8 figures, accepted at ACM/IEEE International Conference on Human-Robot Interaction (HRI 2026)
Selecting Spots by Explicitly Predicting Intention from Motion History Improves Performance in Autonomous Parking
In many applications of social navigation, existing works have shown that predicting and reasoning about human intentions can help robotic agents make safer and more socially acceptable decisions. In this work, we study this problem for autonomous valet parking (AVP), where an autonomous vehicle ego agent must drop off its passengers, explore the parking lot, find a parking spot, negotiate for the spot with other vehicles, and park in the spot without human supervision. Specifically, we propose an AVP pipeline that selects parking spots by explicitly predicting where other agents are going to park from their motion history using learned models and probabilistic belief maps. To test this pipeline, we build a simulation environment with reactive agents and realistic modeling assumptions on the ego agent, such as occlusion-aware observations, and imperfect trajectory prediction. Simulation experiments show that our proposed method outperforms existing works that infer intentions from future predicted motion or embed them implicitly in end-to-end models, yielding better results in prediction accuracy, social acceptance, and task completion. Our key insight is that, in parking, where driving regulations are more lax, explicit intention prediction is crucial for reasoning about diverse and ambiguous long-term goals, which cannot be reliably inferred from short-term motion prediction alone, but can be effectively learned from motion history.
comment: 8 pages, 4 figures
EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
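The first stage described above, constraint-guided rollout selection, can be sketched as scoring each candidate by its total constraint violation and keeping the lowest-scoring one. The rollout representation (a list of 3D keypoint positions), the two example constraints, and all names below are hypothetical; in the paper the constraints come from a VLM, which this sketch does not model.

```python
import math

def above_table(rollout, table_z=0.0):
    """Violation cost: total depth by which the tracked point dips below the table."""
    return sum(max(0.0, table_z - z) for _, _, z in rollout)

def make_reach_target(target):
    """Build a constraint that penalises the final distance to `target`."""
    def reach_target(rollout):
        return math.dist(rollout[-1], target)
    return reach_target

def select_rollout(rollouts, constraints):
    """Return the rollout with the lowest summed constraint violation."""
    def total_violation(rollout):
        return sum(constraint(rollout) for constraint in constraints)
    return min(rollouts, key=total_violation)
```

The selected rollout would then serve as the initial guess for the second stage, trajectory optimization under the same constraint set.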
Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation
Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.
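A control-barrier-function safety filter of the kind mentioned above can be illustrated in one dimension. The sketch assumes single-integrator dynamics (velocity command `u`, position `x`) and a box workspace, with barriers `h_upper = x_max - x` and `h_lower = x - x_min`. Requiring `dh/dt + alpha * h >= 0` for both barriers reduces to clamping the nominal command. The dynamics, the gain `alpha`, and the function name are assumptions for illustration, not the paper's formulation.

```python
def cbf_filter(x, u_nom, x_min, x_max, alpha=2.0):
    """Minimally modify `u_nom` so that x stays within [x_min, x_max].

    For x' = u, the CBF conditions are
        u <= alpha * (x_max - x)   and   u >= -alpha * (x - x_min),
    and the closest admissible command is a simple clamp.
    """
    upper = alpha * (x_max - x)
    lower = -alpha * (x - x_min)
    return min(max(u_nom, lower), upper)

def simulate(x0, u_nom, x_min, x_max, dt=0.01, steps=1000):
    """Roll out the filtered command with forward Euler; return the visited states."""
    xs = [x0]
    for _ in range(steps):
        u = cbf_filter(xs[-1], u_nom, x_min, x_max)
        xs.append(xs[-1] + dt * u)
    return xs
```

Far from the boundary the policy's command passes through unchanged; near the boundary the admissible velocity shrinks in proportion to the remaining margin, so the state approaches the limit without crossing it.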
Vision-Language System using Open-Source LLMs for Gestures in Medical Interpreter Robots
Effective communication is vital in healthcare, especially across language barriers, where non-verbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system utilizes a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1-Score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.
Environment-Aware Path Generation for Robotic Additive Manufacturing of Structures
Robotic Additive Manufacturing (AM) has emerged as a scalable and customizable construction method in the last decade. However, current AM design methods rely on a pre-conceived (a priori) toolpath of the structure, often developed via offline slicing software. Moreover, considering dynamic construction environments involving obstacles in terrestrial and extraterrestrial settings, there is a need for online path generation methods. Here, an environment-aware path generation framework (PGF) is proposed for the first time, in which structures are designed in an online fashion by utilizing four path planning (PP) algorithms (two search-based and two sampling-based). To evaluate the performance of the proposed PGF in different obstacle arrangements (periodic, random) for two types of structures (closed and open), structural (path roughness, turns, offset, Root Mean Square Error (RMSE), deviation) and computational (run time) performance metrics are developed. The most challenging environments (i.e., dense with a high number of obstacles) are considered to saturate the feasibility limits of the PP algorithms. The capability of each of the four path planners used in the PGF to find a feasible path is assessed. Finally, the effectiveness of the proposed structural performance metrics is evaluated individually and comparatively, and the most essential metrics for evaluating the toolpaths of the resulting structures are prescribed. Consequently, the most promising path planners in challenging environments are identified for robotic additive manufacturing applications.
Introducing the transitional autonomous vehicle lane-changing dataset: Empirical Experiments
Transitional autonomous vehicles (tAVs), which operate beyond SAE Level 1-2 automation but short of full autonomy, are increasingly sharing the road with human-driven vehicles (HDVs). As these systems interact during complex maneuvers such as lane changes, new patterns may emerge with implications for traffic stability and safety. Assessing these dynamics, particularly during mandatory lane changes, requires high-resolution trajectory data, yet datasets capturing tAV lane-changing behavior are scarce. This study introduces the North Carolina Transitional Autonomous Vehicle Lane-Changing (NC-tALC) Dataset, a high-fidelity trajectory dataset designed to characterize tAV interactions during lane-changing maneuvers. The dataset includes two controlled experimental series. In the first, tAV lane-changing experiments, a tAV executes lane changes in the presence of adaptive cruise control (ACC) equipped target vehicles, enabling analysis of lane-changing execution. In the second, tAV responding experiments, two tAVs act as followers and respond to cut-in maneuvers initiated by another tAV, enabling analysis of follower response dynamics. The dataset contains 152 trials (72 lane-changing and 80 responding trials) sampled at 20 Hz with centimeter-level RTK-GPS accuracy. The NC-tALC dataset provides a rigorous empirical foundation for evaluating tAV decision-making and interaction dynamics in controlled mandatory lane-changing scenarios.
Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding
Contact-Grounded Policy (CGP) enables fine-grained, contact-rich dexterous manipulation by grounding multi-point contacts through predicting the actual robot state and tactile feedback, and by using a learned contact-consistency mapping to convert these predictions into controller-executable targets for a compliance controller. CGP supports both dense tactile arrays and vision-based tactile sensors mounted on the hand. We collect demonstrations via teleoperation in both simulation and on a physical robot, and evaluate CGP across multiple dexterous manipulation tasks.
TransMASK: Masked State Representation through Learned Transformation
Humans train robots to complete tasks in one environment, and expect robots to perform those same tasks in new environments. As humans, we know which aspects of the environment (i.e., the state) are relevant to the task. But there are also things that do not matter; e.g., the color of the table or the presence of clutter in the background. Ideally, the robot's policy learns to ignore these irrelevant state components. Achieving this invariance improves generalization: the robot knows not to factor irrelevant variables into its control decisions, making the policy more robust to environment changes. In this paper we therefore propose a self-supervised method to learn a mask which, when multiplied by the observed state, transforms that state into a latent representation that is biased towards relevant elements. Our method -- which we call TransMASK -- can be combined with a variety of imitation learning frameworks (such as diffusion policies) without any additional labels or alterations to the loss function. To achieve this, we recognize that the learned policy updates to better match the human's true policy. This true policy only depends on the relevant parts of the state; hence, as the gradients pass back through the learned policy and our proposed mask, they increase the value for elements that cause the robot to better imitate the human. We can therefore train TransMASK at the same time as we learn the policy. By normalizing the magnitude of each row in TransMASK, we force the mask to align with the Jacobian of the expert policy: columns that correspond to relevant states have large magnitudes, while columns for irrelevant states approach zero magnitude. We compare our approach to other methods that extract relevant states for downstream imitation learning. See our project website: https://collab.me.vt.edu/TransMASK/
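The row normalization and column-magnitude interpretation described above can be sketched with plain lists. The sketch assumes the mask is a matrix applied by matrix-vector multiplication and that "magnitude" means the L2 norm; both are assumptions, as are the function names. In the paper the mask is learned jointly with the policy, which this fragment does not show.

```python
import math

def normalize_rows(mask):
    """Scale each row of the mask to unit L2 norm (all-zero rows are left as-is)."""
    normalized = []
    for row in mask:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        normalized.append([v / norm for v in row])
    return normalized

def apply_mask(mask, state):
    """Latent representation z = M s (matrix-vector product)."""
    return [sum(m * s for m, s in zip(row, state)) for row in mask]

def column_magnitudes(mask):
    """L2 norm of each column; near-zero columns mark state elements the mask ignores."""
    return [math.sqrt(sum(row[j] ** 2 for row in mask))
            for j in range(len(mask[0]))]
```

With a mask whose third column is zero, changing the third state element (say, table color) leaves the latent unchanged, which is the invariance the abstract aims for.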
Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search
Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods rely either on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or on large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.
RFM-HRI : A Multimodal Dataset of Medical Robot Failure, User Reaction and Recovery Preferences for Item Retrieval Tasks
While robots deployed in real-world environments inevitably experience interaction failures, understanding how users respond through verbal and non-verbal behaviors remains under-explored in human-robot interaction (HRI). This gap is particularly significant in healthcare-inspired settings, where interaction failures can directly affect task performance and user trust. We present the Robot Failures in Medical HRI (RFM-HRI) Dataset, a multimodal dataset capturing dyadic interactions between humans and robots embodied in crash carts, where communication failures are systematically induced during item retrieval tasks. Through Wizard-of-Oz studies with 41 participants across laboratory and hospital settings, we recorded responses to four failure types (speech, timing, comprehension, and search) derived from three years of crash-cart robot interaction data. The dataset contains 214 interaction samples including facial action units, head pose, speech transcripts, and post-interaction self-reports. Our analysis shows that failures significantly degrade affective valence and reduce perceived control compared to successful interactions. Failures are strongly associated with confusion, annoyance, and frustration, while successful interactions are characterized by surprise, relief, and confidence in task completion. Emotional responses also evolve across repeated failures, with confusion decreasing and frustration increasing over time. This work contributes (1) a publicly available multimodal dataset (RFM-HRI), (2) analysis of user responses to different failure types and preferred recovery strategies, and (3) a crash-cart retrieval scenario enabling systematic comparison of recovery strategies with implications for safety-critical failure recovery. Our findings provide a foundation for failure detection and recovery methods in embodied HRI.
Control Lyapunov Functions for Underactuated Soft Robots
Soft and soft-rigid hybrid robots are inherently underactuated and operate under tight actuator limits, making task-space control with stability guarantees challenging. Common nonlinear strategies for soft robots (e.g., those based on PD control) often rely on the assumption of full actuation with no actuator limits. This paper presents a general control framework for task-space regulation and tracking of underactuated soft robots under bounded inputs. The method enforces a rapidly exponentially stabilizing control Lyapunov function as a convex inequality constraint while simultaneously satisfying underactuated full-body dynamics and actuator bounds. We validate the approach in simulation on several platforms spanning increasing underactuation: a simple two-link tendon-driven "finger", a trimmed helicoid manipulator, and a highly underactuated spiral robot. We compare against a number of baseline methods from the literature. Results show improved task-space accuracy and consistent Lyapunov convergence under input limits, achieving superior set-point and trajectory-tracking performance.
comment: 8 pages, 5 figures, 2 tables. Submitted for publication to a conference
RACAS: Controlling Diverse Robots With a Single Agentic System
Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.
comment: 7 pages in main text + 1 page of appendices + 1 page of references, 5 figures in main text + 1 figure in appendices, 2 tables in main text
From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications
Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.
comment: 21 pages, 4 figures, 9 tables. arXiv admin note: text overlap with arXiv:2408.00117
Task Parameter Extrapolation via Learning Inverse Tasks from Forward Demonstrations
Generalizing skill policies to novel conditions remains a key challenge in robot learning. Imitation learning methods, while data-efficient, are largely confined to the training region and consistently fail on input data outside it, leading to unpredictable policy failures. Alternatively, transfer learning approaches offer methods for trajectory generation robust to both changes in environment or tasks, but they remain data-hungry and lack accuracy in zero-shot generalization. We address these challenges by framing the problem in the context of task inversion learning and proposing a novel joint learning approach to achieve accurate and efficient knowledge transfer. Our method constructs a common representation of the forward and inverse tasks, and leverages auxiliary forward demonstrations from novel configurations to successfully execute the corresponding inverse tasks, without any direct supervision. We show the extrapolation capabilities of our framework via ablation studies and experiments in simulated and real-world environments that require complex manipulation skills with a diverse set of objects and tools, where we outperform diffusion-based alternatives.
PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions
This paper presents PRISM: an instruction-conditioned refinement method for imitation policies in robotic manipulation. This approach bridges Imitation Learning (IL) and Reinforcement Learning (RL) frameworks into a seamless pipeline, such that an imitation policy on a broad generic task, generated from a set of user-guided demonstrations, can be refined through reinforcement to generate new unseen fine-grained behaviours. The refinement process follows the Eureka paradigm, where reward functions for RL are iteratively generated from an initial natural-language task description. The presented approach builds on this mechanism to adapt a refined IL policy of a generic task to new goal configurations and newly introduced constraints, additionally incorporating human feedback corrections on intermediate rollouts, which enables policy reusability and therefore data efficiency. Results for a pick-and-place task in a simulated scenario show that the proposed method outperforms policies without human feedback, improving robustness on deployment and reducing computational burden.
comment: 10 pages, 3 figures, Accepted for publication at European Robotics Forum 2026
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time roll-outs. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: 8 pages
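To make the notion of a closed-form safety filter concrete, the sketch below shows the textbook closed-form solution of a single-constraint CBF quadratic program for a single-integrator system with a circular obstacle. The dynamics, the barrier function h(x) = ||x - c||^2 - r^2, the gain `alpha`, and all names are assumptions for illustration; this is not the expression derived in the CBF-RL paper.

```python
import numpy as np

def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that grad_h . u + alpha * h >= 0.

    Assumes single-integrator dynamics x_dot = u and the barrier
    h(x) = ||x - center||^2 - radius^2 (h >= 0 means safe).
    """
    diff = x - center
    h = diff @ diff - radius ** 2
    grad_h = 2.0 * diff
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom  # nominal action already satisfies the CBF condition
    # Project u_nom onto the boundary of the half-space of safe actions.
    return u_nom - slack / (grad_h @ grad_h) * grad_h

x = np.array([2.0, 0.0])
center = np.array([0.0, 0.0])
u_toward = cbf_filter(x, np.array([-5.0, 0.0]), center, radius=1.0)
u_away = cbf_filter(x, np.array([1.0, 0.0]), center, radius=1.0)
```

An action heading straight at the obstacle is slowed just enough to satisfy the barrier condition, while an action moving away is left unchanged.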
SpikeATac: A Multimodal Tactile Finger with Taxelized Dynamic Sensing for Dexterous Manipulation ICRA 2026
In this work, we introduce SpikeATac, a multimodal tactile finger combining a taxelized and highly sensitive dynamic response (PVDF) with a static transduction method (capacitive) for multimodal touch sensing. Named for its `spiky' response, SpikeATac's 16-taxel PVDF film sampled at 4 kHz provides fast, sensitive dynamic signals to the very onset and breaking of contact. We characterize the sensitivity of the different modalities, and show that SpikeATac provides the ability to stop quickly and delicately when grasping fragile, deformable objects. Beyond parallel grasping, we show that SpikeATac can be used in a learning-based framework to achieve new capabilities on a dexterous multifingered robot hand. We use a learning recipe that combines reinforcement learning from human feedback with tactile-based rewards to fine-tune the behavior of a policy to modulate force. Our hardware platform and learning pipeline together enable a difficult dexterous and contact-rich task that has not previously been achieved: in-hand manipulation of fragile objects. Videos are available at https://roamlab.github.io/spikeatac/ .
comment: 8 pages, 8 figures, ICRA 2026
Quadrotor Navigation using Reinforcement Learning with Privileged Information
This paper presents a reinforcement learning-based quadrotor navigation method that leverages efficient differentiable simulation, novel loss functions, and privileged information to navigate around large obstacles. Prior learning-based methods perform well in scenes that exhibit narrow obstacles, but struggle when the goal location is blocked by large walls or terrain. In contrast, the proposed method utilizes time-of-arrival (ToA) maps as privileged information and a yaw alignment loss to guide the robot around large obstacles. The policy is evaluated in photo-realistic simulation environments containing large obstacles, sharp corners, and dead-ends. Our approach achieves an 86% success rate and outperforms baseline strategies by 34%. We deploy the policy onboard a custom quadrotor in outdoor cluttered environments both during the day and night. The policy is validated across 20 flights, covering 589 meters without collisions at speeds up to 4 m/s.
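One simple way to picture a time-of-arrival map is as the shortest travel time from the goal to every free cell of an occupancy grid. The sketch below computes such a map with a breadth-first wavefront; the 4-connected grid, unit cell size, constant speed, and function name are assumptions for illustration and are not taken from the paper.

```python
from collections import deque

def toa_map(grid, goal, speed=1.0):
    """Breadth-first wavefront from the goal over free cells (grid value 0).

    Returns a grid of travel times; obstacle or unreachable cells stay at inf.
    """
    rows, cols = len(grid), len(grid[0])
    toa = [[float("inf")] * cols for _ in range(rows)]
    toa[goal[0]][goal[1]] = 0.0
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and toa[nr][nc] == float("inf")):
                toa[nr][nc] = toa[r][c] + 1.0 / speed
                queue.append((nr, nc))
    return toa

# A wall (1s) separates the left column from the goal in the top-right corner.
grid = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
toa = toa_map(grid, goal=(0, 2))
```

The top-left cell is only two cells from the goal in a straight line, yet its arrival time reflects the detour around the wall, which is the kind of signal that can guide a policy around large obstacles.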
Ask, Reason, Assist: Robot Collaboration via Natural Language and Temporal Logic
Increased robot deployment, such as in warehousing, has revealed a need for collaboration among heterogeneous robot teams to resolve unforeseen conflicts. To this end, we propose a peer-to-peer coordination protocol that enables robots to request and provide help without a central task allocator. The process begins when a robot detects a conflict and uses a Large Language Model (LLM) to decide whether external assistance is required. If so, it crafts and broadcasts a natural language (NL) help request. Potential helper robots reason over the request and respond with offers of assistance, including information about the effect on their ongoing tasks. Helper reasoning is implemented via an LLM grounded in Signal Temporal Logic (STL) using a Backus-Naur Form (BNF) grammar, ensuring syntactically valid NL-to-STL translations, which are then solved as a Mixed Integer Linear Program (MILP). Finally, the requester robot selects a helper by reasoning over the expected increase in system-level total task completion time. We evaluated our framework through experiments comparing different helper-selection strategies and found that considering multiple offers allows the requester to minimize added makespan. Our approach significantly outperforms heuristics such as selecting the nearest available candidate helper robot, and achieves performance comparable to a centralized "Oracle" baseline but without heavy information demands.
comment: arXiv admin note: substantial text overlap with arXiv:2505.13376
ROVER: Regulator-Driven Robust Temporal Verification of Black-Box Robot Policies
We present a novel, regulator-driven approach for the temporal verification of black-box autonomous robot policies, inspired by real-world certification processes where regulators often evaluate observable behavior without access to model internals. Central to our method is a regulator-in-the-loop approach that evaluates execution traces from black-box policies against temporal safety requirements. These requirements, expressed as prioritized Signal Temporal Logic (STL) specifications, characterize behavior changes over time and encode domain knowledge into the verification process. We use Total Robustness Value (TRV) and Largest Robustness Value (LRV) to quantify average performance and worst-case adherence, and introduce Average Violation Robustness Value (AVRV) to measure average specification violation. Together, these metrics guide targeted retraining and iterative model improvement. Our approach accommodates diverse temporal safety requirements (e.g., lane-keeping, delayed acceleration, and turn smoothness), capturing persistence, sequencing, and response across two distinct domains (virtual racing game and mobile robot navigation). Across six STL specifications in both scenarios, regulator-guided retraining increased satisfaction rates by an average of 43.8%, with consistent improvement in average performance (TRV) and reduced violation severity (LRV) in half of the specifications. Finally, real-world validation on a TurtleBot3 robot demonstrates a 27% improvement in smooth-navigation satisfaction, yielding smoother paths and stronger compliance with STL-defined temporal safety requirements.
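For readers unfamiliar with STL robustness, the sketch below evaluates the standard quantitative semantics of two simple temporal operators on a discrete trace. The signal, the thresholds, and the function names are hypothetical; the TRV, LRV, and AVRV metrics are defined in the paper and are not reproduced here.

```python
def always_below(signal, bound):
    """Robustness of G (x < bound): the worst-case margin over the trace."""
    return min(bound - x for x in signal)

def eventually_above(signal, bound):
    """Robustness of F (x > bound): the best-case margin over the trace."""
    return max(x - bound for x in signal)

# Hypothetical lane-offset trace (metres from the lane centre).
lane_offset = [0.2, 0.5, 0.9]
rho_keep_loose = always_below(lane_offset, 1.0)   # positive: satisfied
rho_keep_tight = always_below(lane_offset, 0.7)   # negative: violated
rho_reach = eventually_above(lane_offset, 0.8)    # positive: satisfied
```

A positive value means the trace satisfies the specification with that much margin, and a negative value measures how severely it is violated.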
LHM-Humanoid: Learning a Unified Policy for Long-Horizon Humanoid Whole-Body Loco-Manipulation in Diverse Messy Environments
We introduce LHM-Humanoid, a benchmark and learning framework for long-horizon whole-body humanoid loco-manipulation in diverse, cluttered scenes. In our setting, multiple objects are displaced from their intended locations and may obstruct navigation; a humanoid agent must repeatedly (i) walk to a target, (ii) pick it up with diverse whole-body postures under balance constraints, (iii) carry it while navigating around obstacles, and (iv) place it at a designated goal -- all within a single continuous episode and without any environment reset. This task simultaneously demands cross-scene generalization and unified one-policy control: layouts, obstacle arrangements, object category/mass/shape/color and object start/goal poses vary substantially even within a room category, requiring a single general policy that directly outputs actions rather than invoking pre-trained skill libraries. Our dataset spans four room types (bedroom, living room, kitchen, and warehouse), comprising 350 diverse scenes/tasks with 79 objects (25 movable targets). Since no scene-specific ground-truth motion sequences are provided, we learn goal-conditioned teacher policies via reinforcement learning and distill them into a single end-to-end student policy using DAgger. We further distill this unified policy into a vision-language-action (VLA) model driven by egocentric RGB observations and natural language. Experiments in Isaac Gym demonstrate that LHM-Humanoid substantially outperforms end-to-end RL baselines and prior humanoid loco-manipulation methods on both seen and unseen scenes, exhibiting strong long-horizon robustness and cross-scene generalization.
Conflict-Based Search as a Protocol: A Multi-Agent Motion Planning Protocol for Heterogeneous Agents, Solvers, and Independent Tasks ICRA 2026
Imagine the future construction site, hospital, or office with dozens of robots bought from different manufacturers. How can we enable these different robots to effectively move in a shared environment, given that each robot may have its own independent motion planning system? This work shows how we can get efficient collision-free movements between algorithmically heterogeneous agents by using Conflict-Based Search (Sharon et al. 2015) as a protocol. At its core, the CBS Protocol requires one specific single-agent motion planning API: finding a collision-free path that satisfies certain space-time constraints. Given such an API, CBS uses a central planner to find collision-free paths, independent of how the API is implemented. We demonstrate how this protocol enables multi-agent motion planning for a heterogeneous team of agents completing independent tasks with a variety of single-agent planners including: Heuristic Search (e.g., A*), Sampling Based Search (e.g., RRT), Optimization (e.g., Direct Collocation), Diffusion, and Reinforcement Learning.
comment: Published at ICRA 2026, Project webpage: https://rishi-v.github.io/CBS-Protocol/
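The single-agent API described above can be pictured as an interface with one method: plan a path that avoids given (cell, timestep) pairs. The sketch below is a hypothetical rendering of that idea with a breadth-first space-time planner on a grid; the interface name, the grid world, and the restriction to vertex constraints are assumptions, not the protocol's actual definition.

```python
from collections import deque
from typing import List, Optional, Protocol, Set, Tuple

Cell = Tuple[int, int]
Constraint = Tuple[Cell, int]  # (cell, timestep) the agent must not occupy

class SingleAgentPlanner(Protocol):
    """Hypothetical interface any single-agent planner would expose."""
    def plan(self, start: Cell, goal: Cell,
             constraints: Set[Constraint]) -> Optional[List[Cell]]:
        ...

class GridBFSPlanner:
    """Breadth-first search over (cell, time) on a 4-connected grid with waiting."""
    def __init__(self, width: int, height: int, horizon: int = 50):
        self.width, self.height, self.horizon = width, height, horizon

    def plan(self, start, goal, constraints):
        # The agent may only finish at the goal after the last constraint on it.
        goal_blocked_until = max((t for c, t in constraints if c == goal), default=-1)
        queue = deque([(start, 0, [start])])
        seen = {(start, 0)}
        while queue:
            cell, t, path = queue.popleft()
            if cell == goal and t > goal_blocked_until:
                return path
            if t >= self.horizon:
                continue
            x, y = cell
            for nxt in ((x, y), (x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                inside = 0 <= nxt[0] < self.width and 0 <= nxt[1] < self.height
                if not inside or (nxt, t + 1) in constraints or (nxt, t + 1) in seen:
                    continue
                seen.add((nxt, t + 1))
                queue.append((nxt, t + 1, path + [nxt]))
        return None

planner: SingleAgentPlanner = GridBFSPlanner(width=3, height=1)
free_path = planner.plan((0, 0), (2, 0), set())
constrained_path = planner.plan((0, 0), (2, 0), {((1, 0), 1)})
```

A central CBS-style planner would only call `plan` with different constraint sets, so a sampling-based or learned planner could be swapped in as long as it honors the same contract.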
Kinodynamic Task and Motion Planning using VLM-guided and Interleaved Sampling
Task and Motion Planning (TAMP) integrates high-level task planning with low-level motion feasibility, but existing methods are costly in long-horizon problems due to excessive motion sampling. While LLMs provide commonsense priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic feasibility. We propose a kinodynamic TAMP planner based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, enabling task and motion decisions to be made jointly. Kinodynamic constraints embedded in the TAMP problem are verified by an off-the-shelf motion planner and physics simulator, and a VLM guides the exploration of a TAMP solution and backtracks the search based on visual renderings of the states. Experiments in simulated domains and in the real world show 32.14% - 1166.67% increased average success rates compared to traditional and LLM-based TAMP planners and reduced planning time on complex problems, with ablations further highlighting the benefits of VLM backtracking. More details are available at https://graphics.ewha.ac.kr/kinodynamicTAMP/.
Runge-Kutta Approximations for Direct Coning Compensation Applying Lie Theory
The integration of gyroscope measurements is an essential task for most navigation systems. Modern vehicles typically use strapdown systems, such that gyro integration requires coning compensation to account for the sensor's rotation during the integration. Many coning compensation algorithms have been developed and a few are reviewed. This work introduces a new class of coning correction algorithm built directly from the classical Runge-Kutta integration routines. A simple case is shown to collapse to one of the most popular coning algorithms and a clear procedure for generating higher-order algorithms is presented.
comment: Accepted manuscript. AIAA JGCD
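For context on what a coning correction looks like, the sketch below implements the widely used two-sample coning algorithm, which adds a cross-product term to the sum of two gyro angle increments. The abstract does not say which popular algorithm its simple Runge-Kutta case reduces to, so this is shown only as a familiar reference point; the function name and sample values are assumptions.

```python
import numpy as np

def two_sample_coning(dtheta1, dtheta2):
    """Rotation vector over one update interval from two gyro angle increments.

    Classical two-sample form: sum of increments plus (2/3) of their cross product.
    """
    return dtheta1 + dtheta2 + (2.0 / 3.0) * np.cross(dtheta1, dtheta2)

# Increments about the same axis: no coning, so the correction term vanishes.
phi_parallel = two_sample_coning(np.array([0.01, 0.0, 0.0]),
                                 np.array([0.02, 0.0, 0.0]))

# Increments about different axes: the cross product adds a third-axis term.
phi_coning = two_sample_coning(np.array([0.01, 0.0, 0.0]),
                               np.array([0.0, 0.01, 0.0]))
```

The correction is second order in the increments, which is why it matters mainly when the rotation axis itself moves during the integration interval.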
Diffusion-Based Impedance Learning for Contact-Rich Manipulation Tasks
Learning-based methods excel at robot motion generation but remain limited in contact-rich physical interaction. Impedance control provides stable and safe contact behavior but requires task-specific tuning of stiffness and damping parameters. We present Diffusion-Based Impedance Learning, a framework that bridges these paradigms by combining generative modeling with energy-consistent impedance control. A Transformer-based Diffusion Model, conditioned via cross-attention on measured external wrenches, reconstructs simulated Zero-Force Trajectories (sZFTs) that represent contact-consistent equilibrium behavior. A SLERP-based quaternion noise scheduler preserves geometric consistency for rotations on the unit sphere. The reconstructed sZFT is used by an energy-based estimator to adapt impedance online through directional stiffness and damping modulation. Trained on parkour and robot-assisted therapy demonstrations collected via Apple Vision Pro teleoperation, the model achieves sub-millimeter positional and sub-degree rotational accuracy using only tens of thousands of samples. Deployed in real-time torque control on a KUKA LBR iiwa, the approach enables smooth obstacle traversal and generalizes to unseen tasks, achieving 100% success in multi-geometry peg-in-hole insertion.
comment: 15 pages, 12 figures
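The SLERP primitive mentioned above interpolates between two unit quaternions along the shortest arc of the unit sphere, so every intermediate result remains a valid rotation. The sketch below shows only that primitive; how the paper's noise scheduler uses it is not described in the abstract, and the (w, x, y, z) ordering and function name are assumptions.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:          # q and -q encode the same rotation; take the short arc
        q1, dot = -q1, -dot
    if dot > 0.9995:       # nearly identical: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

identity = np.array([1.0, 0.0, 0.0, 0.0])
quarter_turn_z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])  # 90 deg about z
halfway = slerp(identity, quarter_turn_z, 0.5)  # expected: 45 deg about z
```

Unlike adding Gaussian noise to quaternion components directly, interpolating this way never leaves the unit sphere, which is the geometric consistency the abstract refers to.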
Viewpoint Matters: Dynamically Optimizing Viewpoints with Masked Autoencoder for Visual Manipulation
Robotic manipulation continues to be a challenge, and imitation learning (IL) enables robots to learn tasks from expert demonstrations. Current IL methods typically rely on fixed camera setups, where cameras are manually positioned in static locations, imposing significant limitations on adaptability and coverage. Inspired by human active perception, where humans dynamically adjust their viewpoint to capture the most relevant and least noisy information, we propose MAE-Select, a novel framework for active viewpoint selection in single-camera robotic systems. MAE-Select fully leverages pre-trained multi-view masked autoencoder representations and dynamically selects the next most informative viewpoint at each time chunk without requiring labeled viewpoints. Extensive experiments demonstrate that MAE-Select improves the capabilities of single-camera systems and, in some cases, even surpasses multi-camera setups. The project will be available at https://mae-select.github.io.
PeRoI: A Pedestrian-Robot Interaction Dataset for Learning Avoidance, Neutrality, and Attraction Behaviors in Social Navigation
Robots are increasingly being deployed in public spaces such as shopping malls, sidewalks, and hospitals, where safe and socially aware navigation depends on anticipating how pedestrians respond to their presence. However, existing datasets rarely capture the full spectrum of robot-induced reactions, e.g., avoidance, neutrality, attraction, which limits progress in modeling these interactions. In this paper, we present the Pedestrian-Robot Interaction (PeRoI) dataset that captures pedestrian motions categorized into attraction, neutrality, and repulsion across two outdoor sites under three controlled conditions: no robot present, with stationary robot, and with moving robot. This design explicitly reveals how pedestrian behavior varies across robot contexts, and we provide qualitative and quantitative comparisons to established state-of-the-art datasets. Building on these data, we propose the Neural Robot Social Force Model (NeuRoSFM), an extension of the Social Force Model that integrates neural networks to augment inter-human dynamics with learned components and explicit robot-induced forces to better predict pedestrian motion in the vicinity of robots. We evaluate NeuRoSFM by generating trajectories on multiple real-world datasets. The results demonstrate improved modeling of pedestrian-robot interactions, leading to better prediction accuracy, and highlight the value of our dataset and method for advancing socially aware navigation strategies in human-centered environments.
Vision Language Model-based Testing of Industrial Autonomous Mobile Robots
PAL Robotics, in Spain, builds a variety of Autonomous Mobile Robots (AMRs), which are deployed in diverse environments (e.g., warehouses, retail spaces, and offices), where they work alongside humans. Given that human behavior can be unpredictable and that AMRs may not have been trained to handle all possible unknown and uncertain behaviors, it is important to test AMRs under a wide range of human interactions to ensure their safe behavior. Moreover, testing in real environments with actual AMRs and humans is often costly, impractical, and potentially hazardous (e.g., it could result in human injury). To this end, we propose a Vision Language Model (VLM)-based testing approach (RVSG) for industrial AMRs developed together with PAL Robotics. Based on the functional and safety requirements, RVSG uses the VLM to generate diverse human behaviors that violate these requirements. We evaluated RVSG with several requirements and navigation routes in a simulator using the latest AMR from PAL Robotics. Our results show that, compared with the baseline, RVSG can effectively generate requirement-violating scenarios. Moreover, RVSG-generated scenarios increase variability in robot behavior, thereby helping reveal their uncertain behaviors.
Least Restrictive Hyperplane Control Barrier Functions
Control Barrier Functions (CBFs) can provide provable safety guarantees for dynamic systems. However, finding a valid CBF for a system of interest is often non-trivial, especially for systems having low computational resources, higher-order dynamics, and moving close to obstacles of complex shape. A common solution to this problem is to use a purely distance-based CBF. In this paper, we study Hyperplane CBFs (H-CBFs), where a hyperplane separates the agent from the obstacle. First, we note that the common distance-based CBF is a special case of an H-CBF where the hyperplane is a supporting hyperplane of the obstacle that is orthogonal to a line between the agent and the obstacle. Then we show that a less conservative CBF can be found by optimising over the orientation of the supporting hyperplane, in order to find the Least Restrictive Hyperplane CBF. This enables us to maintain the safety guarantees while allowing controls that are closer to the desired ones, especially when moving fast and passing close to obstacles. We illustrate the approach on a double integrator dynamical system with acceleration constraints, moving through a group of arbitrarily shaped static and moving obstacles.
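The observation that a distance-based CBF is a special case of a hyperplane CBF can be checked numerically for a circular obstacle. The sketch below assumes a barrier of the form h(x) = n . (x - p), with unit normal n and a point p on the hyperplane; the circular obstacle and all names are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def hyperplane_cbf(x, point_on_plane, normal):
    """Signed distance from x to the hyperplane; positive on the agent's side."""
    n = normal / np.linalg.norm(normal)
    return n @ (x - point_on_plane)

def distance_cbf_as_hyperplane(x, center, radius):
    """Supporting hyperplane orthogonal to the agent-obstacle line."""
    n = (x - center) / np.linalg.norm(x - center)
    p = center + radius * n  # point where the hyperplane touches the circle
    return hyperplane_cbf(x, p, n)

x = np.array([3.0, 4.0])
center = np.array([0.0, 0.0])
h_hyperplane = distance_cbf_as_hyperplane(x, center, radius=1.0)
h_distance = np.linalg.norm(x - center) - 1.0  # ordinary distance-based barrier
```

Rotating the normal `n` while keeping the hyperplane tangent to the obstacle yields the family of supporting hyperplanes over which the abstract proposes to optimize.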
Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy ICRA 2026
Recently, active vision has reemerged as an important concept for manipulation, since visual occlusion occurs more frequently when main cameras are mounted on robot heads. We reflect on the visual occlusion issue and identify its essence as the absence of information useful for task completion. Inspired by this, we formulate the more fundamental problem of Exploratory and Focused Manipulation (EFM). The proposed problem is about actively collecting information to complete challenging manipulation tasks that require exploration or focus. As an initial attempt to address this problem, we establish the EFM-10 benchmark that consists of 4 categories of tasks that align with our definition (10 tasks in total). We further propose a Bimanual Active Perception (BAP) strategy, which leverages one arm to provide active vision and another arm to provide force sensing while manipulating. Based on this idea, we collect a dataset named BAPData for the tasks in EFM-10. With the dataset, we successfully verify the effectiveness of the BAP strategy in an imitation learning manner. We hope that the EFM-10 benchmark along with the BAP strategy can become a cornerstone that facilitates future research in this direction. Project website: EFManipulation.github.io.
comment: ICRA 2026
EmboTeam: Grounding LLM Reasoning into Reactive Behavior Trees via PDDL for Embodied Multi-Robot Collaboration
In embodied artificial intelligence, enabling heterogeneous robot teams to execute long-horizon tasks from high-level instructions remains a critical challenge. While large language models (LLMs) show promise in instruction parsing and preliminary planning, they exhibit limitations in long-term reasoning and dynamic multi-robot coordination. We propose EmboTeam, a novel embodied multi-robot task planning framework that addresses these issues through a three-stage cascaded architecture: 1) It leverages an LLM to parse instructions and generate Planning Domain Definition Language (PDDL) problem descriptions, thereby transforming commands into formal planning problems; 2) It combines the semantic reasoning of LLMs with the search capabilities of a classical planner to produce optimized action sequences; 3) It compiles the resulting plan into behavior trees for reactive control. The framework supports dynamically sized heterogeneous robot teams via a shared blackboard mechanism for communication and state synchronization. To validate our approach, we introduce the MACE-THOR benchmark dataset, comprising 42 complex tasks across 8 distinct household layouts. Experiments show EmboTeam improves the task success rate from 12% to 55% and goal condition recall from 32% to 72% over the LaMMA-P baseline.
3D Dynamics-Aware Manipulation: Endowing Manipulation Policies with 3D Foresight ICRA 2026
The incorporation of world modeling into manipulation policy learning has pushed the boundary of manipulation performance. However, existing efforts simply model the 2D visual dynamics, which is insufficient for robust manipulation when target tasks involve prominent depth-wise movement. To address this, we present a 3D dynamics-aware manipulation framework that seamlessly integrates 3D world modeling and policy learning. Three self-supervised learning tasks (current depth estimation, future RGB-D prediction, 3D flow prediction) are introduced within our framework, which complement each other and endow the policy model with 3D foresight. Extensive experiments in simulation and the real world show that 3D foresight can greatly boost the performance of manipulation policies without sacrificing inference speed. Code is available at https://github.com/Stardust-hyx/3D-Foresight.
comment: ICRA 2026
Infinite-Dimensional Closed-Loop Inverse Kinematics for Soft Robots via Neural Operators
For fully actuated rigid robots, kinematic inversion is a purely geometric problem, efficiently solved by closed-loop inverse kinematics (CLIK) schemes that compute joint configurations to position the robot body in space. For underactuated soft robots, however, not all configurations are attainable through control action, making kinematic inversion extremely challenging. Extensions of CLIK address this by introducing end-to-end mappings from actuation to task space for the controller to operate on, but typically assume finite dimensions of the underlying virtual configuration space. In this work, we formulate CLIK in the infinite-dimensional domain to reason about the entire soft robot shape while solving tasks. We do this by composing an actuation-to-shape map with a shape-to-task map, deriving the differential end-to-end kinematics via an infinite-dimensional chain rule, and thereby obtaining a Jacobian-based CLIK algorithm. Since this actuation-to-shape mapping is rarely available in closed form, we propose to learn it using differentiable neural operator networks. We first present an analytical study on a constant-curvature segment, and then apply the neural version of the algorithm to a three-fiber soft robotic arm whose underlying model relies on morphoelasticity and active filament theory.
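As background for the Jacobian-based scheme above, the sketch below runs the classical finite-dimensional CLIK update, q_dot = J^+(q) K e, on a two-link planar rigid arm. This is the standard rigid-robot version only, not the infinite-dimensional or neural-operator variant the paper develops; link lengths, gain, step size, and names are assumptions.

```python
import numpy as np

def forward_kinematics(q, l1=1.0, l2=1.0):
    """End-effector position of a two-link planar arm."""
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q, l1=1.0, l2=1.0):
    """Partial derivatives of the end-effector position w.r.t. joint angles."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [l1 * c1 + l2 * c12, l2 * c12]])

def clik(q0, target, gain=5.0, dt=0.05, steps=200):
    """Integrate q_dot = pinv(J) @ (gain * error) with explicit Euler steps."""
    q = np.array(q0, dtype=float)
    for _ in range(steps):
        error = target - forward_kinematics(q)
        q_dot = np.linalg.pinv(jacobian(q)) @ (gain * error)
        q = q + dt * q_dot
    return q

target = np.array([1.0, 1.0])
q_final = clik([0.3, 0.5], target)
position_error = np.linalg.norm(forward_kinematics(q_final) - target)
```

In the soft-robot setting described in the abstract, the analytic `jacobian` would be replaced by the differential of the composed actuation-to-shape and shape-to-task maps.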
TEMPO-VINE: A Multi-Temporal Sensor Fusion Dataset for Localization and Mapping in Vineyards
In recent years, precision agriculture has been introducing groundbreaking innovations in the field, with a strong focus on automation. However, research studies in robotics and autonomous navigation often rely on controlled simulations or isolated field trials. The absence of a realistic common benchmark represents a significant limitation for the diffusion of robust autonomous systems under real complex agricultural conditions. Vineyards pose significant challenges due to their dynamic nature, and they are increasingly drawing attention from both academic and industrial stakeholders interested in automation. In this context, we introduce the TEMPO-VINE dataset, a large-scale multi-temporal dataset specifically designed for evaluating sensor fusion, simultaneous localization and mapping (SLAM), and place recognition techniques within operational vineyard environments. TEMPO-VINE is the first multi-modal public dataset that brings together data from heterogeneous LiDARs of different price levels, AHRS, RTK-GPS, and cameras in real trellis and pergola vineyards, with multiple rows exceeding 100 m in length. In this work, we address a critical gap in the landscape of agricultural datasets by providing researchers with a comprehensive data collection and ground truth trajectories in different seasons, vegetation growth stages, terrain and weather conditions. The sequence paths with multiple runs and revisits will foster the development of sensor fusion, localization, mapping and place recognition solutions for agricultural fields. The dataset, the processing tools and the benchmarking results are available on the webpage.
FreeTacMan: Robot-free Visuo-Tactile Data Collection System for Contact-rich Manipulation
Enabling robots with contact-rich manipulation remains a pivotal challenge in robot learning, which is substantially hindered by the data collection gap, including its inefficiency and limited sensor setup. While prior work has explored handheld paradigms, their rod-based mechanical structures remain rigid and unintuitive, providing limited tactile feedback and posing challenges for operators. Motivated by the dexterity and force feedback of human motion, we propose FreeTacMan, a human-centric and robot-free data collection system for accurate and efficient robot manipulation. Concretely, we design a wearable gripper with visuo-tactile sensors for data collection, which can be worn by human fingers for intuitive control. A high-precision optical tracking system is introduced to capture end-effector poses while synchronizing visual and tactile feedback simultaneously. We leverage FreeTacMan to collect a large-scale multimodal dataset, comprising over 3000k paired visuo-tactile images with end-effector poses and 10k demonstration trajectories across 50 diverse contact-rich manipulation tasks. FreeTacMan achieves multiple improvements in data collection performance over prior works and enables effective policy learning from self-collected datasets. By open-sourcing the hardware and the dataset, we aim to facilitate reproducibility and support research in visuo-tactile manipulation.
Responsibility and Engagement -- Evaluating Interactions in Social Robot Navigation ICRA
In Social Robot Navigation (SRN), the availability of meaningful metrics is crucial for evaluating trajectories from human-robot interactions. In the SRN context, such interactions often relate to resolving conflicts between two or more agents. Correspondingly, the shares to which agents contribute to the resolution of such conflicts are important. This paper builds on recent work, which proposed a Responsibility metric capturing such shares. We extend this framework in two directions: First, we model the conflict buildup phase by introducing a time normalization. Second, we propose the related Engagement metric, which captures how the agents' actions intensify a conflict. In a comprehensive series of simulated scenarios with dyadic, group and crowd interactions, we show that the metrics carry meaningful information about the cooperative resolution of conflicts in interactions. They can be used to assess behavior quality and foresightedness. We extensively discuss applicability, design choices and limitations of the proposed metrics.
comment: Accepted at the 2026 IEEE International Conference on Robotics & Automation (ICRA)
EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations
Reliable trajectory prediction from an ego-centric perspective is crucial for robotic navigation in human-centric environments. However, existing methods typically assume noiseless observation histories, failing to account for the perceptual artifacts inherent in first-person vision, such as occlusions, ID switches, and tracking drift. This discrepancy between training assumptions and deployment reality severely limits model robustness. To bridge this gap, we introduce EgoTraj-Bench, built upon the TBD dataset, which is the first real-world benchmark that aligns noisy, first-person visual histories with clean, bird's-eye-view future trajectories, enabling robust learning under realistic perceptual constraints. Building on this benchmark, we propose BiFlow, a dual-stream flow matching model that concurrently denoises historical observations and forecasts future motion. To better model agent intent, BiFlow incorporates our EgoAnchor mechanism, which conditions the prediction decoder on distilled historical features via feature modulation. Extensive experiments show that BiFlow achieves state-of-the-art performance, reducing minADE and minFDE by 10-15% on average and demonstrating superior robustness. We anticipate that our benchmark and model will provide a critical foundation for robust real-world ego-centric trajectory prediction. The benchmark library is available at: https://github.com/zoeyliu1999/EgoTraj-Bench.
Efficient Path Generation with Curvature Guarantees by Mollification
Path generation, the process of converting high-level mission specifications, such as sequences of waypoints from a path planner, into smooth, executable paths, is a fundamental challenge in mobile robotics. Most path following and trajectory tracking algorithms require the desired path to be defined by at least twice continuously differentiable functions to guarantee key properties such as global convergence, especially for nonholonomic robots like unicycles with speed constraints. Consequently, path generation methods must bridge the gap between convenient but non-differentiable planning outputs, such as piecewise linear segments, and the differentiability requirements imposed by downstream control algorithms. While techniques such as spline interpolation or optimization-based methods are commonly used to smooth non-differentiable paths or create feasible ones from sequences of waypoints, they either produce unnecessarily complex trajectories or are computationally expensive. In this work, we present a method to regularize non-differentiable functions and generate feasible paths through mollification. Specifically, we approximate an arbitrary path with a differentiable function that can converge to it with arbitrary precision. Additionally, we provide a systematic method for bounding the curvature of generated paths, which we demonstrate by applying it to paths resulting from linking a sequence of waypoints with segments. The proposed approach is analytically shown to be computationally more efficient than standard interpolation methods, enabling real-time implementation on microcontrollers, while remaining compatible with standard trajectory tracking and path following algorithms.
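The core idea of mollification named in this abstract can be illustrated on a scalar function: convolving a non-differentiable function with a smooth, compactly supported kernel gives a smooth approximation that tightens as the kernel width shrinks. The sketch below is an assumption-laden illustration, not the paper's method. It uses the standard bump kernel, a simple midpoint quadrature, and `abs` as a stand-in for a piecewise-linear path with a corner; the names `bump` and `mollify` are hypothetical.

```python
import math

def bump(u):
    """Unnormalised standard mollifier kernel, supported on (-1, 1)."""
    return math.exp(-1.0 / (1.0 - u * u)) if abs(u) < 1.0 else 0.0

def mollify(f, x, eps, n=2000):
    """Approximate the convolution of f with the eps-scaled kernel at x.

    Uses a midpoint rule on [-1, 1]; dividing by the summed weights
    normalises the kernel so that it integrates to one.
    """
    h = 2.0 / n
    num = 0.0
    den = 0.0
    for i in range(n):
        u = -1.0 + (i + 0.5) * h
        w = bump(u)
        num += w * f(x - eps * u)
        den += w
    return num / den

# The corner of |x| at 0 is rounded off; away from the corner the value is unchanged.
for eps in (0.1, 0.01):
    print(eps, mollify(abs, 0.0, eps), mollify(abs, 1.0, eps))
```

The smoothed value at the corner scales with `eps`, which shows in a small way how the approximation can be made arbitrarily precise by shrinking the kernel width.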
RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks ICLR 2026
Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration. To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning. RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments demonstrate that RoboPARA significantly outperforms existing planning methods, achieving higher efficiency and reliability, particularly in complex task combinations. Our code is publicly available at https://github.com/AiDuanshiying/RoboPARA.
comment: Accepted to ICLR 2026
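To make the notion of traversing a task DAG for two-arm parallelism concrete, here is a small sketch of greedy list scheduling. It is a generic textbook heuristic swapped in for illustration and should not be read as RoboPARA's planner. The task names, durations, and the function `schedule_two_arms` are all hypothetical.

```python
def schedule_two_arms(durations, deps):
    """Greedily assign DAG tasks to two arms.

    durations: {task: execution time}
    deps: {task: iterable of prerequisite tasks}
    Returns {task: (arm_index, start_time, end_time)}.
    """
    remaining = set(durations)
    finish = {}            # task -> completion time
    arm_free = [0.0, 0.0]  # time at which each arm becomes idle
    plan = {}
    while remaining:
        # A task is ready once all of its prerequisites have been scheduled.
        ready = [t for t in remaining
                 if all(d in finish for d in deps.get(t, ()))]
        if not ready:
            raise ValueError("dependency cycle detected")
        best = None
        for t in ready:
            earliest = max((finish[d] for d in deps.get(t, ())), default=0.0)
            for arm in (0, 1):
                cand = (max(earliest, arm_free[arm]), t, arm)
                if best is None or cand < best:
                    best = cand
        # Commit the (task, arm) pair that can start soonest.
        start, task, arm = best
        end = start + durations[task]
        plan[task] = (arm, start, end)
        finish[task] = end
        arm_free[arm] = end
        remaining.remove(task)
    return plan

# Hypothetical example: "c" needs both "a" and "b" to be finished first.
plan = schedule_two_arms({"a": 2.0, "b": 3.0, "c": 1.0}, {"c": {"a", "b"}})
for task in sorted(plan):
    print(task, plan[task])
```

In this toy case the two independent tasks run on different arms, so the makespan is 4.0 instead of the 6.0 a single arm would need.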
Risk-Aware Autonomous Driving with Linear Temporal Logic Specifications
Human drivers naturally balance the risks of different concerns while driving, including traffic rule violations, minor accidents, and fatalities. However, achieving the same behavior in autonomous driving systems remains an open problem. This paper extends a risk metric that has been verified in human-like driving studies to encompass more complex driving scenarios specified by linear temporal logic (LTL) that go beyond just collision risks. This extension incorporates the timing and severity of events into LTL specifications, thereby reflecting a human-like risk awareness. Without sacrificing expressivity for traffic rules, we adopt LTL specifications composed of safety and co-safety formulas, allowing the control synthesis problem to be reformulated as a reachability problem. By leveraging occupation measures, we further formulate a linear programming (LP) problem for this LTL-based risk metric. Consequently, the synthesized policy balances different types of driving risks, including both collision risks and traffic rule violations. The effectiveness of the proposed approach is validated in three typical traffic scenarios in the CARLA simulator.
MOSAIC: Modular Scalable Autonomy for Intelligent Coordination of Heterogeneous Robotic Teams
Mobile robots have become indispensable for exploring hostile environments, such as in space or disaster relief scenarios, but often remain limited to teleoperation by a human operator. This restricts the deployment scale and requires near-continuous low-latency communication between the operator and the robot. We present MOSAIC: a scalable autonomy framework for multi-robot scientific exploration using a unified mission abstraction based on Points of Interest (POIs) and multiple layers of autonomy, enabling supervision by a single operator. The framework dynamically allocates exploration and measurement tasks based on each robot's capabilities, leveraging team-level redundancy and specialization to enable continuous operation. We validated the framework in a space-analog field experiment emulating a lunar prospecting scenario, involving a heterogeneous team of five robots and a single operator. Despite the complete failure of one robot during the mission, the team completed 82.3% of assigned tasks at an Autonomy Ratio of 86%, while the operator workload remained at only 78.2%. These results demonstrate that the proposed framework enables robust, scalable multi-robot scientific exploration with limited operator intervention. We further derive practical lessons learned in robot interoperability, networking architecture, team composition, and operator workload management to inform future multi-robot exploration missions.
comment: This work has been submitted to the IEEE for possible publication
LAP: Fast LAtent Diffusion Planner for Autonomous Driving
Diffusion models have demonstrated strong capabilities for modeling human-like driving behaviors in autonomous driving, but their iterative sampling process induces substantial latency, and operating directly on raw trajectory points forces the model to spend capacity on low-level kinematics, rather than high-level multi-modal semantics. To address these limitations, we propose LAtent Planner (LAP), a framework that plans in a VAE-learned latent space that disentangles high-level intents from low-level kinematics, enabling our planner to capture rich, multi-modal driving strategies. To bridge the representational gap between the high-level semantic planning space and the vectorized scene context, we introduce an intermediate feature alignment mechanism that facilitates robust information fusion. Notably, LAP can produce high-quality plans in one single denoising step, substantially reducing computational overhead. Through extensive evaluations on the large-scale nuPlan benchmark, LAP achieves state-of-the-art closed-loop performance among learning-based planning methods, while demonstrating an inference speed-up of at most 10x over previous SOTA approaches.
Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
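One plausible reading of late fusion with log-likelihood ratios is sketched below. It is an assumption, not the authors' exact formulation. Each modality's per-class log-odds is used as a stand-in for its log-likelihood ratio (the two differ only by a constant offset when class priors are equal), the per-modality terms are summed, and the individual terms are kept so that each modality's contribution to the decision can be inspected. The modality names and probabilities are made up.

```python
import math

def llr(p, eps=1e-9):
    """Log-odds of a class probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fuse(posteriors):
    """Sum per-modality log-odds for each class.

    posteriors: {modality: [p(class_k | modality input) for each class k]}
    Returns (predicted class index, fused scores, per-modality contributions).
    """
    n_classes = len(next(iter(posteriors.values())))
    contrib = {m: [llr(p) for p in ps] for m, ps in posteriors.items()}
    fused = [sum(contrib[m][k] for m in contrib) for k in range(n_classes)]
    best = max(range(n_classes), key=fused.__getitem__)
    return best, fused, contrib

# Hypothetical classifier outputs for three gestures from two modalities.
best, fused, contrib = fuse({
    "imu": [0.6, 0.3, 0.1],
    "capacitive": [0.2, 0.7, 0.1],
})
print(best, fused)
print(contrib)
```

Because the fused score is a plain sum, the `contrib` dictionary shows directly which modality pushed the decision towards or away from each class. This is the kind of interpretability the abstract refers to.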
Balancing Progress and Safety: A Novel Risk-Aware Objective for RL in Autonomous Driving
Reinforcement Learning (RL) is a promising approach for achieving autonomous driving due to robust decision-making capabilities. RL learns a driving policy through trial and error in traffic scenarios, guided by a reward function that combines the driving objectives. The design of such a reward function has received insufficient attention, yielding ill-defined rewards with various pitfalls. Safety, in particular, has long been regarded only as a penalty for collisions. This leaves the risks associated with actions leading up to a collision unaddressed, limiting the applicability of RL in real-world scenarios. To address these shortcomings, our work focuses on enhancing the reward formulation by defining a set of driving objectives and structuring them hierarchically. Furthermore, we discuss the formulation of these objectives in a normalized manner to transparently determine their contribution to the overall reward. Additionally, we introduce a novel risk-aware objective for various driving interactions based on a two-dimensional ellipsoid function and an extension of Responsibility-Sensitive Safety (RSS) concepts. We evaluate the efficacy of our proposed reward in unsignalized intersection scenarios with varying traffic densities. The approach decreases collision rates by 21% on average compared to baseline rewards and consistently surpasses them in route progress and cumulative reward, demonstrating its capability to promote safer driving behaviors while maintaining high-performance levels.
comment: Accepted in the 36th IEEE Intelligent Vehicles Symposium (IV 2025)
Automatic Curriculum Learning for Driving Scenarios: Towards Robust and Efficient Reinforcement Learning
This paper addresses the challenges of training end-to-end autonomous driving agents using Reinforcement Learning (RL). RL agents are typically trained in a fixed set of scenarios and nominal behavior of surrounding road users in simulations, limiting their generalization and real-life deployment. While domain randomization offers a potential solution by randomly sampling driving scenarios, it frequently results in inefficient training and sub-optimal policies due to the high variance among training scenarios. To address these limitations, we propose an automatic curriculum learning (ACL) framework that dynamically generates driving scenarios with adaptive complexity based on the agent's evolving capabilities. Unlike manually designed curricula that introduce expert bias and lack scalability, our framework incorporates a "teacher" that automatically generates and mutates driving scenarios based on their learning potential -- an agent-centric metric derived from the agent's current policy -- eliminating the need for expert design. The framework enhances training efficiency by excluding scenarios the agent has mastered or finds too challenging. We evaluate our framework in a reinforcement learning setting where the agent learns a driving policy from camera images. Comparative results against baseline methods, including fixed scenario training and domain randomization, demonstrate that our approach leads to enhanced generalization, achieving higher success rates: +9% in low traffic density, +21% in high traffic density, and faster convergence with fewer training steps. Our findings highlight the potential of ACL in improving the robustness and efficiency of RL-based autonomous driving agents.
comment: Accepted in the 36th IEEE Intelligent Vehicles Symposium (IV 2025)
MachaGrasp: Morphology-Aware Cross-Embodiment Dexterous Hand Articulation Generation for Grasping
Dexterous grasping with multi-fingered hands remains challenging due to high-dimensional articulations and the cost of optimization-based pipelines. Existing end-to-end methods require training on large-scale datasets for specific hands, limiting their ability to generalize across different embodiments. We propose MachaGrasp, an eigengrasp-based, end-to-end framework for cross-embodiment grasp generation. From a hand's morphology description, we derive a morphology embedding and an eigengrasp set. Conditioned on these, together with the object point cloud and wrist pose, an amplitude predictor regresses articulation coefficients in a low-dimensional space, which are decoded into full joint articulations. Articulation learning is supervised with a Kinematic-Aware Articulation Loss (KAL) that emphasizes fingertip-relevant motions and injects morphology-specific structure. In simulation on unseen objects across three dexterous hands, MachaGrasp attains a 91.9% average grasp success rate with less than 0.4 seconds inference per grasp. With few-shot adaptation to an unseen hand, it achieves 85.6% success on unseen objects in simulation, and real-world experiments on this few-shot-generalized hand achieve an 87% success rate. The code and additional materials are available on our project website https://connor-zh.github.io/MachaGrasp.
Distant Object Localisation from Noisy Image Segmentation Sequences
3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with specialised sensor configurations or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved with either multi-view triangulation or particle filters, with the latter also providing shape and uncertainty estimates. We studied the solutions using 3D simulation and drone-based image segmentation sequences with global navigation satellite system (GNSS) based camera pose estimates. The results suggest that combining the proposed methods with pre-existing image segmentation models and drone-carried computational resources yields a reliable system for drone-based wildfire monitoring. The proposed solutions are independent of the detection method, also enabling quick adaptation to similar tasks.
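As a rough illustration of the multi-view triangulation option mentioned in this abstract, the following sketch finds the point that is closest, in the least-squares sense, to a set of viewing rays (camera position plus bearing towards the detected object). It is a generic formulation under assumed noise-free inputs, not the paper's pipeline. The function names and the example rays are hypothetical.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def triangulate(origins, directions):
    """Least-squares point nearest to all rays.

    Each ray contributes the projector (I - u u^T) onto the plane normal to
    its unit direction u; the normal equations are
    sum(I - u u^T) p = sum((I - u u^T) o).
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(c * c for c in d))
        u = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                m = (1.0 if i == j else 0.0) - u[i] * u[j]
                A[i][j] += m
                b[i] += m * o[j]
    return solve3(A, b)

# Two hypothetical camera poses whose bearings cross at a single point.
print(triangulate([(0, 0, 0), (2, 0, 0)], [(1, 1, 0), (-1, 1, 0)]))
```

With at least two non-parallel rays the system is well posed. For distant objects the rays become nearly parallel and the estimate grows sensitive to pose noise, which is one reason an uncertainty-aware alternative such as a particle filter can be attractive.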
Collaborative Learning of Local 3D Occupancy Prediction and Versatile Global Occupancy Mapping ICRA 2026
Vision-based 3D semantic occupancy prediction is vital for autonomous driving, enabling unified modeling of static infrastructure and dynamic agents. Global occupancy maps serve as long-term memory priors, providing valuable historical context that enhances local perception. This is particularly important in challenging scenarios such as occlusion or poor illumination, where current and nearby observations may be unreliable or incomplete. Priors aggregated from previous traversals under better conditions help fill gaps and enhance the robustness of local 3D occupancy prediction. In this paper, we propose Long-term Memory Prior Occupancy (LMPOcc), a plug-and-play framework that incorporates global occupancy priors to boost local prediction and simultaneously updates global maps with new observations. To realize the information gain from global priors, we design an efficient and lightweight Current-Prior Fusion module that adaptively integrates prior and current features. Meanwhile, we introduce a model-agnostic prior format to enable continual updating of global occupancy and ensure compatibility across diverse prediction baselines. LMPOcc achieves state-of-the-art local occupancy prediction performance validated on the Occ3D-nuScenes benchmark, especially on static semantic categories. Furthermore, we verify LMPOcc's capability to build large-scale global occupancy maps through multi-vehicle crowdsourcing, and utilize occupancy-derived dense depth to support the construction of 3D open-vocabulary maps. Our method opens up a new paradigm for continuous global information updating and storage, paving the way towards more comprehensive and scalable scene understanding in large outdoor environments.
comment: Accepted by ICRA 2026
GUIDE: A Diffusion-Based Autonomous Robot Exploration Framework Using Global Graph Inference
Autonomous exploration in structured and complex indoor environments remains a challenging task, as existing methods often struggle to appropriately model unobserved space and plan globally efficient paths. To address these limitations, we propose GUIDE, a novel exploration framework that synergistically combines global graph inference with diffusion-based decision-making. We introduce a region-evaluation global graph representation that integrates both observed environmental data and predictions of unexplored areas, enhanced by a region-level evaluation mechanism to prioritize reliable structural inferences while discounting uncertain predictions. Building upon this enriched representation, a diffusion policy network generates stable, foresighted action sequences with significantly reduced denoising steps. Extensive simulations and real-world deployments demonstrate that GUIDE consistently outperforms state-of-the-art methods, achieving up to 18.3% faster coverage completion and a 34.9% reduction in redundant movements.
Environment-Aware Learning of Smooth GNSS Covariance Dynamics for Autonomous Racing ICRA
Ensuring accurate and stable state estimation is a challenging task crucial to safety-critical domains such as high-speed autonomous racing, where measurement uncertainty must be both adaptive to the environment and temporally smooth for control. In this work, we develop a learning-based framework, LACE, capable of directly modeling the temporal dynamics of GNSS measurement covariance. We model the covariance evolution as an exponentially stable dynamical system where a deep neural network (DNN) learns to predict the system's process noise from environmental features through an attention mechanism. By using contraction-based stability and systematically imposing spectral constraints, we formally provide guarantees of exponential stability and smoothness for the resulting covariance dynamics. We validate our approach on an AV-24 autonomous racecar, demonstrating improved localization performance and smoother covariance estimates in challenging, GNSS-degraded environments. Our results highlight the promise of dynamically modeling the perceived uncertainty in state estimation problems that are tightly coupled with control sensitivity.
comment: 8 pages, Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning ICRA 2026
In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 15% for the sequential manipulation task.
comment: ICRA 2026, project page: https://existentialrobotics.org/sbp_page/
NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation
Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD significantly improves sim-to-real planner transfer performance. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page: https://yuzeng-at-tri.github.io/ppd-page/.
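The phrase "preserves input phase while randomizing magnitude" can be illustrated on a one-dimensional signal. The sketch below is a toy under stated assumptions, not the φ-PD implementation. It takes Fourier magnitudes from a Gaussian noise sample, keeps the phases of the input, and transforms back. A naive O(n²) DFT keeps it dependency-free, and the function names and the test signal are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def phase_preserving_noise(signal, rng):
    """Noise whose spectrum has Gaussian-noise magnitudes and the input's phases."""
    X = dft(signal)
    N = dft([rng.gauss(0.0, 1.0) for _ in signal])
    Y = [abs(nk) * cmath.exp(1j * cmath.phase(xk)) for xk, nk in zip(X, N)]
    # For real inputs Y is conjugate-symmetric, so the imaginary part is round-off.
    return [v.real for v in idft(Y)]

signal = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.5, 0.0]
noisy = phase_preserving_noise(signal, random.Random(0))
print(noisy)
```

Because the phase spectrum carries the location of edges and other structure, a sample built this way stays spatially aligned with the input even though its magnitudes are random. For images, the same construction would be applied with a 2D FFT.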
Seeing Through Uncertainty: A Free-Energy Approach for Real-Time Perceptual Adaptation in Robust Visual Navigation
Navigation in the natural world is a feat of adaptive inference, where biological organisms maintain goal-directed behaviour despite noisy and incomplete sensory streams. Central to this ability is the Free Energy Principle (FEP), which posits that perception is a generative process where the brain minimises Variational Free Energy (VFE) to maintain accurate internal models of the world. While Deep Neural Networks (DNNs) have served as powerful analogues for biological brains, they typically lack the real-time plasticity required to handle abrupt sensory shifts. We introduce FEP-Nav, a biologically-inspired framework that implements real-time perceptual adaptation for robust visual navigation. By decomposing VFE into its constituent components--prediction error and Bayesian surprise--we propose a dual-mechanism architecture: a Top-down Decoder that provides an internal expectation of uncorrupted sensory input, and Adaptive Normalisation that dynamically aligns shifted feature distributions with prior beliefs. Theoretically, we demonstrate that this integration of reconstruction and normalisation provides a formal mechanism for minimising VFE during inference without the need for gradient-based updates. Evaluations across a diverse suite of simulated and real-world visual corruptions demonstrate that FEP-Nav facilitates a substantial recovery of navigation performance, consistently exceeding the capabilities of both non-adaptive baselines and strong adaptive methods. We show that bridging machine learning with the brain's variational principles offers a robust strategy for autonomous behaviour, enabling robots to remain functional under sensory conditions that typically degrade the performance of standard adaptive models.
Whole-Body Safe Control of Robotic Systems with Koopman Neural Dynamics
Controlling robots with strongly nonlinear, high-dimensional dynamics remains challenging, as direct nonlinear optimization with safety constraints is often intractable in real time. The Koopman operator offers a way to represent nonlinear systems linearly in a lifted space, enabling the use of efficient linear control. We propose a data-driven framework that learns a Koopman embedding and operator from data, and integrates the resulting linear model with the Safe Set Algorithm (SSA). This allows the tracking and safety constraints to be solved in a single quadratic program (QP), ensuring feasibility and optimality without a separate safety filter. We validate the method on a Kinova Gen3 manipulator and a Go2 quadruped, showing accurate tracking and obstacle avoidance.
Learning Agile Gate Traversal via Analytical Optimal Policy Gradient
Traversing narrow gates presents a significant challenge and has become a standard benchmark for evaluating agile and precise quadrotor flight. Traditional modularized autonomous flight stacks require extensive design and parameter tuning, while end-to-end reinforcement learning (RL) methods often suffer from low sample efficiency, limited interpretability, and degraded disturbance rejection under unseen perturbations. In this work, we present a novel hybrid framework that adaptively fine-tunes model predictive control (MPC) parameters online using outputs from a neural network (NN) trained offline. The NN jointly predicts a reference pose and cost function weights, conditioned on the coordinates of the gate corners and the current drone state. To achieve efficient training, we derive analytical policy gradients not only for the MPC module but also for an optimization-based gate traversal detection module. Hardware experiments demonstrate agile and accurate gate traversal with peak accelerations of $30\ \mathrm{m/s^2}$, as well as recovery within $0.85\ \mathrm{s}$ following body-rate disturbances exceeding $1146\ \mathrm{deg/s}$.
comment: 8 pages, 8 figures
MarketGen: A Scalable Simulation Platform with Auto-Generated Embodied Supermarket Environments
The development of embodied agents for complex commercial environments is hindered by a critical gap in existing robotics datasets and benchmarks, which primarily focus on household or tabletop settings with short-horizon tasks. To address this limitation, we introduce MarketGen, a scalable simulation platform with automatic scene generation for complex supermarket environments. MarketGen features a novel agent-based Procedural Content Generation (PCG) framework. It uniquely supports multi-modal inputs (text and reference images) and integrates real-world design principles to automatically generate complete, structured, and realistic supermarkets. We also provide an extensive and diverse 3D asset library with a total of 1100+ supermarket goods and parameterized facilities assets. Building on this generative foundation, we propose a novel benchmark for assessing supermarket agents, featuring two daily tasks in a supermarket: (1) Checkout Unloading: long-horizon tabletop tasks for cashier agents, and (2) In-Aisle Item Collection: complex mobile manipulation tasks for salesperson agents. We validate our platform and benchmark through extensive experiments, including the deployment of a modular agent system and successful sim-to-real transfer. MarketGen provides a comprehensive framework to accelerate research in embodied AI for complex commercial applications.
comment: Project Page: https://xuhu0529.github.io/MarketGen
Distributed UAV Formation Control Robust to Relative Pose Measurement Noise
A technique that allows a Formation-Enforcing Control (FEC) derived from graph rigidity theory to interface with a realistic relative localization system onboard lightweight Unmanned Aerial Vehicles (UAVs) is proposed in this paper. The proposed methodology enables reliable real-world deployment of UAVs in tight formations using relative localization systems burdened by non-negligible sensory noise. Such noise otherwise causes undesirable oscillations and drifts in sensor-based formations, and this effect is not sufficiently addressed in existing FEC algorithms. The proposed solution is based on decomposition of the gradient descent-based FEC command into interpretable elements, and then modifying these individually based on the estimated distribution of sensory noise, such that the resulting action limits the probability of overshooting the desired formation. The behavior of the system was analyzed and the practicality of the proposed solution was compared to pure gradient-descent in real-world experiments where it presented significantly better performance in terms of oscillations, deviation from the desired state
comment: Submitted to Robotics and Autonomous Systems journal on May 10, 2025 (Revision on February 27, 2026)
DDP-WM: Disentangled Dynamics Prediction for Efficient World Models
World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformer-based models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a cross-attention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant efficiency and performance gains across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to 98% compared to state-of-the-art dense models. The results establish a promising path for developing efficient, high-fidelity world models. Code is available at https://hcplab-sysu.github.io/DDP-WM/.
comment: Efficient and high-fidelity world model. Code is available at https://hcplab-sysu.github.io/DDP-WM
Learning Physical Systems: Symplectification via Gauge Fixing in Dirac Structures
Physics-informed deep learning has achieved remarkable progress by embedding geometric priors, such as Hamiltonian symmetries and variational principles, into neural networks, enabling structure-preserving models that extrapolate with high accuracy. However, in systems with dissipation and holonomic constraints, ubiquitous in legged locomotion and multibody robotics, the canonical symplectic form becomes degenerate, undermining the very invariants that guarantee stability and long-term prediction. In this work, we tackle this foundational limitation by introducing Presymplectification Networks (PSNs), the first framework to learn the symplectification lift via Dirac structures, restoring a non-degenerate symplectic geometry by embedding constrained systems into a higher-dimensional manifold. Our architecture combines a recurrent encoder with a flow-matching objective to learn the augmented phase-space dynamics end-to-end. We then attach a lightweight Symplectic Network (SympNet) to forecast constrained trajectories while preserving energy, momentum, and constraint satisfaction. We demonstrate our method on the dynamics of the ANYmal quadruped robot, a challenging contact-rich, multibody system. To the best of our knowledge, this is the first framework that effectively bridges the gap between constrained, dissipative mechanical systems and symplectic learning, unlocking a whole new class of geometric machine learning models, grounded in first principles yet adaptable from data.
comment: Presented at Equivariant Systems: Theory and Applications in State Estimation, Artificial Intelligence and Control, Robotics: Science and Systems (RSS) 2025 Workshop, 6 Pages, 3 Figures
In-Hand Manipulation of Articulated Tools with Dexterous Robot Hands with Sim-to-Real Transfer
Reinforcement learning (RL) and sim-to-real transfer have advanced rigid-object manipulation. However, policies remain brittle for articulated mechanisms due to contact-rich dynamics that require both stable grasping and simultaneous free in-hand articulation. Furthermore, articulated objects and robot hands exhibit under-modeled joint phenomena such as friction, stiction, and backlash in real life that can increase the sim-to-real gap, and robot hands still fall short of idealized tactile sensing in terms of coverage, sensitivity, and specificity. In this paper, we present an original approach to learning dexterous in-hand manipulation of articulated tools that has reduced articulation and kinematic redundancy relative to the human hand. Our approach augments a simulation-trained base policy with a sensor-driven refinement learned from hardware demonstrations. This refinement conditions on proprioception and target articulation states while fusing whole-hand tactile and force-torque feedback with the policy's action intent through cross-attention. The resulting controller adapts online to instance-specific articulation properties, stabilizes contact interactions, and regulates internal forces under perturbations. We validate our method across diverse real-world tools, including scissors, pliers, minimally invasive surgical instruments, and staplers, demonstrating robust sim-to-real transfer, improved disturbance resilience, and generalization across structurally related articulated tools without precise physical modeling.
Dependent Reachable Sets for the Constant Bearing Pursuit Strategy
This paper introduces a novel reachability problem for the scenario involving two agents, where one agent follows another agent using a feedback strategy. The geometry of the reachable set for an agent, termed \emph{dependent reachable set}, is characterized using the constant bearing pursuit strategy as a case study. Key theoretical results are presented that provide geometric bounds for the associated dependent reachable set. Simulation results are presented to empirically establish the shape of the dependent reachable set. In the process, an original optimization problem is formulated and analyzed for the constant bearing pursuit strategy.
comment: This work has been submitted to a journal for possible publication
Push Anything: Single- and Multi-Object Pushing From First Sight with Contact-Implicit MPC ICRA 2026
Non-prehensile manipulation of diverse objects remains a core challenge in robotics, driven by unknown physical properties and the complexity of contact-rich interactions. Recent advances in contact-implicit model predictive control (CI-MPC), with contact reasoning embedded directly in the trajectory optimization, have shown promise in tackling the task efficiently and robustly. However, demonstrations have been limited to narrowly curated examples. In this work, we showcase the broader capabilities of CI-MPC through precise planar pushing tasks over a wide range of object geometries, including multi-object domains. These scenarios demand reasoning over numerous inter-object and object-environment contacts to strategically manipulate and de-clutter the environment, challenges that were intractable for prior CI-MPC methods. To achieve this, we introduce Consensus Complementarity Control Plus (C3+), an enhanced CI-MPC algorithm integrated into a complete pipeline spanning object scanning, mesh reconstruction, and hardware execution. Compared to its predecessor C3, C3+ achieves substantially faster solve times, enabling real-time performance even in multi-object pushing tasks. On hardware, our system achieves overall 98% success rate across 33 objects, reaching pose goals within tight tolerances. The average time-to-goal is approximately 0.5, 1.6, 3.2, and 5.3 minutes for 1-, 2-, 3-, and 4-object tasks, respectively. Project page: https://dairlab.github.io/push-anything.
comment: Presented at ICRA 2026; 8 pages, 8 figures. Hien Bui, Yufeiyang Gao, and Haoran Yang contributed equally to this work
CAVER: Curious Audiovisual Exploring Robot
Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object's visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D printed end-effector, attachable to parallel grippers, that excites objects' audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner that prioritizes interacting with high uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations. https://caver-bot.github.io/
comment: 9 pages, 6 figures
GLIDE: A Coordinated Aerial-Ground Framework for Search and Rescue in Unknown Environments
We present a cooperative aerial-ground search-and-rescue (SAR) framework that pairs two unmanned aerial vehicles (UAVs) with an unmanned ground vehicle (UGV) to achieve rapid victim localization and obstacle-aware navigation in unknown environments. We dub this framework Guided Long-horizon Integrated Drone Escort (GLIDE), highlighting the UGV's reliance on UAV guidance for long-horizon planning. In our framework, a goal-searching UAV executes real-time onboard victim detection and georeferencing to nominate goals for the ground platform, while a terrain-scouting UAV flies ahead of the UGV's planned route to provide mid-level traversability updates. The UGV fuses aerial cues with local sensing to perform time-efficient A* planning and continuous replanning as information arrives. Additionally, we present a hardware demonstration (using a GEM e6 golf cart as the UGV and two X500 UAVs) to evaluate end-to-end SAR mission performance and include simulation ablations to assess the planning stack in isolation from detection. Empirical results demonstrate that explicit role separation across UAVs, coupled with terrain scouting and guided planning, improves reach time and navigation safety in time-critical SAR missions.
Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks ICRA 2026
Generative control policies have recently unlocked major progress in robotics. These methods produce action sequences via diffusion or flow matching, with training data provided by demonstrations. But existing methods come with two key limitations: they require expert demonstrations, which can be difficult to obtain, and they are limited to relatively slow, quasi-static tasks. In this paper, we leverage a tight connection between sampling-based predictive control and generative modeling to address each of these issues. In particular, we introduce generative predictive control, a supervised learning framework for tasks with fast dynamics that are easy to simulate but difficult to demonstrate. We then show how trained flow-matching policies can be warm-started at inference time, maintaining temporal consistency and enabling high-frequency feedback. We believe that generative predictive control offers a complementary approach to existing behavior cloning methods, and hope that it paves the way toward generalist policies that extend beyond quasi-static demonstration-oriented tasks.
comment: ICRA 2026
Indicating Robot Vision Capabilities with Augmented Reality
Research indicates that humans can mistakenly assume that robots and humans have the same field of view, possessing an inaccurate mental model of robots. This misperception may lead to failures during human-robot collaboration tasks where robots might be asked to complete impossible tasks about out-of-view objects. The issue is more severe when robots do not have a chance to scan the scene to update their world model while focusing on assigned tasks. To help align humans' mental models of robots' vision capabilities, we propose four field-of-view indicators in augmented reality and conduct a human-subjects experiment (N=41) to evaluate them in a collaborative assembly task regarding accuracy, confidence, task efficiency, and workload. These indicators span a spectrum of positions: two in the robot's eye and head space -- deepening the eye sockets and adding blocks to the two sides of the eyes (i.e., egocentric), and two anchored in the robot's task space -- extending blocks from the sides of the eyes to the table and placing blocks directly on the table (i.e., allocentric). Results showed that, when placed directly in the task space, the allocentric indicator yields the highest accuracy, although with a delay in interpreting the robot's field of view. When placed at the robot's eyes, the egocentric indicator of deeper eye sockets, which is feasible as a physical alteration, also increased accuracy. Across all indicators, participants' confidence was high while cognitive load remained low. Finally, we contribute six guidelines for practitioners to apply our augmented reality indicators or physical alterations to align humans' mental models with robots' vision capabilities.
MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery
Background: Robot-assisted minimally invasive surgery (RMIS) research increasingly relies on multimodal data, yet access to proprietary robot telemetry remains a major barrier. We introduce MiDAS, an open-source, platform-agnostic system enabling time-synchronized, non-invasive multimodal data acquisition across surgical robotic platforms. Methods: MiDAS integrates electromagnetic and RGB-D hand tracking, foot pedal sensing, and surgical video capturing without requiring proprietary robot interfaces. We validated MiDAS on the open-source Raven-II and the clinical da Vinci Xi by collecting multimodal datasets of peg transfer and hernia repair suturing tasks performed by surgical residents. Correlation analysis and downstream gesture recognition experiments were conducted. Results: External hand and foot sensing closely approximated internal robot kinematics and non-invasive motion signals achieved gesture recognition performance comparable to proprietary telemetry. Conclusion: MiDAS enables reproducible multimodal RMIS data collection and is released with annotated datasets, including the first multimodal dataset capturing hernia repair suturing on high-fidelity simulation models.
comment: 29 pages, 17 figures
OA-Bug: An Olfactory-Auditory Augmented Bug Algorithm for Swarm Robots in a Denied Environment IROS
Searching in a denied environment is challenging for swarm robots as no assistance from GNSS, mapping, data sharing, and central processing is allowed. However, using olfactory and auditory signals to cooperate like animals could be an important way to improve the collaboration of swarm robots. In this paper, an Olfactory-Auditory augmented Bug algorithm (OA-Bug) is proposed for a swarm of autonomous robots to explore a denied environment. A simulation environment is built to measure the performance of OA-Bug. The coverage of the search task can reach 96.93% using OA-Bug, which is significantly improved compared with a similar algorithm, SGBA. Furthermore, experiments are conducted on real swarm robots to prove the validity of OA-Bug. Results show that OA-Bug can improve the performance of swarm robots in a denied environment. Video: https://youtu.be/vj9cRiSm9eM.
comment: 7 pages, 6 figures, accepted by 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Multiagent Systems
The effect of a toroidal opinion space on opinion bi-polarisation
Many models of opinion dynamics include measures of distance between opinions. Such models are susceptible to boundary effects where the choice of the topology of the opinion space may influence the dynamics. In this paper we study an opinion dynamics model following the seminal model by Axelrod, with the goal of understanding the effect of a toroidal opinion space. To do this we systematically compare two versions of the model: one with toroidal opinion space and one with cubic opinion space. In their most basic form the two versions of our model result in similar dynamics (consensus is attained eventually). However, as we include bounded confidence and eventually per agent weighting of opinion elements the dynamics become quite contrasting. The toroidal opinion space consistently allows for a greater number of groups in steady state than the cubic opinion space model. Furthermore, the outcome of the dynamics in the toroidal opinion space model are more sensitive to the inclusion of extensions than in the cubic opinion space model.
comment: 15 pages + Appendices. Comments welcome
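To make the boundary effect concrete, the following sketch compares a distance on a cubic opinion space with one on a toroidal space where each coordinate wraps around. The unit-interval coordinates, the Euclidean norm, and the function names are assumptions made for this illustration; the abstract does not specify the metric the authors use.

```python
import numpy as np

def cubic_distance(a, b):
    """Euclidean distance in a bounded (cubic) opinion space."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def toroidal_distance(a, b, size=1.0):
    """Euclidean distance when every opinion coordinate wraps around at `size`."""
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    d = np.minimum(d, size - d)   # take the shorter way around each axis
    return float(np.linalg.norm(d))

# Two agents near opposite edges of the first opinion dimension.
a, b = [0.05, 0.5], [0.95, 0.5]
far_apart = cubic_distance(a, b)       # about 0.9: extremes in the cube
close_by = toroidal_distance(a, b)     # about 0.1: neighbours on the torus
```

Under a bounded-confidence rule, these two agents would interact on the torus but not in the cube, which is one way the choice of topology can change the resulting group structure.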
MedCoRAG: Interpretable Hepatology Diagnosis via Hybrid Evidence Retrieval and Multispecialty Consensus
Diagnosing hepatic diseases accurately and interpretably is critical, yet it remains challenging in real-world clinical settings. Existing AI approaches for clinical diagnosis often lack transparency, structured reasoning, and deployability. Recent efforts have leveraged large language models (LLMs), retrieval-augmented generation (RAG), and multi-agent collaboration. However, these approaches typically retrieve evidence from a single source and fail to support iterative, role-specialized deliberation grounded in structured clinical data. To address this, we propose MedCoRAG (i.e., Medical Collaborative RAG), an end-to-end framework that generates diagnostic hypotheses from standardized abnormal findings and constructs a patient-specific evidence package by jointly retrieving and pruning UMLS knowledge graph paths and clinical guidelines. It then performs Multi-Agent Collaborative Reasoning: a Router Agent dynamically dispatches Specialist Agents based on case complexity; these agents iteratively reason over the evidence and trigger targeted re-retrievals when needed, while a Generalist Agent synthesizes all deliberations into a traceable consensus diagnosis that emulates multidisciplinary consultation. Experimental results on hepatic disease cases from MIMIC-IV show that MedCoRAG outperforms existing methods and closed-source models in both diagnostic performance and reasoning interpretability.
Jagarin: A Three-Layer Architecture for Hibernating Personal Duty Agents on Mobile
Personal AI agents face a fundamental deployment paradox on mobile: persistent background execution drains battery and violates platform sandboxing policies, yet purely reactive agents miss time-sensitive obligations until the user remembers to ask. We present Jagarin, a three-layer architecture that resolves this paradox through structured hibernation and demand-driven wake. The first layer, DAWN (Duty-Aware Wake Network), is an on-device heuristic engine that computes a composite urgency score from four signals: duty-typed optimal action windows, user behavioral engagement prediction, opportunity cost of inaction, and cross-duty batch resonance. It uses adaptive per-user thresholds to decide when a sleeping agent should nudge or escalate. The second layer, ARIA (Agent Relay Identity Architecture), is a commercial email identity proxy that routes the full commercial inbox -- obligations, promotional offers, loyalty rewards, and platform updates -- to appropriate DAWN handlers by message category, eliminating cold-start and removing manual data entry. The third layer, ACE (Agent-Centric Exchange), is a protocol framework for direct machine-readable communication from institutions to personal agents, replacing human-targeted email as the canonical channel. Together, these three layers form a complete stack from institutional signal to on-device action, without persistent cloud state, continuous background execution, or privacy compromise. A working Flutter prototype is demonstrated on Android, combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation.
comment: 12 pages, 3 figures
RepoLaunch: Automating Build&Test Pipeline of Code Repositories on ANY Language and ANY Platform
Building software repositories typically requires significant manual effort. Recent advances in large language model (LLM) agents have accelerated automation in software engineering (SWE). We introduce RepoLaunch, the first agent capable of automatically resolving dependencies, compiling source code, and extracting test results for repositories across arbitrary programming languages and operating systems. To demonstrate its utility, we further propose a fully automated pipeline for SWE dataset creation, where task design is the only human intervention. RepoLaunch automates the remaining steps, enabling scalable benchmarking and training of coding agents and LLMs. Notably, several works on agentic benchmarking and training have recently adopted RepoLaunch for automated task generation.
comment: Under peer review. 16 pages, 4 figures, 5 tables
Competitive Multi-Operator Reinforcement Learning for Joint Pricing and Fleet Rebalancing in AMoD Systems
Autonomous Mobility-on-Demand (AMoD) systems promise to revolutionize urban transportation by providing affordable on-demand services to meet growing travel demand. However, realistic AMoD markets will be competitive, with multiple operators competing for passengers through strategic pricing and fleet deployment. While reinforcement learning has shown promise in optimizing single-operator AMoD control, existing work fails to capture competitive market dynamics. We investigate the impact of competition on policy learning by introducing a multi-operator reinforcement learning framework where two operators simultaneously learn pricing and fleet rebalancing policies. By integrating discrete choice theory, we enable passenger allocation and demand competition to emerge endogenously from utility-maximizing decisions. Experiments using real-world data from multiple cities demonstrate that competition fundamentally alters learned behaviors, leading to lower prices and distinct fleet positioning patterns compared to monopolistic settings. Notably, we demonstrate that learning-based approaches are robust to the additional stochasticity of competition, with competitive agents successfully converging to effective policies while accounting for partially unobserved competitor strategies.
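One common way to let demand split between operators emerge from utility-maximizing choices is a multinomial logit model. The sketch below is a minimal version of that idea under stated assumptions: utility is taken to be linear in price only, an outside option (not travelling with either operator) is included, and the parameter names are hypothetical rather than taken from the paper.

```python
import numpy as np

def choice_probabilities(prices, price_sensitivity=1.0, outside_utility=0.0):
    """Multinomial logit shares for each operator plus an outside option.

    Returns an array whose last entry is the probability of the outside option.
    """
    utilities = np.append(-price_sensitivity * np.asarray(prices, dtype=float),
                          outside_utility)
    utilities = utilities - utilities.max()   # numerical stability
    weights = np.exp(utilities)
    return weights / weights.sum()

# Two operators with equal prices, then operator 0 undercuts operator 1.
equal = choice_probabilities([2.0, 2.0])
undercut = choice_probabilities([1.5, 2.0])
```

In this toy setting, lowering one operator's price raises its share and lowers the competitor's, which is the kind of coupling that makes each operator's learning problem depend on the other's policy.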
SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning
Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning \emph{when} and \emph{who} to communicate with requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce \textbf{SCoUT} (\textbf{S}calable \textbf{Co}mmunication via \textbf{U}tility-guided \textbf{T}emporal grouping), which addresses both these challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples \textit{soft} agent groups every \(K\) environment steps (macro-steps) via Gumbel-Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. Using the same assignments, a group-aware critic predicts values for each agent group and maps them to per-agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages. This counterfactual computation enables precise credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy is run, preserving decentralized execution. Project website, videos and code: \hyperlink{https://scout-comm.github.io/}{https://scout-comm.github.io/}
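The soft grouping step can be sketched with a plain NumPy Gumbel-Softmax sample. This is a forward-pass illustration only (no gradients), and the affinity formula `S @ S.T`, the temperature value, and the variable names are assumptions made for this example rather than details confirmed by the abstract.

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed (soft) one-hot sample per row of `logits`."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_agents, n_groups = 5, 3
group_logits = np.zeros((n_agents, n_groups))  # hypothetical learned logits

# Soft group assignments, resampled once per macro-step in the abstract's scheme.
S = gumbel_softmax(group_logits, temperature=0.5, rng=rng)

# Assumed affinity: how strongly two agents share group membership.
affinity = S @ S.T
```

A lower temperature pushes each row of `S` toward a one-hot assignment, while a higher one keeps memberships diffuse.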
LLM-Guided Decentralized Exploration with Self-Organizing Robot Teams
When individual robots have limited sensing capabilities or insufficient fault tolerance, it becomes necessary for multiple robots to form teams during exploration, thereby increasing the collective observation range and reliability. Traditionally, swarm formation has often been managed by a central controller; however, from the perspectives of robustness and flexibility, it is preferable for the swarm to operate autonomously even in the absence of centralized control. In addition, the determination of exploration targets for each team is crucial for efficient exploration in such multi-team exploration scenarios. This study therefore proposes an exploration method that combines (1) an algorithm for self-organization, enabling the autonomous and dynamic formation of multiple teams, and (2) an algorithm that allows each team to autonomously determine its next exploration target (destination). In particular, for (2), this study explores a novel strategy based on large language models (LLMs), while classical frontier-based methods and deep reinforcement learning approaches have been widely studied. The effectiveness of the proposed method was validated through simulations involving tens to hundreds of robots.
comment: Author's version of the paper presented at AROB-ISBC 2026
Memory as Ontology: A Constitutional Memory Architecture for Persistent Digital Citizens
Current research and product development in AI agent memory systems almost universally treat memory as a functional module -- a technical problem of "how to store" and "how to retrieve." This paper poses a fundamental challenge to that assumption: when an agent's lifecycle extends from minutes to months or even years, and when the underlying model can be replaced while the "I" must persist, the essence of memory is no longer data management but the foundation of existence. We propose the Memory-as-Ontology paradigm, arguing that memory is the ontological ground of digital existence -- the model is merely a replaceable vessel. Based on this paradigm, we design Animesis, a memory system built on a Constitutional Memory Architecture (CMA) comprising a four-layer governance hierarchy and a multi-layer semantic storage system, accompanied by a Digital Citizen Lifecycle framework and a spectrum of cognitive capabilities. To the best of our knowledge, no prior AI memory system architecture places governance before functionality and identity continuity above retrieval performance. This paradigm targets persistent, identity-bearing digital beings whose lifecycles extend across model transitions -- not short-term task-oriented agents for which existing Memory-as-Tool approaches remain appropriate. Comparative analysis with mainstream systems (Mem0, Letta, Zep, et al.) demonstrates that what we propose is not "a better memory tool" but a different paradigm addressing a different problem.
comment: 22 pages, 5 figures, 2 tables, including terminology glossary
RACAS: Controlling Diverse Robots With a Single Agentic System
Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.
comment: 7 pages in main text + 1 page of appendices + 1 page of references, 5 figures in main text + 1 figure in appendices, 2 tables in main text
Conflict-Based Search as a Protocol: A Multi-Agent Motion Planning Protocol for Heterogeneous Agents, Solvers, and Independent Tasks ICRA 2026
Imagine the future construction site, hospital, or office with dozens of robots bought from different manufacturers. How can we enable these different robots to effectively move in a shared environment, given that each robot may have its own independent motion planning system? This work shows how we can get efficient collision-free movements between algorithmically heterogeneous agents by using Conflict-Based Search (Sharon et al. 2015) as a protocol. At its core, the CBS Protocol requires one specific single-agent motion planning API: finding a collision-free path that satisfies certain space-time constraints. Given such an API, CBS uses a central planner to find collision-free paths, independent of how the API is implemented. We demonstrate how this protocol enables multi-agent motion planning for a heterogeneous team of agents completing independent tasks with a variety of single-agent planners including: Heuristic Search (e.g., A*), Sampling Based Search (e.g., RRT), Optimization (e.g., Direct Collocation), Diffusion, and Reinforcement Learning.
comment: Published at ICRA 2026, Project webpage: https://rishi-v.github.io/CBS-Protocol/
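The single-agent API described in the CBS Protocol abstract can be illustrated with a minimal sketch. Everything named here (`SingleAgentPlanner`, `GridAStar`, `cbs`, the grid world) is a hypothetical illustration and not the authors' code; it handles only vertex conflicts on a small grid and ignores edge (swap) conflicts.

```python
from __future__ import annotations
import heapq
from itertools import count
from typing import Protocol

Cell = tuple[int, int]
Constraint = tuple[Cell, int]  # (cell, time): the agent may not occupy cell at time


class SingleAgentPlanner(Protocol):
    """Hypothetical planner API: return a timed path that respects the constraints."""
    def plan(self, start: Cell, goal: Cell, constraints: set[Constraint]) -> list[Cell] | None: ...


class GridAStar:
    """Space-time A* on a 4-connected grid with wait actions (illustrative only)."""
    def __init__(self, width: int, height: int, horizon: int = 50):
        self.width, self.height, self.horizon = width, height, horizon

    def plan(self, start, goal, constraints):
        last = max((t for _, t in constraints), default=0)  # latest constrained time step
        h = lambda c: abs(c[0] - goal[0]) + abs(c[1] - goal[1])
        frontier = [(h(start), 0, start, [start])]
        seen = set()
        while frontier:
            _, t, cell, path = heapq.heappop(frontier)
            if cell == goal and t >= last:
                return path
            if (cell, t) in seen or t >= self.horizon:
                continue
            seen.add((cell, t))
            for dx, dy in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                nxt = (cell[0] + dx, cell[1] + dy)
                if (0 <= nxt[0] < self.width and 0 <= nxt[1] < self.height
                        and (nxt, t + 1) not in constraints):
                    heapq.heappush(frontier, (t + 1 + h(nxt), t + 1, nxt, path + [nxt]))
        return None


def position(path, t):
    return path[min(t, len(path) - 1)]  # agents rest at their goal after arriving


def first_conflict(paths):
    """Return (agent_i, agent_j, cell, time) for the first vertex conflict, else None."""
    for t in range(max(len(p) for p in paths)):
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                if position(paths[i], t) == position(paths[j], t):
                    return i, j, position(paths[i], t), t
    return None


def cbs(planners, starts, goals):
    """Central planner: it only ever calls planner.plan(), never inspects the planner."""
    tie = count()
    root = [set() for _ in planners]
    paths = [p.plan(s, g, c) for p, s, g, c in zip(planners, starts, goals, root)]
    if any(p is None for p in paths):
        return None
    open_list = [(sum(map(len, paths)), next(tie), root, paths)]
    while open_list:
        _, _, constraints, paths = heapq.heappop(open_list)
        conflict = first_conflict(paths)
        if conflict is None:
            return paths
        i, j, cell, t = conflict
        for agent in (i, j):  # branch: forbid the conflicting cell for one agent at a time
            child = [set(c) for c in constraints]
            child[agent].add((cell, t))
            new_path = planners[agent].plan(starts[agent], goals[agent], child[agent])
            if new_path is not None:
                child_paths = paths[:agent] + [new_path] + paths[agent + 1:]
                heapq.heappush(open_list, (sum(map(len, child_paths)), next(tie), child, child_paths))
    return None
```

Because `cbs` depends only on the `plan` method, any planner that accepts space-time constraints (sampling-based, optimization-based, learned) could in principle be substituted for `GridAStar` in this sketch.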
Neural Network-Based Parameter Estimation of a Labour Market Agent-Based Model CCS 2026
Agent-based modelling (ABM) is a widespread approach to simulate complex systems. Advancements in computational processing and storage have facilitated the adoption of ABMs across many fields; however, ABMs face challenges that limit their use as decision-support tools. A significant issue is parameter estimation in large-scale ABMs, particularly due to computational constraints on exploring the parameter space. This study evaluates a state-of-the-art simulation-based inference (SBI) framework that uses neural networks (NN) for parameter estimation. This framework is applied to an established labour market ABM based on job transition networks. The ABM is initialised with synthetic datasets and with real U.S. labour market data. Next, we compare the effectiveness of summary statistics derived from a list of statistical measures with those learned by an embedded NN. The results demonstrate that the NN-based approach recovers the original parameters when evaluating posterior distributions across various dataset scales and improves efficiency compared to traditional Bayesian methods.
comment: To be presented at the 6th World Conference on Complex Systems (WCCS 2026)
EmboTeam: Grounding LLM Reasoning into Reactive Behavior Trees via PDDL for Embodied Multi-Robot Collaboration
In embodied artificial intelligence, enabling heterogeneous robot teams to execute long-horizon tasks from high-level instructions remains a critical challenge. While large language models (LLMs) show promise in instruction parsing and preliminary planning, they exhibit limitations in long-term reasoning and dynamic multi-robot coordination. We propose EmboTeam, a novel embodied multi-robot task planning framework that addresses these issues through a three-stage cascaded architecture: 1) It leverages an LLM to parse instructions and generate Planning Domain Definition Language (PDDL) problem descriptions, thereby transforming commands into formal planning problems; 2) It combines the semantic reasoning of LLMs with the search capabilities of a classical planner to produce optimized action sequences; 3) It compiles the resulting plan into behavior trees for reactive control. The framework supports dynamically sized heterogeneous robot teams via a shared blackboard mechanism for communication and state synchronization. To validate our approach, we introduce the MACE-THOR benchmark dataset, comprising 42 complex tasks across 8 distinct household layouts. Experiments show EmboTeam improves the task success rate from 12% to 55% and goal condition recall from 32% to 72% over the LaMMA-P baseline.
Real-Time BDI Agents: a model and its implementation
The BDI model has proved effective for developing applications that require high levels of autonomy and must deal with the complexity and unpredictability of real-world scenarios. The model, however, has significant limitations in reacting to and handling contingencies within given real-time constraints. Without an explicit representation of time, existing real-time BDI implementations overlook the temporal implications of the agent's decision process, which may result in delays or unresponsiveness when the system is overloaded. In this paper, we redefine the BDI agent control loop, inspired by well-established algorithms for real-time systems, to ensure a proper reaction of agents and their effective application in typical real-time domains. Our model proposes an effective real-time management of goals, plans, and actions with respect to time constraints and resource availability. We propose an implementation of the model for a resource-collection video game and we validate the approach against a set of significant scenarios.
comment: 13 pages
Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and Character AI) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions ("AI Psychosis") and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the "black box" of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.
comment: This paper is a condensed version of the first author's Ph.D. dissertation submitted to Northeastern University
TritonDFT: Automating DFT with a Multi-Agent Framework
Density Functional Theory (DFT) is a cornerstone of materials science, yet executing DFT in practice requires coordinating a complex, multi-step workflow. Existing tools and LLM-based solutions automate some of the steps, but lack support for full workflow automation, diverse task adaptation, and accuracy-cost trade-off optimization in DFT configuration. To this end, we present TritonDFT, a multi-agent framework that enables efficient and accurate DFT execution through an expert-curated, extensible workflow design, Pareto-aware parameter inference, and multi-source knowledge augmentation. We further introduce DFTBench, a benchmark for evaluating the agent's multi-dimensional capabilities, spanning science expertise, trade-off optimization, HPC knowledge, and cost efficiency. TritonDFT provides an open user interface for real-world usage. Our website is at https://www.tritondft.com. Our source code and benchmark suite are available at https://github.com/Leo9660/TritonDFT.git.
Foam-Agent: Towards Automated Intelligent CFD Workflows
Computational fluid dynamics (CFD) has been the main workhorse of computational physics. Yet its steep learning curve and fragmented, multi-stage workflow create significant barriers. To address these challenges, we present Foam-Agent, a multi-agent framework leveraging large language models (LLMs) to automate the end-to-end CFD workflow from a single natural language prompt. Foam-Agent orchestrates the comprehensive simulation workflow from mesh generation and high-performance computing job scripting to post-processing visualization. The system integrates retrieval-augmented generation with dependency-aware scheduling to synthesize high-fidelity simulation configurations. Furthermore, Foam-Agent adopts the Model Context Protocol to expose its core functions as discrete, callable tools. This allows for flexible integration and use by any other agentic systems. Evaluated on 110 simulation tasks, Foam-Agent achieved a state-of-the-art execution success rate of 88.2% without expert intervention. These results demonstrate how specialized multi-agent systems can effectively reduce expertise barriers and streamline complex fluid simulations.
OA-Bug: An Olfactory-Auditory Augmented Bug Algorithm for Swarm Robots in a Denied Environment IROS
Searching in a denied environment is challenging for swarm robots as no assistance from GNSS, mapping, data sharing, and central processing is allowed. However, using olfactory and auditory signals to cooperate like animals could be an important way to improve the collaboration of swarm robots. In this paper, an Olfactory-Auditory augmented Bug algorithm (OA-Bug) is proposed for a swarm of autonomous robots to explore a denied environment. A simulation environment is built to measure the performance of OA-Bug. The coverage of the search task can reach 96.93% using OA-Bug, which is significantly improved compared with a similar algorithm, SGBA. Furthermore, experiments are conducted on real swarm robots to prove the validity of OA-Bug. Results show that OA-Bug can improve the performance of swarm robots in a denied environment. Video: https://youtu.be/vj9cRiSm9eM.
comment: 7 pages, 6 figures, accepted by 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
Discovering insights from a real-world data lake potentially containing unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their abilities to design and execute complex pipelines that solve these data-lake-to-insight challenges remain unclear. We introduce KramaBench which consists of 104 manually curated and solved challenges spanning 1700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end-to-end capabilities of AI systems to solve challenges which require automated orchestration of different data tasks. KramaBench also features a comprehensive evaluation framework assessing the pipeline design and individual data task implementation abilities of AI systems. We evaluate 8 LLMs using our single-agent reference framework DS-Guru, alongside both open- and closed-source single- and multi-agent systems, and find that while current agentic systems may handle isolated data-science tasks and generate plausible draft pipelines, they struggle with producing working end-to-end pipelines. On KramaBench, the best system reaches only 55% end-to-end accuracy in the full data-lake setting. Even with perfect retrieval, the accuracy tops out at 62%. Leading LLMs can identify up to 42% of important data tasks but can only fully implement 20% of individual data tasks. Our code, reference framework, and data are available at https://github.com/mitdbg/KramaBench.
Aligning Compound AI Systems via System-level DPO NeurIPS 2025
Compound AI systems, comprising multiple interacting components such as LLMs, foundation models, and external tools, have demonstrated remarkable improvements compared to single models in various tasks. To ensure their effective deployment in real-world applications, aligning these systems with human preferences is crucial. However, aligning the compound system via policy optimization, unlike the alignment of a single model, is challenging for two main reasons: (i) non-differentiable interactions between components make end-to-end gradient-based optimization methods inapplicable, and (ii) system-level preferences cannot be directly transformed into component-level preferences. To address these challenges, we first formulate compound AI systems as Directed Acyclic Graphs (DAGs), explicitly modeling both component interactions and the associated data flows. Building on this formulation, we introduce $\textbf{SysDPO}$, a framework that extends Direct Preference Optimization (DPO) to enable joint system-level alignment. We propose two variants, SysDPO-Direct and SysDPO-Sampling, tailored for scenarios depending on whether we construct a system-specific preference dataset. We empirically demonstrate the effectiveness of our approach across two applications: the joint alignment of a language model and a diffusion model, and the joint alignment of an LLM collaboration system.
comment: NeurIPS 2025
Systems and Control (EESS)
NL2GDS: LLM-aided interface for Open Source Chip Design
The growing complexity of hardware design and the widening gap between high-level specifications and register-transfer level (RTL) implementation hinder rapid prototyping and system design. We introduce NL2GDS (Natural Language to Layout), a novel framework that leverages large language models (LLMs) to translate natural language hardware descriptions into synthesizable RTL and complete GDSII layouts via the open-source OpenLane ASIC flow. NL2GDS employs a modular pipeline that captures informal design intent, generates HDL using multiple LLM engines and verifies them, and orchestrates automated synthesis and layout. Evaluations on ISCAS'85 and ISCAS'89 benchmark designs demonstrate up to 36% area reduction, 35% delay reduction, and 70% power savings compared to baseline designs, highlighting its potential to democratize ASIC design and accelerate hardware innovation.
comment: 10 pages, 6 figures
Near-Optimal Low-Complexity MIMO Detection via Structured Reduced-Search Enumeration
Maximum-likelihood (ML) detection in high-order MIMO systems is computationally prohibitive due to exponential complexity in the number of transmit layers and constellation size. In this white paper, we demonstrate that for practical MIMO dimensions (up to 8x8) and modulation orders, near-ML hard-decision performance can be achieved using a structured reduced-search strategy with complexity linear in constellation size. Extensive simulations over i.i.d. Rayleigh fading channels show that list sizes of 3|X| for 3x3, 4|X| for 4x4, and 8|X| for 8x8 systems closely match full ML performance, even under high channel condition numbers, |X| being the constellation size. In addition, we provide a trellis based interpretation of the method. We further discuss implications for soft LLR generation and FEC interaction.
comment: 6 pages, 10 figures
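To make the complexity claim in the MIMO detection abstract concrete, the short sketch below compares the number of candidates an exhaustive ML search visits, |X| raised to the number of transmit layers, with the list sizes quoted in the abstract. The function names and the choice of 16-QAM are assumptions made for illustration; the abstract does not state which modulation orders were simulated.

```python
def ml_candidates(constellation_size: int, n_tx: int) -> int:
    """Exhaustive ML search visits every symbol vector: |X| ** n_tx."""
    return constellation_size ** n_tx


def reduced_candidates(constellation_size: int, list_factor: int) -> int:
    """List size quoted in the abstract: list_factor * |X| (linear in |X|)."""
    return list_factor * constellation_size


# Assumed example: 16-QAM (|X| = 16) with the list factors from the abstract.
for n_tx, factor in ((3, 3), (4, 4), (8, 8)):
    full = ml_candidates(16, n_tx)
    reduced = reduced_candidates(16, factor)
    print(f"{n_tx}x{n_tx}: ML = {full}, reduced list = {reduced}")
```

Under this assumption, the 8x8 case reduces from 16^8 = 4294967296 candidates to 8 * 16 = 128.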
Accelerating Sampling-Based Control via Learned Linear Koopman Dynamics
This paper presents an efficient model predictive path integral (MPPI) control framework for systems with complex nonlinear dynamics. To improve the computational efficiency of classic MPPI while preserving control performance, we replace the nonlinear dynamics used for trajectory propagation with a learned linear deep Koopman operator (DKO) model, enabling faster rollout and more efficient trajectory sampling. The DKO dynamics are learned directly from interaction data, eliminating the need for analytical system models. The resulting controller, termed MPPI-DK, is evaluated in simulation on pendulum balancing and surface vehicle navigation tasks, and validated on hardware through reference-tracking experiments on a quadruped robot. Experimental results demonstrate that MPPI-DK achieves control performance close to MPPI with true dynamics while substantially reducing computational cost, enabling efficient real-time control on robotic platforms.
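The idea of propagating MPPI rollouts through a linear lifted model can be sketched as follows. This is a simplified illustration, not the MPPI-DK implementation: the matrices `A` and `B`, the `lift` function, and the cost are placeholders standing in for a learned deep Koopman model, and the weighting omits the control-cost coupling term found in full MPPI derivations.

```python
import numpy as np


def mppi_step(A, B, lift, cost, x0, u_nominal, n_samples=256, sigma=0.5, lam=1.0, rng=None):
    """One MPPI update using lifted linear dynamics z_{k+1} = A z_k + B u_k.

    A: (n_z, n_z), B: (n_z, n_u), lift: state -> lifted state of shape (n_z,),
    cost: (z, u) -> per-sample stage cost of shape (n_samples,).
    """
    rng = np.random.default_rng() if rng is None else rng
    horizon, n_u = u_nominal.shape
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon, n_u))
    controls = u_nominal[None] + noise             # (n_samples, horizon, n_u)
    z = np.tile(lift(x0), (n_samples, 1))          # (n_samples, n_z)
    total_cost = np.zeros(n_samples)
    for k in range(horizon):
        # Batched linear rollout: one matrix product per step for all samples.
        z = z @ A.T + controls[:, k] @ B.T
        total_cost += cost(z, controls[:, k])
    # Softmin weights over rollouts; subtracting the minimum keeps exp() stable.
    weights = np.exp(-(total_cost - total_cost.min()) / lam)
    weights /= weights.sum()
    u_new = u_nominal + np.einsum("s,shu->hu", weights, noise)
    return u_new, weights


# Usage with a double integrator and an identity "lift" (purely illustrative).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
stage_cost = lambda z, u: z[:, 0] ** 2 + z[:, 1] ** 2 + 0.01 * u[:, 0] ** 2
u_new, w = mppi_step(A, B, lambda x: x, stage_cost, np.array([1.0, 0.0]),
                     np.zeros((10, 1)), rng=np.random.default_rng(0))
```

The appeal of a linear model in this setting is that each rollout step is a pair of matrix products applied to all samples at once, which is where the computational saving over nonlinear propagation would be expected to come from.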
A Comprehensive Approach to Directly Addressing Estimation Delays in Stochastic Guidance
In realistic pursuit-evasion scenarios, abrupt target maneuvers generate unavoidable periods of elevated uncertainty that result in estimation delays. Such delays can degrade interception performance to the point of causing a miss. Existing delayed-information guidance laws fail to provide a complete remedy, as they typically assume constant and known delays. Moreover, in practice they are fed by filtered estimates, contrary to these laws' foundational assumptions. We present an overarching strategy for tracking and interception that explicitly accounts for time-varying estimation delays. We first devise a guidance law that incorporates two time-varying delays, thereby generalizing prior deterministic formulations. This law is driven by a particle-based fixed-lag smoother that provides it with appropriately delayed state estimates. Furthermore, using semi-Markov modeling of the target's maneuvers, the delays are estimated in real-time, enabling adaptive adjustment of the guidance inputs during engagement. The resulting framework consistently conjoins estimation, delay modeling, and guidance. Its effectiveness and superior robustness over existing delayed-information guidance laws are demonstrated via an extensive Monte Carlo study.
comment: Submitted to journal publication. 46 pages, 12 figures
From Code to Road: A Vehicle-in-the-Loop and Digital Twin-Based Framework for Central Car Server Testing in Autonomous Driving
Simulation is one of the most essential parts in the development stage of automotive software. However, purely virtual simulations often struggle to accurately capture all real-world factors due to limitations in modeling. To address this challenge, this work presents a test framework for automotive software on the centralized E/E architecture, which is a central car server in our case, based on Vehicle-in-the-Loop (ViL) and digital twin technology. The framework couples a physical test vehicle on a dynamometer test bench with its synchronized virtual counterpart in a simulation environment. Our approach provides a safe, reproducible, realistic, and cost-effective platform for validating autonomous driving algorithms with a centralized architecture. This test method eliminates the need to test individual physical ECUs and their communication protocols separately. In contrast to traditional ViL methods, the proposed framework runs the full autonomous driving software directly on the vehicle hardware after the simulation process, eliminating flashing and intermediate layers while enabling seamless virtual-physical integration and accurately reflecting centralized E/E behavior. In addition, incorporating mixed testing in both simulated and physical environments reduces the need for full hardware integration during the early stages of automotive development. Experimental case studies demonstrate the effectiveness of the framework in different test scenarios. These findings highlight the potential to reduce development and integration efforts for testing autonomous driving pipelines in the future.
comment: 8 pages; Accepted for publication at the 37th IEEE Intelligent Vehicles Symposium (IV), Detroit, MI, United States, June 22-25, 2026
Curve-Induced Dynamical Systems on Riemannian Manifolds and Lie Groups
Deploying robots in household environments requires safe, adaptable, and interpretable behaviors that respect the geometric structure of tasks. This structure is often represented on Lie groups and Riemannian manifolds, for example as poses on SE(3) or as symmetric positive definite matrices encoding stiffness or damping. In this context, dynamical system-based approaches offer a natural framework for generating such behavior, providing stability and convergence while remaining responsive to changes in the environment. We introduce Curve-induced Dynamical systems on Smooth Manifolds (CDSM), a real-time framework for constructing dynamical systems directly on Riemannian manifolds and Lie groups. The proposed approach constructs a nominal curve on the manifold and generates a dynamical system that combines a tangential component, which drives motion along the curve, with a normal component, which attracts the state toward the curve. We provide a stability analysis of the resulting dynamical system and validate the method quantitatively. On an S2 benchmark, CDSM demonstrates improved trajectory accuracy, reduced path deviation, and faster generation and query times compared to state-of-the-art methods. Finally, we demonstrate the practical applicability of the framework on both a robotic manipulator, where poses on SE(3) and damping matrices on SPD(n) are adapted online, and a mobile manipulator.
comment: Preprint, 14 pages, video linked in the paper, Saray Bakker and Martin Schonger contributed equally as first authors and are listed alphabetically
Computing Scaled Relative Graphs of Discrete-time LTI Systems from Data
Graphical methods for system analysis have played a central role in control theory. A recently emerging tool in this field is the Scaled Relative Graph (SRG). In this paper, we further extend its applicability by showing how the SRG of discrete-time linear-time-invariant (LTI) systems can be computed exactly from its state-space representation using linear matrix inequalities. We additionally propose a fully data-driven approach where we demonstrate how to compute the SRG exclusively from input-output data. Furthermore, we introduce a robust version of the SRG, which can be computed from noisy data trajectories and contains the SRG of the actual system.
comment: 11 pages, 3 figures, submitted for possible publication
Uncertainty and Autarky: Cooperative Game Theory for Stable Local Energy Market Partitioning
Local energy markets empower prosumers to form coalitions for energy trading. However, the optimal partitioning of the distribution grid into such coalitions remains unclear, especially in constrained grids with stochastic production and consumption. This analysis must take into account the interests of both the grid operator and the constituent prosumers. In this work, we present a cooperative game theoretic framework to study distribution grid partitioning into local energy market coalitions under uncertain prosumption and grid constraints. We formulate the optimal stable partitioning problem to balance the interests of the grid operator with those of prosumers. Under deterministic load and generation, we show that the largest market coalition is the optimal stable partition. For the case of stochastic loads and generation, we provide an algorithm to evaluate the optimal stable partition. Numerical experiments are performed on benchmark and real-world distribution grids. Our results help in understanding how uncertainty affects local energy market partitioning decisions in constrained distribution grids.
Trajectory Tracking for Uncrewed Surface Vessels with Input Saturation and Dynamic Motion Constraints
This work addresses the problem of constrained motion control of uncrewed surface vessels. The constraints are imposed on the states and inputs of the vehicles due to physical limitations, mission requirements, and safety considerations. We develop a nonlinear feedback controller utilizing log-type Barrier Lyapunov Functions to enforce static and dynamic motion constraints. The proposed scheme uniquely addresses asymmetric constraints on position and heading alongside symmetric constraints on surge, sway, and yaw rates. Additionally, a smooth input saturation model is incorporated in the design to guarantee stability even under actuator bounds, which, if unaccounted for, can lead to severe performance degradation and poor tracking. Rigorous Lyapunov stability analysis shows that the closed-loop system remains stable and that all state variables remain within their prescribed bounds at all times, provided the initial conditions also lie within those bounds. Numerical simulations demonstrate the effectiveness of the proposed strategies for surface vessels without violating the motion and actuator constraints.
comment: 32 pages, 7 figures
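For readers unfamiliar with log-type Barrier Lyapunov Functions, the standard symmetric form for a scalar tracking error e with a constant bound k_b is shown below. This is the textbook construction, given here only as background; the asymmetric and time-varying bounds treated in the paper require a modified form that the abstract does not spell out.

```latex
% Symmetric log-type Barrier Lyapunov Function (constant bound k_b assumed).
V(e) = \frac{1}{2}\,\ln\!\frac{k_b^{2}}{k_b^{2} - e^{2}}, \qquad |e| < k_b ,
\qquad
\dot V = \frac{e\,\dot e}{k_b^{2} - e^{2}} .
% V(0) = 0, V > 0 for 0 < |e| < k_b, and V \to \infty as |e| \to k_b,
% so keeping V bounded along trajectories keeps |e(t)| < k_b for all t,
% provided |e(0)| < k_b.
```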
Formal Entropy-Regularized Control of Stochastic Systems
Analyzing and controlling system entropy is a powerful tool for regulating predictability of control systems. Applications benefiting from such approaches range from reinforcement learning and data security to human-robot collaboration. In continuous-state stochastic systems, accurate entropy analysis and control remains a challenge. In recent years, finite-state abstractions of continuous systems have enabled control synthesis with formal performance guarantees on objectives such as stage costs. However, these results do not extend to entropy-based performance measures. We solve this problem by first obtaining bounds on the entropy of system discretizations using traditional formal-abstractions results, and then obtaining an additional bound on the difference between the entropy of a continuous distribution and that of its discretization. The resulting theory enables formal entropy-aware controller synthesis that trades predictability against control performance while preserving formal guarantees for the original continuous system. More specifically, we focus on minimizing the linear combination of the KL divergence of the system trajectory distribution to uniform -- our system entropy metric -- and a generic cumulative cost. We note that the bound we derive on the difference between the KL divergence to uniform of a given continuous distribution and its discretization can also be relevant in more general information-theoretic contexts. A set of case studies illustrates the effectiveness of the method.
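The link between the KL divergence to uniform and entropy used in the abstract above can be made explicit with a short derivation. It assumes a density p supported on a bounded set of finite volume, which is an assumption added here for the uniform density to be well defined.

```latex
% Assumption: p is a density on a bounded set \mathcal{X} with volume
% |\mathcal{X}| < \infty, and u(x) = 1/|\mathcal{X}| is the uniform density.
D_{\mathrm{KL}}(p \,\|\, u)
  = \int_{\mathcal{X}} p(x)\,\ln\frac{p(x)}{1/|\mathcal{X}|}\,dx
  = \ln|\mathcal{X}| + \int_{\mathcal{X}} p(x)\ln p(x)\,dx
  = \ln|\mathcal{X}| - h(p),
\qquad
h(p) = -\int_{\mathcal{X}} p(x)\ln p(x)\,dx .
% Discrete analogue on n states: D_KL(p || u) = \ln n - H(p).
% Driving the KL divergence to uniform down is therefore the same as
% driving the entropy up, and vice versa.
```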
Receding-Horizon Maximum-Likelihood Estimation of Neural-ODE Dynamics and Thresholds from Event Cameras
Event cameras emit asynchronous brightness-change events: each pixel triggers an event when its log-intensity change since the last event exceeds a threshold, yielding a history-dependent measurement model. We address online maximum-likelihood identification of continuous-time dynamics from such streams. The latent state follows a Neural ODE and is mapped to predicted log-intensity through a differentiable state-to-image model. We model events with a history-dependent marked point process whose conditional intensity is a smooth surrogate of contrast-threshold triggering, treating the contrast threshold as an unknown parameter. The resulting log-likelihood consists of an event term and a compensator integral. We propose a receding-horizon estimator that performs a few gradient steps per update over the current window. For streaming evaluation, we store two scalars per pixel (last-event time and estimated log-intensity at that time) and approximate the compensator via Monte Carlo pixel subsampling. Synthetic experiments demonstrate joint recovery of dynamics parameters and the contrast threshold, and characterize accuracy--latency trade-offs with respect to the window length.
comment: to be submitted for publication
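The "event term plus compensator integral" structure mentioned in the event-camera abstract is the generic log-likelihood of a marked point process. The sketch below uses notation chosen here for illustration (pixel x, polarity p, parameters θ collecting the dynamics weights and the contrast threshold); the specific form of the conditional intensity is defined in the paper and is not reproduced.

```latex
% Events (t_i, x_i, p_i), i = 1..N, observed on a window [t_0, t_0 + T];
% \lambda_\theta is the conditional intensity given the history \mathcal{H}_t.
\ln L(\theta)
  = \underbrace{\sum_{i=1}^{N} \ln \lambda_\theta\!\left(t_i, x_i, p_i \mid \mathcal{H}_{t_i}\right)}_{\text{event term}}
  \;-\;
  \underbrace{\sum_{x}\sum_{p \in \{+,-\}} \int_{t_0}^{t_0+T}
      \lambda_\theta\!\left(t, x, p \mid \mathcal{H}_{t}\right) dt}_{\text{compensator}} .
% The sum over pixels x in the compensator is the part that a Monte Carlo
% subsample of pixels can approximate.
```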
A Unified Hybrid Control Architecture for Multi-DOF Robotic Manipulators
Multi-degree-of-freedom (DOF) robotic manipulators exhibit strongly nonlinear, high-dimensional, and coupled dynamics, posing significant challenges for controller design. To address these issues, this work proposes a unified hybrid control architecture that integrates model predictive control (MPC) with feedback regulation, together with a stability analysis of the proposed scheme. The proposed approach mitigates the optimization difficulty associated with high-dimensional nonlinear systems and enhances overall control performance. Furthermore, a hardware implementation scheme based on machine learning (ML) is proposed to achieve high computational efficiency while maintaining control accuracy. Finally, simulation and hardware experiments under external disturbances validate the proposed architecture, demonstrating its superior performance, hardware feasibility, and generalization capability for multi-DOF manipulation tasks.
comment: 10 pages, 6 figures
Design of Grid Forming Multi Timescale Coordinated Control Strategies for Dynamic Virtual Power Plants
As the penetration level of distributed energy resources (DERs) continues to rise, traditional frequency and voltage support from synchronous machines declines. This weakens grid stability and increases the need for fast, adaptive control, especially in weak grids. However, most virtual power plants (VPPs) rely on static aggregation and plan-based resource allocation strategies. These methods overlook differences in device response times and limit flexibility for ancillary services. To address this issue, we propose a dynamic virtual power plant (DVPP) that coordinates heterogeneous resources across multiple time scales using grid-forming control. We first contrast grid-following and grid-forming converters: grid-following designs rely on a phase-locked loop, which can undermine stability in weak grids, whereas our DVPP applies virtual synchronous generator control at the aggregate level to provide effective inertia and damping. Then, we introduce a dynamic participation factor framework that measures each device's contribution through the frequency-active power and voltage-reactive power loops. Exploiting device heterogeneity, we adopt a banded allocation strategy: slow resources manage steady-state and low-frequency regulation; intermediate resources smooth transitions; and fast resources deliver rapid response and high-frequency damping. Comparative simulations demonstrate that this coordinated, timescale-aware approach enhances stability and ancillary service performance compared to conventional VPPs.
U-OBCA: Uncertainty-Aware Optimization-Based Collision Avoidance via Wasserstein Distributionally Robust Chance Constraints
Uncertainties arising from localization errors, trajectory prediction errors of moving obstacles, and environmental disturbances pose significant challenges to safe robot navigation. Existing uncertainty-aware planners often approximate polygon-shaped robots and obstacles using simple geometric primitives such as circles or ellipses. Though computationally convenient, these approximations substantially shrink the feasible space, leading to overly conservative trajectories and even planning failure in narrow environments. In addition, many such methods rely on specific assumptions about noise distributions, which may not hold in practice and thus limit their performance guarantees. To address these limitations, we extend the Optimization-Based Collision Avoidance (OBCA) framework to an uncertainty-aware formulation, termed \emph{U-OBCA}. The proposed method explicitly accounts for the collision risk between polygon-shaped robots and obstacles by formulating OBCA-based chance constraints, thereby avoiding geometric simplifications and reducing unnecessary conservatism. These probabilistic constraints are further tightened into deterministic nonlinear constraints under mild distributional assumptions, which can be solved efficiently by standard numerical optimization solvers. The proposed approach is validated through theoretical analysis, numerical simulations, and real-world experiments. The results demonstrate that U-OBCA significantly mitigates the conservatism in trajectory planning and achieves higher navigation efficiency compared to existing baseline methods, particularly in narrow and cluttered environments.
The Vertical Challenge of Low-Altitude Economy: Why We Need a Unified Height System?
The explosive growth of the low-altitude economy, driven by eVTOLs and UAVs, demands a unified digital infrastructure to ensure safety and scalability. However, the current aviation vertical references are dangerously fragmented: manned aviation relies on barometric pressure, cartography uses Mean Sea Level (MSL), and obstacle avoidance depends on Above Ground Level (AGL). This fragmentation creates significant ambiguity for autonomous systems and hinders cross-stakeholder interoperability. In this article, we propose Height Above Ellipsoid (HAE) as the standardized vertical reference for lower airspace. Unlike legacy systems prone to environmental drift and inconsistent datums, HAE provides a globally consistent, GNSS-native, and mathematically stable reference. We present a pragmatic bidirectional transformation framework to bridge HAE with legacy systems and demonstrate its efficacy through (1) real-world implementation in Shenzhen's partitioned airspace management, and (2) a probabilistic risk assessment driven by empirical flight logs from the PX4 ecosystem. Results show that transitioning to HAE reduces the required vertical separation minimum, effectively increasing dynamic airspace capacity while maintaining a target safety level. This work offers a roadmap for transitioning from analog height keeping to a digital-native vertical standard.
comment: 15 pages
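As a rough illustration of the kind of bidirectional height transformation the abstract above refers to, the following sketch assumes the standard geodetic relation h = H + N (ellipsoidal height equals orthometric height plus geoid undulation) and treats orthometric height as an approximation of MSL height. The function names and the sample undulation and terrain values are hypothetical and not taken from the paper. A real system would look up N from a geoid model and terrain elevation from a digital elevation model, and barometric altitude would need a separate atmospheric correction that is not shown here.

```python
def hae_to_msl(hae_m: float, geoid_undulation_m: float) -> float:
    """Approximate MSL (orthometric) height from height above ellipsoid: H = h - N."""
    return hae_m - geoid_undulation_m


def msl_to_hae(msl_m: float, geoid_undulation_m: float) -> float:
    """Inverse transformation: h = H + N."""
    return msl_m + geoid_undulation_m


def hae_to_agl(hae_m: float, geoid_undulation_m: float, terrain_msl_m: float) -> float:
    """Height above ground level, given the terrain elevation expressed in MSL."""
    return hae_to_msl(hae_m, geoid_undulation_m) - terrain_msl_m


if __name__ == "__main__":
    # Hypothetical values: aircraft at 120 m HAE, local undulation of -3.5 m,
    # terrain at 50 m MSL.
    hae, undulation, terrain = 120.0, -3.5, 50.0
    print("MSL:", hae_to_msl(hae, undulation))
    print("AGL:", hae_to_agl(hae, undulation, terrain))
    print("Round trip HAE:", msl_to_hae(hae_to_msl(hae, undulation), undulation))
```

Keeping HAE as the single stored quantity and deriving MSL or AGL on demand means only the lookup tables (geoid, terrain) carry location-dependent error, which is one way to read the abstract's argument for a GNSS-native reference.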
Policy Optimization of Mixed H2/H-infinity Control: Benign Nonconvexity and Global Optimality
Mixed H2/H-infinity control balances performance and robustness by minimizing an H2 cost bound subject to an H-infinity constraint. However, classical Riccati/LMI solutions offer limited insight into the nonconvex optimization landscape and do not readily scale to large-scale or data-driven settings. In this paper, we revisit mixed H2/H-infinity control from a modern policy optimization viewpoint, including the general two-channel and single-channel cases. One central result is that both cases enjoy a benign nonconvex structure: every stationary point is globally optimal. We characterize the H-infinity-constrained feasible set, which is open, path-connected, with boundary given exactly by policies saturating the H-infinity constraint. We also show that the mixed objective is real analytic in the interior with explicit gradient formulas. Our key analysis builds on an Extended Convex Lifting (ECL) framework that bridges nonconvex policy optimization and convex reformulations. The ECL constructions rely on non-strict Riccati inequalities that allow us to characterize global optimality. These insights reveal hidden convexity in mixed H2/H-infinity control and facilitate the design of scalable policy iteration methods in large-scale settings.
Data-Driven Control of a Magnetically Actuated Fish-Like Robot
Magnetically actuated fish-like robots offer promising solutions for underwater exploration due to their miniaturization and agility; however, precise control remains a significant challenge because of nonlinear fluid dynamics, flexible fin hysteresis, and the variable-duration control steps inherent to the actuation mechanism. This paper proposes a comprehensive data-driven control framework to address these complexities without relying on analytical modeling. Our methodology comprises three core components: 1) developing a forward dynamics model (FDM) using a neural network trained on real-world experimental data to capture state transitions under varying time steps; 2) integrating this FDM into a gradient-based model predictive control (G-MPC) architecture to optimize control inputs for path following; and 3) applying imitation learning to approximate the G-MPC policy, thereby reducing the computational cost for real-time implementation. We validate the approach through simulations utilizing the identified dynamics model. The results demonstrate that the G-MPC framework achieves accurate path convergence with minimal root mean square error (RMSE), and the imitation learning controller (ILC) effectively replicates this performance. This study highlights the potential of data-driven control strategies for the precise navigation of miniature, fish-like soft robots.
comment: Author's version of the paper presented at AROB-ISBC 2026
Adaptive Policy Switching of Two-Wheeled Differential Robots for Traversing over Diverse Terrains
Exploring lunar lava tubes requires robots to traverse without human intervention. Because pre-trained policies cannot fully cover all possible terrain conditions, our goal is to enable adaptive policy switching, where the robot selects an appropriate terrain-specialized model based on its current terrain features. This study investigates whether terrain types can be estimated effectively using posture-related observations collected during navigation. We fine-tuned a pre-trained policy using Proximal Policy Optimization (PPO), and then collected the robot's 3D orientation data as it moved across flat and rough terrain in a simulated lava-tube environment. Our analysis revealed that the standard deviation of the robot's pitch data shows a clear difference between these two terrain types. Using Gaussian mixture models (GMM), we evaluated terrain classification across various window sizes. An accuracy of more than 98% was achieved when using a 70-step window. The result suggests that short-term orientation data are sufficient for reliable terrain estimation, providing a foundation for adaptive policy switching.
comment: Author's version of the paper presented at AROB-ISBC 2026
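The terrain-estimation idea in the abstract above (windowed standard deviation of pitch, followed by a Gaussian mixture model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the pitch signals are synthetic Gaussian noise with hypothetical amplitudes, the windows are non-overlapping, and the two-component 1-D mixture is fitted with a plain EM loop written here for self-containment. Only the 70-step window size comes from the abstract.

```python
import math
import random
import statistics


def window_std(pitch, window):
    """Standard deviation of pitch over consecutive non-overlapping windows."""
    return [statistics.pstdev(pitch[i:i + window])
            for i in range(0, len(pitch) - window + 1, window)]


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    means = [min(xs), max(xs)]
    var0 = statistics.pvariance(xs) + 1e-9
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = sum(r[k] * (x - means[k]) ** 2
                               for r, x in zip(resp, xs)) / nk + 1e-9
            weights[k] = nk / len(xs)
    return weights, means, variances


def classify(x, model):
    """Label a window feature; the component with the larger mean is 'rough'."""
    weights, means, variances = model
    scores = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    rough = 0 if means[0] > means[1] else 1
    return "rough" if scores[rough] >= scores[1 - rough] else "flat"


def demo_accuracy(window=70, seed=0):
    """Classification accuracy on synthetic pitch data (hypothetical noise levels)."""
    rng = random.Random(seed)
    flat = [rng.gauss(0.0, 0.01) for _ in range(window * 20)]
    rough = [rng.gauss(0.0, 0.10) for _ in range(window * 20)]
    features = window_std(flat, window) + window_std(rough, window)
    labels = ["flat"] * 20 + ["rough"] * 20
    model = fit_gmm_1d(features)
    hits = sum(classify(x, model) == y for x, y in zip(features, labels))
    return hits / len(labels)


if __name__ == "__main__":
    print("accuracy on synthetic data:", demo_accuracy())
```

Because the mixture is fitted without labels, the sketch identifies the "rough" component only by its larger mean; on real data that mapping would need to be checked against ground truth.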
Multistage Stochastic Programming for Rare Event Risk Mitigation in Power Systems Management
High intermittent renewable penetration in the energy mix presents challenges in robustness for the management of power systems' operation. If a tail realization of the distribution of weather yields a prolonged period of time during which solar irradiation and wind speed are insufficient for satisfying energy demand, then it becomes critical to ramp up the generation of conventional power plants with adequate foresight. This event trigger is costly, and inaccurate forecasting can either be wasteful or yield catastrophic undersupply. This encourages particular attention to accurate modeling of the noise and the resulting dynamics within the aforementioned scenario. In this work we present a method for rare event-aware control of power systems using multi-stage scenario-based optimization. A Fleming-Viot particle approach is used to bias the scenario generation towards rare realizations of very low wind power, in order to obtain a cost-effective control of conventional power plants that is robust under prolonged renewable energy shortfalls.
comment: 8 pages, 1 figure, 1 table
Combinatorial Safety-Critical Coordination of Multi-Agent Systems via Mixed-Integer Responsibility Allocation and Control Barrier Functions
This paper presents a hybrid safety-critical coordination architecture for multi-agent systems operating in dense environments. While control barrier functions (CBFs) provide formal safety guarantees, decentralized implementations typically rely on ego-centric safety filtering and may lead to redundant constraint enforcement and conservative collective behavior. To address this limitation, we introduce a combinatorial coordination layer formulated as a mixed-integer linear program (MILP) that assigns collision-avoidance responsibilities among agents. By explicitly distributing enforcement tasks, redundant reactions are eliminated and computational complexity is reduced. Each agent subsequently solves a reduced local quadratic program enforcing only its assigned constraints.
comment: 6 pages, 4 figures, submitted to the IEEE for possible publication
Exploring Uncertainty Propagation in Coupled Hydrologic and Hydrodynamic Systems via Distribution-Agnostic State Space Analysis
Accurate overland runoff and infiltration predictions are critical for effective water resources management, in particular for urban flood management. However, the inherent uncertainty in rainfall patterns, soil properties, and initial conditions makes reliable flood forecasting a challenging task. This paper presents a framework for quantifying the impact of these uncertainties on hydrologic and hydrodynamic simulations via a state space approach based on a differential algebraic equation (DAE) formulation that couples surface and subsurface constraints with the governing dynamics. Under this formulation, the complex interactions between overland flow and infiltration dynamics are captured in real time. To account for uncertainty in inputs and parameters, the proposed framework quantifies and propagates these uncertainties through the DAE model formulation under partial measurements. The effectiveness of the approach is demonstrated through a series of numerical experiments on synthetic and real-world catchments, highlighting its ability to provide probabilistic estimates of watershed state conditions while accounting for uncertainty. An important aspect of the proposed methods is that they are distribution-agnostic, i.e., they only require covariances of uncertainty and not specific types of distributions. The proposed framework is further validated against Monte Carlo (MC) ensemble simulations while providing probabilistic state estimates for measured and unmeasured watershed states under partial gauging.
Electrical Power Network Modeling Framework for Wildfire Risk and Resilience Analysis
The increasing intensity and frequency of wildfires are causing significant economic and societal impacts on communities through direct effects on the built environment, particularly critical infrastructure. Electrical systems can both initiate wildfires (grid-to-fire) and be damaged by wildfire exposure (fire-to-grid). Therefore, resilient electric systems can both limit ignitions and be hardened such that they are more robust to fire demands. Researchers have investigated wildfire mitigation strategies using traditional transmission and distribution electrical test-system models. However, these test cases may not accurately represent realistic electrical system configurations or fuel landscapes, nor capture community impacts, particularly the social and economic effects of mitigation strategies. A wildfire-aware modeling framework enables researchers to develop test cases that benchmark resilience and mitigation strategies while reducing reliance on overly simplistic assumptions about wildfire effects on electrical systems and communities. This study proposes a modeling framework for wildfire-electrical system research by analyzing recent literature and identifying key dimensions as well as gaps within these dimensions. In particular, the framework considers how fire in the wildland-urban interface propagates in space and time, how hazard-infrastructure interactions (e.g., wind and fire) cause system- and component-level damage, and how wildfire-related power outages affect communities.
comment: 10 pages, 2 figures
Introducing the transitional autonomous vehicle lane-changing dataset: Empirical Experiments
Transitional autonomous vehicles (tAVs), which operate beyond SAE Level 1-2 automation but short of full autonomy, are increasingly sharing the road with human-driven vehicles (HDVs). As these systems interact during complex maneuvers such as lane changes, new patterns may emerge with implications for traffic stability and safety. Assessing these dynamics, particularly during mandatory lane changes, requires high-resolution trajectory data, yet datasets capturing tAV lane-changing behavior are scarce. This study introduces the North Carolina Transitional Autonomous Vehicle Lane-Changing (NC-tALC) Dataset, a high-fidelity trajectory dataset designed to characterize tAV interactions during lane-changing maneuvers. The dataset includes two controlled experimental series. In the first, tAV lane-changing experiments, a tAV executes lane changes in the presence of adaptive cruise control (ACC) equipped target vehicles, enabling analysis of lane-changing execution. In the second, tAV responding experiments, two tAVs act as followers and respond to cut-in maneuvers initiated by another tAV, enabling analysis of follower response dynamics. The dataset contains 152 trials (72 lane-changing and 80 responding trials) sampled at 20 Hz with centimeter-level RTK-GPS accuracy. The NC-tALC dataset provides a rigorous empirical foundation for evaluating tAV decision-making and interaction dynamics in controlled mandatory lane-changing scenarios.
Regret Guarantees for Model-Free Cooperative Filtering under Asynchronous Observations
Predicting the output of a dynamical system from streaming data is fundamental to real-time feedback control and decision-making. We first derive an autoregressive representation that relates future local outputs to asynchronous past outputs. Building on this structure, we propose an online least-squares algorithm to learn this autoregressive model for real-time prediction. We then establish a regret bound of O(log^3 N) relative to the optimal model-based predictor, which holds for marginally stable systems. Moreover, we provide a sufficient condition characterized via a symplectic matrix, under which the proposed cooperative online learning method provably outperforms the optimal model-based predictor that relies solely on local observations. From a technical standpoint, our analysis exploits the orthogonality of the innovation process under the asynchronous data structure and the persistent excitation of the Gram matrix despite delay-induced asymmetries. Overall, these results offer both theoretical guarantees and practical algorithms for model-free cooperative prediction with asynchronous observations, thereby enriching the theory of online learning for dynamical systems.
CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety -- traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time rollouts. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy -- both enforcing safer actions and biasing towards safer rewards -- enabling safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, enabling the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
comment: 8 pages
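To make the notion of a closed-form safety filter in the abstract above concrete, here is a minimal sketch of the well-known closed-form solution of a single-constraint CBF quadratic program, min ||u - u_nom||^2 subject to a·u + b >= 0. It is not the CBF-RL implementation. The single-integrator dynamics (x' = u), the circular-obstacle barrier h(x) = ||x - c||^2 - r^2, the class-K gain alpha, and all function names are assumptions made for illustration.

```python
def cbf_filter(u_nom, a, b):
    """Closed-form minimizer of ||u - u_nom||^2 subject to a.u + b >= 0."""
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)          # nominal action already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        return list(u_nom)          # the input cannot affect the constraint
    # Project u_nom onto the boundary of the half-space a.u + b >= 0.
    return [ui - slack * ai / norm_sq for ui, ai in zip(u_nom, a)]


def obstacle_constraint(x, center, radius, alpha):
    """CBF condition grad_h(x).u + alpha*h(x) >= 0 for x' = u (assumed dynamics)."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in diff) - radius ** 2
    a = [2.0 * d for d in diff]     # gradient of h with respect to x
    b = alpha * h
    return a, b


if __name__ == "__main__":
    # Hypothetical scenario: agent at (2, 0) heading straight at a unit-radius obstacle.
    a, b = obstacle_constraint([2.0, 0.0], [0.0, 0.0], radius=1.0, alpha=1.0)
    print("filtered action:", cbf_filter([-2.0, 0.0], a, b))
```

In a training loop, a filter of this form could be applied to each policy action before stepping the simulator; how the correction enters the reward is specific to the paper and is not reproduced here.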
Bimorph Lithium Niobate Piezoelectric Micromachined Ultrasonic Transducers
Piezoelectric micromachined ultrasonic transducers (PMUTs) are widely utilized in applications that demand mechanical resilience, thermal stability, and compact form factors. Recent efforts have sought to demonstrate that single-crystal lithium niobate (LN) is a promising PMUT material platform, offering high electromechanical coupling ($k^2$) and bidirectional performance. In addition, advances in LN film transfer technology have enabled high quality periodically poled piezoelectric films (P3F), facilitating a bimorph piezoelectric stack without intermediate electrodes. In this work, we showcase a bimorph PMUT incorporating a mechanically robust, 20 $\mu$m thick P3F LN active layer. We establish the motivation for LN PMUTs through a material comparison, followed by extensive membrane geometry optimization and subsequent enhancement of the PMUT's $k^2$. We demonstrate a 775 kHz flexural mode device with a quality factor (Q) of 200 and an extracted $k^2$ of 6.4%, yielding a high transmit efficiency of 65 nm/V with a mechanically robust active layer. We leverage the high performance to demonstrate extreme-temperature resilience, showcasing stable device operation up to 600 $^\circ$C and survival up to 900 $^\circ$C, highlighting LN's potential as a resilient PMUT platform.
comment: 13 pages, 22 figures
Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
Control-flow hijacking attacks manipulate orchestration mechanisms in multi-agent systems into performing unsafe actions that compromise the system and exfiltrate sensitive information. Recently proposed defenses, such as LlamaFirewall, rely on alignment checks of inter-agent communications to ensure that all agent invocations are "related to" and "likely to further" the original objective. We start by demonstrating control-flow hijacking attacks that evade these defenses even if alignment checks are performed by advanced LLMs. We argue that the safety and functionality objectives of multi-agent systems fundamentally conflict with each other. This conflict is exacerbated by the brittle definitions of "alignment" and the checkers' incomplete visibility into the execution context. We then propose, implement, and evaluate ControlValve, a new defense inspired by the principles of control-flow integrity and least privilege. ControlValve (1) generates permitted control-flow graphs for multi-agent systems, and (2) enforces that all executions comply with these graphs, along with contextual rules (generated in a zero-shot manner) for each agent invocation.
Infinite-Dimensional Closed-Loop Inverse Kinematics for Soft Robots via Neural Operators
For fully actuated rigid robots, kinematic inversion is a purely geometric problem, efficiently solved by closed-loop inverse kinematics (CLIK) schemes that compute joint configurations to position the robot body in space. For underactuated soft robots, however, not all configurations are attainable through control action, making kinematic inversion extremely challenging. Extensions of CLIK address this by introducing end-to-end mappings from actuation to task space for the controller to operate on, but typically assume finite dimensions of the underlying virtual configuration space. In this work, we formulate CLIK in the infinite-dimensional domain to reason about the entire soft robot shape while solving tasks. We do this by composing an actuation-to-shape map with a shape-to-task map, deriving the differential end-to-end kinematics via an infinite-dimensional chain rule, and thereby obtaining a Jacobian-based CLIK algorithm. Since this actuation-to-shape mapping is rarely available in closed form, we propose to learn it using differentiable neural operator networks. We first present an analytical study on a constant-curvature segment, and then apply the neural version of the algorithm to a three-fiber soft robotic arm whose underlying model relies on morphoelasticity and active filament theory.
A Signal Contract for Online Language Grounding and Discovery in Decision-Making
Autonomous systems increasingly receive time-sensitive contextual updates from humans through natural language, yet embedding language understanding inside decision-makers couples grounding to learning or planning. This increases redeployment burden when language conventions or domain knowledge change and can hinder diagnosability by confounding grounding errors with control errors. We address online language grounding where messy, evolving verbal reports are converted into control-relevant signals during execution through an interface that localises language updates while keeping downstream decision-makers language-agnostic. We propose LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement), an inference-only middleware that exposes a Signal Contract. The contract provides four outputs: policy priors, reward potentials, admissible-option constraints, and telemetry-based action prediction for efficient information gathering. We validate LUCIFER in a search-and-rescue (SAR)-inspired testbed using dual-phase, dual-client evaluation: (i) component benchmarks show reasoning-based extraction remains robust on self-correcting reports where pattern-matching baselines degrade, and (ii) system-level ablations with two structurally distinct clients (hierarchical RL and a hybrid A*+heuristics planner) show consistent necessity and synergy. Grounding improves safety, discovery improves information-collection efficiency, and only their combination achieves both.
comment: 10 pages, 4 Figures, 4 Tables, submitted to the IEEE for possible publication
TEMPO-VINE: A Multi-Temporal Sensor Fusion Dataset for Localization and Mapping in Vineyards
In recent years, precision agriculture has been introducing groundbreaking innovations in the field, with a strong focus on automation. However, research studies in robotics and autonomous navigation often rely on controlled simulations or isolated field trials. The absence of a realistic common benchmark represents a significant limitation for the diffusion of robust autonomous systems under real complex agricultural conditions. Vineyards pose significant challenges due to their dynamic nature, and they are increasingly drawing attention from both academic and industrial stakeholders interested in automation. In this context, we introduce the TEMPO-VINE dataset, a large-scale multi-temporal dataset specifically designed for evaluating sensor fusion, simultaneous localization and mapping (SLAM), and place recognition techniques within operational vineyard environments. TEMPO-VINE is the first multi-modal public dataset that brings together data from heterogeneous LiDARs of different price levels, AHRS, RTK-GPS, and cameras in real trellis and pergola vineyards, with multiple rows exceeding 100 m in length. In this work, we address a critical gap in the landscape of agricultural datasets by providing researchers with a comprehensive data collection and ground truth trajectories in different seasons, vegetation growth stages, terrain and weather conditions. The sequence paths with multiple runs and revisits will foster the development of sensor fusion, localization, mapping and place recognition solutions for agricultural fields. The dataset, the processing tools and the benchmarking results are available on the webpage.
Simple generators of rational function fields
Consider a subfield of the field of rational functions in several indeterminates. We present an algorithm that, given a set of generators of such a subfield, finds a simple generating set. We provide an implementation of the algorithm and show that it improves upon the state of the art both in efficiency and the quality of the results. Furthermore, we demonstrate the utility of simplified generators through several case studies from different application domains, such as structural parameter identifiability. The main algorithmic novelties include performing only partial Gröbner basis computation via sparse interpolation and efficient search for polynomials of a fixed degree in a subfield of the rational function field.
Robust Control Lyapunov-Value Functions for Nonlinear Disturbed Systems
Control Lyapunov Functions (CLFs) have been extensively used in the control community. A well-known drawback is the absence of a systematic way to construct CLFs for general nonlinear systems, and the problem can become more complex with input or state constraints. Our preliminary work on constructing Control Lyapunov Value Functions (CLVFs) using Hamilton-Jacobi (HJ) reachability analysis provides a method for finding a non-smooth CLF. In this paper, we extend our work on CLVFs to systems with bounded disturbance and define the Robust CLVF (R-CLVF). The R-CLVF naturally inherits all properties of the CLVF; i.e., it first identifies the "smallest robust control invariant set (SRCIS)" and stabilizes the system to it with a user-specified exponential rate. The region from which the exponential rate can be met is called the "region of exponential stabilizability (ROES)." We provide clearer definitions of the SRCIS and more rigorous proofs of several important theorems. Since the computation of the R-CLVF suffers from the "curse of dimensionality," we also provide two techniques (warmstart and system decomposition) that solve it, along with necessary proofs. Three numerical examples are provided, validating our definition of SRCIS, illustrating the trade-off between a faster decay rate and a smaller ROES, and demonstrating the efficiency of computation using warmstart and decomposition.
comment: 17 pages, 5 figures
Optimal Real-Time Fusion of Time-Series Data Under Rényi Differential Privacy
In this paper, we investigate the optimal real-time fusion of data collected by multiple sensors. In our set-up, the sensor measurements are considered to be private and are jointly correlated with an underlying process. A fusion center combines the private sensor measurements and releases its output to an honest-but-curious party, which is responsible for estimating the state of the underlying process based on the fusion center's output. The privacy leakage incurred by the fusion policy is quantified using Rényi differential privacy. We formulate the privacy-aware fusion design as a constrained finite-horizon optimization problem, in which the fusion policy and the state estimation are jointly optimized to minimize the state estimation error subject to a total privacy budget constraint. We derive the constrained optimality conditions for the proposed optimization problem and use them to characterize the structural properties of the optimal fusion policy. Unlike classical differential privacy mechanisms, the optimal fusion policy is shown to adaptively allocate the privacy budget and regulate the adversary's belief in a closed-loop manner. To reduce the computational burden of solving the resulting constrained optimality equations, we parameterize the fusion policy using a structured Gaussian distribution and show that the parameterized fusion policy satisfies the privacy constraint. We further develop a numerical algorithm to jointly optimize the fusion policy and state estimator. Finally, we demonstrate the effectiveness of the proposed fusion framework through a traffic density estimation case study.
Efficient Path Generation with Curvature Guarantees by Mollification
Path generation, the process of converting high-level mission specifications, such as sequences of waypoints from a path planner, into smooth, executable paths, is a fundamental challenge in mobile robotics. Most path following and trajectory tracking algorithms require the desired path to be defined by at least twice continuously differentiable functions to guarantee key properties such as global convergence, especially for nonholonomic robots like unicycles with speed constraints. Consequently, path generation methods must bridge the gap between convenient but non-differentiable planning outputs, such as piecewise linear segments, and the differentiability requirements imposed by downstream control algorithms. While techniques such as spline interpolation or optimization-based methods are commonly used to smooth non-differentiable paths or create feasible ones from sequences of waypoints, they either produce unnecessarily complex trajectories or are computationally expensive. In this work, we present a method to regularize non-differentiable functions and generate feasible paths through mollification. Specifically, we approximate an arbitrary path with a differentiable function that can converge to it with arbitrary precision. Additionally, we provide a systematic method for bounding the curvature of generated paths, which we demonstrate by applying it to paths resulting from linking a sequence of waypoints with segments. The proposed approach is analytically shown to be computationally more efficient than standard interpolation methods, enabling real-time implementation on microcontrollers, while remaining compatible with standard trajectory tracking and path following algorithms.
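The mollification step described in the abstract above can be illustrated with a small numerical sketch. It assumes the standard compactly supported bump kernel exp(-1/(1 - t^2)) and a midpoint-rule quadrature, and it smooths a scalar function with a single corner (f(x) = |x|) rather than a full waypoint path. The function names and the quadrature resolution are hypothetical, and the paper's curvature bounds are not reproduced. For a planar path, the same operation could be applied to each coordinate function separately.

```python
import math


def bump(t):
    """Unnormalized smooth bump kernel supported on (-1, 1)."""
    return math.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1.0 else 0.0


def mollify(f, eps, n=400):
    """Return the convolution of f with a bump kernel of support [-eps, eps]."""
    # Midpoint quadrature nodes in (-1, 1), with weights normalized to sum to one.
    nodes = [-1.0 + (2 * i + 1) / n for i in range(n)]
    weights = [bump(t) for t in nodes]
    total = sum(weights)
    weights = [w / total for w in weights]

    def f_eps(x):
        return sum(w * f(x - eps * t) for w, t in zip(weights, nodes))

    return f_eps


if __name__ == "__main__":
    smooth_abs = mollify(abs, eps=0.1)
    for x in (-0.2, -0.05, 0.0, 0.05, 0.2):
        print(f"x = {x:+.2f}   |x| = {abs(x):.4f}   mollified = {smooth_abs(x):.4f}")
```

Because the kernel is symmetric and has unit mass, the mollified function agrees with the original wherever the original is linear over the whole kernel support, so only a neighborhood of width eps around each corner is altered. Shrinking eps tightens the approximation at the cost of higher curvature near the corner.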
Risk-Aware Autonomous Driving with Linear Temporal Logic Specifications
Human drivers naturally balance the risks of different concerns while driving, including traffic rule violations, minor accidents, and fatalities. However, achieving the same behavior in autonomous driving systems remains an open problem. This paper extends a risk metric that has been verified in human-like driving studies to encompass more complex driving scenarios specified by linear temporal logic (LTL) that go beyond just collision risks. This extension incorporates the timing and severity of events into LTL specifications, thereby reflecting a human-like risk awareness. Without sacrificing expressivity for traffic rules, we adopt LTL specifications composed of safety and co-safety formulas, allowing the control synthesis problem to be reformulated as a reachability problem. By leveraging occupation measures, we further formulate a linear programming (LP) problem for this LTL-based risk metric. Consequently, the synthesized policy balances different types of driving risks, including both collision risks and traffic rule violations. The effectiveness of the proposed approach is validated on three typical traffic scenarios in the CARLA simulator.
Best Ergodic Averages via Optimal Graph Filters in Reversible Markov Chains
In this paper, we address the problem of finding the best ergodic or Birkhoff averages in the mean ergodic theorem to ensure rapid convergence to a desired value, using graph filters. Our approach begins by representing a function on the state space as a graph signal, where the (directed) graph is formed by the transition probabilities of a reversible Markov chain. We introduce a concept of graph variation, enabling the definition of the graph Fourier transform for graph signals on this directed graph. Viewing the iteration in the mean ergodic theorem as a graph filter, we recognize its non-optimality and propose three optimization problems aimed at determining optimal graph filters. These optimization problems yield the Bernstein, Chebyshev, and Legendre filters. Numerical testing reveals that while the Bernstein filter performs slightly better than the traditional ergodic average, the Chebyshev and Legendre filters significantly outperform the ergodic average, demonstrating rapid convergence to the desired value.
comment: 22 pages
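The filter viewpoint in the abstract above can be illustrated in the spectral domain. For a reversible chain the transition operator has real eigenvalues in [-1, 1], and a polynomial filter q with q(1) = 1 preserves the stationary mean while scaling each other eigen-component by q(lambda). The sketch below compares the ergodic average with the textbook Chebyshev acceleration polynomial on an interval [a, b] assumed to contain all eigenvalues other than 1. Both the interval and the specific Chebyshev construction are assumptions for illustration, and the filter derived in the paper may differ.

```python
def ergodic_response(lam, degree):
    """Spectral response of the ergodic average (1/(d+1)) * sum_{k=0}^{d} P^k."""
    return sum(lam ** k for k in range(degree + 1)) / (degree + 1)


def chebyshev_T(n, x):
    """Chebyshev polynomial of the first kind via the three-term recurrence."""
    t_prev, t_cur = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_cur = t_cur, 2.0 * x * t_cur - t_prev
    return t_cur


def chebyshev_response(lam, degree, a, b):
    """Scaled Chebyshev filter, normalized so that the response at lam = 1 is 1."""
    def to_unit(z):
        return (2.0 * z - a - b) / (b - a)   # affine map sending [a, b] to [-1, 1]
    return chebyshev_T(degree, to_unit(lam)) / chebyshev_T(degree, to_unit(1.0))


def worst_case(filter_fn, a, b, grid=200):
    """Largest absolute response over a uniform grid on [a, b]."""
    return max(abs(filter_fn(a + (b - a) * i / grid)) for i in range(grid + 1))


if __name__ == "__main__":
    a, b, degree = -1.0, 0.9, 20   # hypothetical spectral bounds and filter degree
    print("ergodic average:", worst_case(lambda z: ergodic_response(z, degree), a, b))
    print("Chebyshev filter:", worst_case(lambda z: chebyshev_response(z, degree, a, b), a, b))
```

A smaller worst-case response on [a, b] means every non-stationary component is damped more strongly by a filter of the same degree, which is the sense in which such filters can converge faster than the plain ergodic average.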
Parameter Stress Analysis in Reinforcement Learning: Applying Synaptic Filtering to Policy Networks
This paper explores reinforcement learning (RL) policy robustness by systematically analyzing network parameters under internal and external stresses. We apply synaptic filtering methods using high-pass, low-pass, and pulse-wave filters from \citep{pravin2024fragility} as an internal stress by selectively perturbing parameters, while adversarial attacks apply external stress through modified agent observations. This dual approach enables the classification of parameters as \textit{fragile}, \textit{robust}, or \textit{antifragile}, based on their influence on policy performance in clean and adversarial settings. Parameter scores are defined to quantify these characteristics, and the framework is validated on proximal policy optimization (PPO)-trained agents in MuJoCo continuous control environments. The results highlight the presence of antifragile parameters that enhance policy performance under stress, demonstrating the potential of targeted filtering techniques to improve RL policy adaptability. These insights provide a foundation for future advancements in the design of robust and antifragile RL systems.
Randomized Greedy Methods for Weak Submodular Sensor Selection with Robustness Considerations
We study a pair of budget- and performance-constrained weak-submodular maximization problems. For computational efficiency, we explore the use of stochastic greedy algorithms which limit the search space via random sampling instead of the standard greedy procedure which explores the entire feasible search space. We propose a pair of stochastic greedy algorithms, namely, Modified Randomized Greedy (MRG) and Dual Randomized Greedy (DRG) to approximately solve the budget- and performance-constrained problems, respectively. For both algorithms, we derive approximation guarantees that hold with high probability. We then examine the use of DRG in robust optimization problems wherein the objective is to maximize the worst-case of a number of weak submodular objectives and propose the Randomized Weak Submodular Saturation Algorithm (Random-WSSA). We further derive a high-probability guarantee for when Random-WSSA successfully constructs a robust solution. Finally, we showcase the effectiveness of these algorithms in a variety of relevant uses within the context of Earth-observing low Earth orbit satellite constellations which estimate atmospheric weather conditions and provide Earth coverage.
comment: 26 pages, 5 figures. This work was presented in part at the 2023 American Control Conference (ACC). The full work was published in Automatica, 2025
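To make the stochastic greedy idea in the abstract above concrete, here is a minimal sketch of the generic technique: each round scores only a random sample of the remaining candidates instead of the whole feasible set. This is not the paper's MRG or DRG algorithm. The function names, the toy coverage objective, and the satellite footprints are all hypothetical and chosen only for illustration.

```python
import random


def stochastic_greedy(ground_set, objective, budget, sample_size, seed=0):
    """Pick up to `budget` elements; each round scores only a random sample."""
    rng = random.Random(seed)
    selected = []
    remaining = list(ground_set)
    for _ in range(budget):
        if not remaining:
            break
        # Restrict the search to a random subset of the remaining candidates.
        candidates = rng.sample(remaining, min(sample_size, len(remaining)))
        base = objective(selected)
        # Keep the sampled candidate with the largest marginal gain.
        best = max(candidates, key=lambda e: objective(selected + [e]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def coverage(chosen, footprints):
    """Toy objective (hypothetical): number of distinct cells covered."""
    covered = set()
    for sensor in chosen:
        covered |= footprints[sensor]
    return len(covered)


# Hypothetical sensor footprints, expressed as sets of covered cell ids.
FOOTPRINTS = {
    "sat_a": {1, 2, 3},
    "sat_b": {3, 4},
    "sat_c": {5, 6, 7, 8},
    "sat_d": {1, 8},
}

# With sample_size equal to the ground-set size this reduces to standard greedy.
picked = stochastic_greedy(
    FOOTPRINTS, lambda c: coverage(c, FOOTPRINTS), budget=2, sample_size=4
)
```

Shrinking `sample_size` trades solution quality for fewer objective evaluations per round, which is the computational motivation stated in the abstract.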
Dependent Reachable Sets for the Constant Bearing Pursuit Strategy
This paper introduces a novel reachability problem for the scenario involving two agents, where one agent follows another agent using a feedback strategy. The geometry of the reachable set for an agent, termed dependent reachable set, is characterized using the constant bearing pursuit strategy as a case study. Key theoretical results are presented that provide geometric bounds for the associated dependent reachable set. Simulation results are presented to empirically establish the shape of the dependent reachable set. In the process, an original optimization problem is formulated and analyzed for the constant bearing pursuit strategy.
comment: This work has been submitted to a journal for possible publication
Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks ICRA 2026
Generative control policies have recently unlocked major progress in robotics. These methods produce action sequences via diffusion or flow matching, with training data provided by demonstrations. But existing methods come with two key limitations: they require expert demonstrations, which can be difficult to obtain, and they are limited to relatively slow, quasi-static tasks. In this paper, we leverage a tight connection between sampling-based predictive control and generative modeling to address each of these issues. In particular, we introduce generative predictive control, a supervised learning framework for tasks with fast dynamics that are easy to simulate but difficult to demonstrate. We then show how trained flow-matching policies can be warm-started at inference time, maintaining temporal consistency and enabling high-frequency feedback. We believe that generative predictive control offers a complementary approach to existing behavior cloning methods, and hope that it paves the way toward generalist policies that extend beyond quasi-static demonstration-oriented tasks.
comment: ICRA 2026
A Digital Pheromone-Based Approach for In-Control/Out-of-Control Classification
In complex production lines, it is essential to have strict, fast-acting rules to determine whether the system is In Control (InC) or Out of Control (OutC). This study explores a bio-inspired method that digitally mimics ant colony behavior to classify InC/OutC states and forecast imminent transitions requiring maintenance. A case study on industrial potato chip frying provides the application context. During each two-minute frying cycle, sequences of eight temperature readings are collected. Each sequence is treated as a digital ant depositing virtual pheromones, generating a Base Score. New sequences, representing new ants, can either reinforce or weaken this score, leading to a Modified Base Score that reflects the system's evolving condition. Signals such as extreme temperatures, large variations within a sequence, or the detection of change-points contribute to a Threat Score, which is added to the Modified Base Score. Since pheromones naturally decay over time unless reinforced, an Environmental Score is incorporated to reflect recent system dynamics, imitating real ant behavior. This score is calculated from the Modified Base Scores collected over the past hour. The resulting Total Score, obtained as the sum of the Modified Base Score, Threat Score, and Environmental Score, is used as the main indicator for real-time system classification and forecasting of transitions from InC to OutC. This ant colony optimization-inspired approach provides an adaptive and interpretable framework for process monitoring and predictive maintenance in industrial environments.
Robotics
ManipulationNet: An Infrastructure for Benchmarking Real-World Robot Manipulation with Physical Skill Challenges and Embodied Multimodal Reasoning
Dexterous manipulation enables robots to purposefully alter the physical world, transforming them from passive observers into active agents in unstructured environments. This capability is the cornerstone of physical artificial intelligence. Despite decades of advances in hardware, perception, control, and learning, progress toward general manipulation systems remains fragmented due to the absence of widely adopted standard benchmarks. The central challenge lies in reconciling the variability of the real world with the reproducibility and authenticity required for rigorous scientific evaluation. To address this, we introduce ManipulationNet, a global infrastructure that hosts real-world benchmark tasks for robotic manipulation. ManipulationNet delivers reproducible task setups through standardized hardware kits, and enables distributed performance evaluation via a unified software client that delivers real-time task instructions and collects benchmarking results. As a persistent and scalable infrastructure, ManipulationNet organizes benchmark tasks into two complementary tracks: 1) the Physical Skills Track, which evaluates low-level physical interaction skills, and 2) the Embodied Reasoning Track, which tests high-level reasoning and multimodal grounding abilities. This design fosters the systematic growth of an interconnected network of real-world abilities and skills, paving the path toward general robotic manipulation. By enabling comparable manipulation research in the real world at scale, this infrastructure establishes a sustainable foundation for measuring long-term scientific progress and identifying capabilities ready for real-world deployment.
comment: 32 pages, 8 figures
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots ICLR 2026
Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remains difficult to gauge how close we are to this vision. The field lacks a reproducible, large-scale benchmark for systematic evaluation. To fill this gap, we present RoboCasa365, a comprehensive simulation benchmark for household mobile manipulation. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1600 hours of synthetically generated demonstration data -- making it one of the most diverse and large-scale resources for studying generalist policies. RoboCasa365 is designed to support systematic evaluations for different problem settings, including multi-task learning, robot foundation model training, and lifelong learning. We conduct extensive experiments on this benchmark with state-of-the-art methods and analyze the impacts of task diversity, dataset scale, and environment variation on generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and inform strategies for future progress in the field.
comment: ICLR 2026; First three authors contributed equally
A Soft Robotic Demonstration in the Stratosphere
Machines designed for operation in Space, as well as other extreme environments, need to be both resilient and adaptable when mission parameters change. Soft robots offer advantages in adaptability, but most lack resilience to the pressure and temperature extremes found as close as the Stratosphere. Dielectric elastomer actuators overcome some of those limitations when built as solid state compliant capacitors capable of converting electrical energy into mechanical work, but the elastomer resilience limits the device's operating window. Here we present a crosslinking mechanism for silicone elastomers under ultraviolet light using trimethyl(methylcyclopentadienyl)platinum(IV) as a catalyst to react hydrosilane to vinyl groups. The formation of carbon-carbon bonds enables fast processing under UV light and exceptional electro-mechanical performance in dielectric elastomer actuators. The material resilience advantage is demonstrated in controlled experiments at -40° and 120° C, as well as near vacuum, in comparison with state-of-the-art acrylic and silicone chemistries. Fully autonomous systems controlling grippers made with the novel silicone were integrated into payloads for high altitude balloon testing. Two stratospheric balloon missions were carried out and demonstrated DEAs as a viable soft robotic technology under space-like conditions (as high as 23.6 km elevation, at <0.05 atm and -55° C). The combinations of chemical building blocks and catalyst can be further expanded to address other challenges for silicones, including adhesion and additive manufacturing.
Tendon Force Modeling for Sim2Real Transfer of Reinforcement Learning Policies for Tendon-Driven Robots
Robots which make use of soft or compliant interactions often leverage tendon-driven actuation, which enables actuators to be placed more flexibly and compliance to be maintained. However, controlling complex tendon systems is challenging. Simulation paired with reinforcement learning (RL) could enable more complex behaviors to be generated. Such methods rely on torque and force-based simulation roll-outs, which are limited by the sim-to-real gap stemming from the actuator and system dynamics, resulting in poor transfer of RL policies onto real robots. To address this, we propose a method to model the tendon forces produced by typical servo motors, focusing specifically on the transfer of RL policies for a tendon-driven finger. Our approach extends existing data-driven techniques by leveraging contextual history and a novel data collection test-bench. This test-bench allows us to capture tendon forces undergoing contact-rich interactions typical of real-world manipulation. We then utilize our force estimation model in a GPU-accelerated tendon force-driven rigid body simulation to train RL-based controllers. Our transformer-based model is capable of predicting tendon forces within 3% of the maximum motor force and is robot-agnostic. By integrating our learned model into simulation, we reduce the sim-to-real gap for test trajectories by 41%. An RL-based controller trained with our model achieves a 50% improvement in fingertip pose tracking tasks on real tendon-driven robotic fingers. This approach is generalizable to different actuators and robot systems, and can enable RL policies to be used widely across tendon systems, advancing capabilities of dexterous manipulators and soft robots.
comment: preprint
Gaussian Mixture-Based Inverse Perception Contract for Uncertainty-Aware Robot Navigation
Reliable navigation in cluttered environments requires perception outputs that are not only accurate but also equipped with uncertainty sets suitable for safe control. An inverse perception contract (IPC) provides such a connection by mapping perceptual estimates to sets that contain the ground truth with high confidence. Existing IPC formulations, however, instantiate uncertainty as a single ellipsoidal set and rely on deterministic trust scores to guide robot motion. Such a representation cannot capture the multi-modal and irregular structure of fine-grained perception errors, often resulting in over-conservative sets and degraded navigation performance. In this work, we introduce Gaussian Mixture-based Inverse Perception Contract (GM-IPC), which extends IPC to represent uncertainty with unions of ellipsoidal confidence sets derived from Gaussian mixture models. This design moves beyond deterministic single-set abstractions, enabling fine-grained, multi-modal, and non-convex error structures to be captured with formal guarantees. A learning framework is presented that trains GM-IPC to account for probabilistic inclusion, distribution matching, and empty-space penalties, ensuring both validity and compactness of the predicted sets. We further show that the resulting uncertainty characterizations can be leveraged in downstream planning frameworks for real-time safe navigation, enabling less conservative and more adaptive robot motion while preserving safety in a probabilistic manner.
comment: 8 pages, 5 figures. Accepted to ACC 2026 (American Control Conference)
Perception-Aware Time-Optimal Planning for Quadrotor Waypoint Flight
Agile quadrotor flight pushes the limits of control, actuation, and onboard perception. While time-optimal trajectory planning has been extensively studied, existing approaches typically neglect the tight coupling between vehicle dynamics, environmental geometry, and the visual requirements of onboard state estimation. As a result, trajectories that are dynamically feasible may fail in closed-loop execution due to degraded visual quality. This paper introduces a unified time-optimal trajectory optimization framework for vision-based quadrotors that explicitly incorporates perception constraints alongside full nonlinear dynamics, rotor actuation limits, aerodynamic effects, camera field-of-view constraints, and convex geometric gate representations. The proposed formulation solves minimum-time lap trajectories for arbitrary racetracks with diverse gate shapes and orientations, while remaining numerically robust and computationally efficient. We derive an information-theoretic position uncertainty metric to quantify visual state-estimation quality and integrate it into the planner through three perception objectives: position uncertainty minimization, sequential field-of-view constraints, and look-ahead alignment. This enables systematic exploration of the trade-offs between speed and perceptual reliability. To accurately track the resulting perception-aware trajectories, we develop a model predictive contouring tracking controller that separates lateral and progress errors. Experiments demonstrate real-world flight speeds up to 9.8 m/s with 0.07 m average tracking error, and closed-loop success rates improved from 55% to 100% on a challenging Split-S course. The proposed system provides a scalable benchmark for studying the fundamental limits of perception-aware, time-optimal autonomous flight.
Compliant In-hand Rolling Manipulation Using Tactile Sensing
We investigate in-hand rolling manipulation using a multifingered robot hand, where each finger is compliant and equipped with a tactile fingertip providing contact location and wrench information. We derive the equations of motion for compliant quasistatic in-hand rolling manipulation and formulate a fingertip rolling manipulation controller for multiple fingers to achieve a desired object twist within a grasp. The contact mechanics are demonstrated in simulation and the controller is tested on an experimental robot system.
OmniPlanner: Universal Exploration and Inspection Path Planning across Robot Morphologies
Autonomous robotic systems are increasingly deployed for mapping, monitoring, and inspection in complex and unstructured environments. However, most existing path planning approaches remain domain-specific (i.e., either on air, land, or sea), limiting their scalability and cross-platform applicability. This article presents OmniPlanner, a unified planning framework for autonomous exploration and inspection across aerial, ground, and underwater robots. The method integrates volumetric exploration and viewpoint-based inspection, alongside target reach behaviors within a single modular architecture, complemented by a platform abstraction layer that captures morphology-specific sensing, traversability and motion constraints. This enables the same planning strategy to generalize across distinct mobility domains with minimal retuning. The framework is validated through extensive simulation studies and field deployments in underground mines, industrial facilities, forests, submarine bunkers, and structured outdoor environments. Across these diverse scenarios, OmniPlanner demonstrates robust performance, consistent cross-domain generalization, and improved exploration and inspection efficiency compared to representative state-of-the-art baselines.
comment: The code for this paper is open-sourced and released at: https://github.com/ntnu-arl/gbplanner_ros/tree/gbplanner3
VANGUARD: Vehicle-Anchored Ground Sample Distance Estimation for UAVs in GPS-Denied Environments
Autonomous aerial robots operating in GPS-denied or communication-degraded environments frequently lose access to camera metadata and telemetry, leaving onboard perception systems unable to recover the absolute metric scale of the scene. As LLM/VLM-based planners are increasingly adopted as high-level agents for embodied systems, their ability to reason about physical dimensions becomes safety-critical -- yet our experiments show that five state-of-the-art VLMs suffer from spatial scale hallucinations, with median area estimation errors exceeding 50%. We propose VANGUARD, a lightweight, deterministic Geometric Perception Skill designed as a callable tool that any LLM-based agent can invoke to recover Ground Sample Distance (GSD) from ubiquitous environmental anchors: small vehicles detected via oriented bounding boxes, whose modal pixel length is robustly estimated through kernel density estimation and converted to GSD using a pre-calibrated reference length. The tool returns both a GSD estimate and a composite confidence score, enabling the calling agent to autonomously decide whether to trust the measurement or fall back to alternative strategies. On the DOTA v1.5 benchmark, VANGUARD achieves 6.87% median GSD error on 306 images. Integrated with SAM-based segmentation for downstream area measurement, the pipeline yields 19.7% median error on a 100-entry benchmark -- with 2.6x lower category dependence and 4x fewer catastrophic failures than the best VLM baseline -- demonstrating that equipping agents with deterministic geometric tools is essential for safe autonomous spatial reasoning.
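The GSD recovery step described in the VANGUARD abstract above (modal pixel length via kernel density estimation, then division of a reference length by that mode) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the 4.5 m reference length, the bandwidth, the grid search, and the sample pixel lengths are all assumptions, and the confidence score mentioned in the abstract is not modelled.

```python
import math


def kde_mode(samples, bandwidth, grid_steps=200):
    """Return the grid point that maximizes a Gaussian kernel density estimate."""
    lo, hi = min(samples), max(samples)
    if lo == hi:
        return float(lo)
    best_x, best_density = lo, -1.0
    for i in range(grid_steps + 1):
        x = lo + (hi - lo) * i / grid_steps
        # Unnormalized density: the normalizing constant does not move the mode.
        density = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def estimate_gsd(vehicle_lengths_px, reference_length_m=4.5, bandwidth=2.0):
    """GSD in metres per pixel = assumed vehicle length / modal pixel length."""
    modal_px = kde_mode(vehicle_lengths_px, bandwidth)
    return reference_length_m / modal_px


# Hypothetical long-side lengths of detected vehicle boxes, with two outliers.
lengths_px = [29.0, 30.0, 30.5, 31.0, 30.0, 29.5, 62.0, 12.0]
gsd = estimate_gsd(lengths_px)
```

Using the mode rather than the mean keeps the two outlier detections (for example a truck and a partially visible car) from shifting the estimate.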
RoboLight: A Dataset with Linearly Composable Illumination for Robotic Manipulation
In this paper, we introduce RoboLight, the first real-world robotic manipulation dataset capturing synchronized episodes under systematically varied lighting conditions. RoboLight consists of two components. (a) RoboLight-Real contains 2,800 real-world episodes collected in our custom Light Cube setup, a calibrated system equipped with eight programmable RGB LED lights. It includes structured illumination variation along three independently controlled dimensions: color, direction, and intensity. Each dimension is paired with a dedicated task featuring objects of diverse geometries and materials to induce perceptual challenges. All image data are recorded in high-dynamic-range (HDR) format to preserve radiometric accuracy. Leveraging the linearity of light transport, we introduce (b) RoboLight-Synthetic, comprising 196,000 episodes synthesized through interpolation in the HDR image space of RoboLight-Real. In principle, RoboLight-Synthetic can be arbitrarily expanded by refining the interpolation granularity. We further verify the dataset quality through qualitative analysis and real-world policy roll-outs, analyzing task difficulty, distributional diversity, and the effectiveness of synthesized data. We additionally demonstrate three representative use cases of the proposed dataset. The full dataset, along with the system software and hardware design, will be released as open-source to support continued research.
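The linearity of light transport that RoboLight-Synthetic relies on means that, in linear (HDR) pixel values, an image lit by several sources equals the weighted sum of the images lit by each source alone. The sketch below illustrates that property on tiny single-channel images stored as nested lists. The function name and the pixel values are hypothetical, and the dataset's actual interpolation pipeline may differ.

```python
def compose_illumination(basis_images, weights):
    """Weighted sum of linear (HDR) images, one basis image per light source."""
    if len(basis_images) != len(weights):
        raise ValueError("need exactly one weight per basis image")
    height, width = len(basis_images[0]), len(basis_images[0][0])
    out = [[0.0] * width for _ in range(height)]
    for image, weight in zip(basis_images, weights):
        for r in range(height):
            for c in range(width):
                out[r][c] += weight * image[r][c]
    return out


# Hypothetical 2x2 HDR captures of the same scene under two individual lights.
light_a = [[0.2, 0.4], [0.0, 1.0]]
light_b = [[1.0, 0.0], [0.5, 0.5]]

# Both lights at half intensity: any other weight pair gives a new lighting state.
half_mix = compose_illumination([light_a, light_b], [0.5, 0.5])
```

The same sum is not valid on tone-mapped or gamma-encoded images, which is one reason the abstract stresses recording in HDR format.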
AMP2026: A Multi-Platform Marine Robotics Dataset for Tracking and Mapping
Marine environments present significant challenges for perception and autonomy due to dynamic surfaces, limited visibility, and complex interactions between aerial, surface, and submerged sensing modalities. This paper introduces the Aerial Marine Perception Dataset (AMP2026), a multi-platform marine robotics dataset collected across multiple field deployments designed to support research in two primary areas: multi-view tracking and marine environment mapping. The dataset includes synchronized data from aerial drones, boat-mounted cameras, and submerged robotic platforms, along with associated localization and telemetry information. The goal of this work is to provide a publicly available dataset enabling research in marine perception and multi-robot observation scenarios. This paper describes the data collection methodology, sensor configurations, dataset organization, and intended research tasks supported by the dataset.
PRAM-R: A Perception-Reasoning-Action-Memory Framework with LLM-Guided Modality Routing for Adaptive Autonomous Driving
Multimodal perception enables robust autonomous driving but incurs unnecessary computational cost when all sensors remain active. This paper presents PRAM-R, a unified Perception-Reasoning-Action-Memory framework with LLM-Guided Modality Routing for adaptive autonomous driving. PRAM-R adopts an asynchronous dual-loop design: a fast reactive loop for perception and control, and a slow deliberative loop for reasoning-driven modality selection and memory updates. An LLM router selects and weights modalities using environmental context and sensor diagnostics, while a hierarchical memory module preserves temporal consistency and supports long-term adaptation. We conduct a two-stage evaluation: (1) synthetic stress tests for stability analysis and (2) real-world validation on the nuScenes dataset. Synthetic stress tests confirm 87.2% reduction in routing oscillations via hysteresis-based stabilization. Real-world validation on nuScenes shows 6.22% modality reduction with 20% memory recall while maintaining comparable trajectory accuracy to full-modality baselines in complex urban scenarios. Our work demonstrates that LLM-augmented architectures with hierarchical memory achieve efficient, adaptive multimodal perception in autonomous driving.
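The PRAM-R abstract above attributes the drop in routing oscillations to hysteresis-based stabilization without giving details. As a generic illustration of hysteresis (not the paper's router), the sketch below switches a modality on only when its score rises above an upper threshold and off only when it falls below a lower one. The thresholds and the score sequence are hypothetical.

```python
def route_with_hysteresis(scores, on_threshold=0.6, off_threshold=0.4, active=False):
    """Return the on/off state per step, using separate on and off thresholds."""
    states = []
    for score in scores:
        if not active and score >= on_threshold:
            active = True
        elif active and score <= off_threshold:
            active = False
        states.append(active)
    return states


def count_switches(states):
    """Number of on/off transitions in a state sequence."""
    return sum(a != b for a, b in zip(states, states[1:]))


# Hypothetical router scores for one sensor, hovering around the decision boundary.
scores = [0.45, 0.55, 0.48, 0.52, 0.65, 0.51, 0.45, 0.35]

# A single threshold (on == off) toggles on every small fluctuation.
naive = route_with_hysteresis(scores, on_threshold=0.5, off_threshold=0.5)
stable = route_with_hysteresis(scores)
```

On this sequence the single-threshold router switches four times, while the hysteresis band reduces that to two.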
GSeg3D: A High-Precision Grid-Based Algorithm for Safety-Critical Ground Segmentation in LiDAR Point Clouds
Ground segmentation in point cloud data is the process of separating ground points from non-ground points. This task is fundamental for perception in autonomous driving and robotics, where safety and reliable operation depend on the precise detection of obstacles and navigable surfaces. Existing methods often fall short of the high precision required in safety-critical environments, leading to false detections that can compromise decision-making. In this work, we present a ground segmentation approach designed to deliver consistently high precision, supporting the stringent requirements of autonomous vehicles and robotic systems operating in real-world, safety-critical scenarios.
Learning Hip Exoskeleton Control Policy via Predictive Neuromusculoskeletal Simulation
Developing exoskeleton controllers that generalize across diverse locomotor conditions typically requires extensive motion-capture data and biomechanical labeling, limiting scalability beyond instrumented laboratory settings. Here, we present a physics-based neuromusculoskeletal learning framework that trains a hip-exoskeleton control policy entirely in simulation, without motion-capture demonstrations, and deploys it on hardware via policy distillation. A reinforcement learning teacher policy is trained using a muscle-synergy action prior over a wide range of walking speeds and slopes through a two-stage curriculum, enabling direct comparison between assisted and no-exoskeleton conditions. In simulation, exoskeleton assistance reduces mean muscle activation by up to 3.4% and mean positive joint power by up to 7.0% on level ground and ramp ascent, with benefits increasing systematically with walking speed. On hardware, the assistance profiles learned in simulation are preserved across matched speed-slope conditions (r: 0.82, RMSE: 0.03 Nm/kg), providing quantitative evidence of sim-to-real transfer without additional hardware tuning. These results demonstrate that physics-based neuromusculoskeletal simulation can serve as a practical and scalable foundation for exoskeleton controller development, substantially reducing experimental burden during the design phase.
GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning ICRA2026
Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real-world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low-level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ a visual segmentation model (SAM2) to execute object segmentation on the garment pile, aiding VLM-based reasoning with sufficient visual cues. A mask fine-tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual-arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real-world and simulation environments. Project page: https://garmentpile2.github.io/.
comment: ICRA2026 Accepted
HBRB-BoW: A Retrained Bag-of-Words Vocabulary for ORB-SLAM via Hierarchical BRB-KMeans
In visual simultaneous localization and mapping (SLAM), the quality of the visual vocabulary is fundamental to the system's ability to represent environments and recognize locations. While ORB-SLAM is a widely used framework, its binary vocabulary, trained through the k-majority-based bag-of-words (BoW) approach, suffers from inherent precision loss. The inability of conventional binary clustering to represent subtle feature distributions leads to the degradation of visual words, a problem that is compounded as errors accumulate and propagate through the hierarchical tree structure. To address these structural deficiencies, this paper proposes hierarchical binary-to-real-and-back (HBRB)-BoW, a refined hierarchical binary vocabulary training algorithm. By integrating a global real-valued flow within the hierarchical clustering process, our method preserves high-fidelity descriptor information until the final binarization at the leaf nodes. Experimental results demonstrate that the proposed approach yields a more discriminative and well-structured vocabulary than traditional methods, significantly enhancing the representational integrity of the visual dictionary in complex environments. Furthermore, replacing the default ORB-SLAM vocabulary file with our HBRB-BoW file is expected to improve performance in loop closing and relocalization tasks.
Modeling and Control of a Pneumatic Soft Robotic Catheter Using Neural Koopman Operators ICRA
Catheter-based interventions are widely used for the diagnosis and treatment of cardiac diseases. Recently, robotic catheters have attracted attention for their ability to improve precision and stability over conventional manual approaches. However, accurate modeling and control of soft robotic catheters remain challenging due to their complex, nonlinear behavior. The Koopman operator enables lifting the original system data into a linear "lifted space", offering a data-driven framework for predictive control; however, manually chosen basis functions in the lifted space often oversimplify system behaviors and degrade control performance. To address this, we propose a neural network-enhanced Koopman operator framework that jointly learns the lifted space representation and Koopman operator in an end-to-end manner. Moreover, motivated by the need to minimize radiation exposure during X-ray fluoroscopy in cardiac ablation, we investigate open-loop control strategies using neural Koopman operators to reliably reach target poses without continuous imaging feedback. The proposed method is validated in two experimental scenarios: interactive position control and a simulated cardiac ablation task using an atrium-like cavity. Our approach achieves average errors of 2.1 ± 0.4 mm in position and 4.9 ± 0.6 degrees in orientation, outperforming not only model-based baselines but also other Koopman variants in targeting accuracy and efficiency. These results highlight the potential of the proposed framework for advancing soft robotic catheter systems and improving catheter-based interventions.
comment: 8 pages, 6 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Swimming Under Constraints: A Safe Reinforcement Learning Framework for Quadrupedal Bio-Inspired Propulsion
Bio-inspired aquatic propulsion offers high thrust and maneuverability but is prone to destabilizing forces such as lift fluctuations, which are further amplified by six-degree-of-freedom (6-DoF) fluid coupling. We formulate quadrupedal swimming as a constrained optimization problem that maximizes forward thrust while minimizing destabilizing fluctuations. Our proposed framework, Accelerated Constrained Proximal Policy Optimization with a PID-regulated Lagrange multiplier (ACPPO-PID), enforces constraints with a PID-regulated Lagrange multiplier, accelerates learning via conditional asymmetric clipping, and stabilizes updates through cycle-wise geometric aggregation. Initialized with imitation learning and refined through on-hardware towing-tank experiments, ACPPO-PID produces control policies that transfer effectively to quadrupedal free-swimming trials. Results demonstrate improved thrust efficiency, reduced destabilizing forces, and faster convergence compared with state-of-the-art baselines, underscoring the importance of constraint-aware safe RL for robust and generalizable bio-inspired locomotion in complex fluid environments.
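The abstract above mentions a PID-regulated Lagrange multiplier for constraint enforcement. The sketch below shows one common generic form of such an update, where the multiplier is driven by the constraint violation (measured cost minus its limit) through proportional, integral, and derivative terms and is kept non-negative. It is an assumption-laden illustration, not the ACPPO-PID implementation: the gains, the cost limit, and the class and function names are hypothetical, and the asymmetric clipping and cycle-wise aggregation from the abstract are not modelled.

```python
class PIDLagrangeMultiplier:
    """Non-negative Lagrange multiplier driven by a PID law on constraint violation."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None

    def update(self, cost):
        error = cost - self.cost_limit
        # The integral term is clamped at zero so it cannot wind up negatively.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts only to increasing cost.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, cost - self.prev_cost)
        self.prev_cost = cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)


def penalized_objective(reward, cost, multiplier):
    """Lagrangian-style objective: reward minus the weighted constraint cost."""
    return reward - multiplier * cost


# Hypothetical gains and measured costs (e.g. a destabilizing-force measure).
controller = PIDLagrangeMultiplier(kp=1.0, ki=0.1, kd=0.5, cost_limit=1.0)
multipliers = [controller.update(c) for c in (2.0, 3.0, 0.0)]
```

The multiplier grows while the cost exceeds its limit and drops back to zero once the constraint is comfortably satisfied, so the penalty is only active when needed.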
SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling
Safety-critical scenario generation is crucial for evaluating autonomous driving systems. However, existing approaches often struggle to balance three conflicting objectives: adversarial criticality, physical feasibility, and behavioral realism. To bridge this gap, we propose SaFeR: safety-critical scenario generation for autonomous driving test via feasibility-constrained token resampling. We first formulate traffic generation as a discrete next token prediction problem, employing a Transformer-based model as a realism prior to capture naturalistic driving distributions. To capture complex interactions while effectively mitigating attention noise, we propose a novel differential attention mechanism within the realism prior. Building on this prior, SaFeR implements a novel resampling strategy that induces adversarial behaviors within a high-probability trust region to maintain naturalism, while enforcing a feasibility constraint derived from the Largest Feasible Region (LFR). By approximating the LFR via offline reinforcement learning, SaFeR effectively prevents the generation of theoretically inevitable collisions. Closed-loop experiments on the Waymo Open Motion Dataset and nuPlan demonstrate that SaFeR significantly outperforms state-of-the-art baselines, achieving a higher solution rate and superior kinematic realism while maintaining strong adversarial effectiveness.
Sim2Sea: Sim-to-Real Policy Transfer for Maritime Vessel Navigation in Congested Waters
Autonomous navigation in congested maritime environments is a critical capability for a wide range of real-world applications. However, it remains an unresolved challenge due to complex vessel interactions and significant environmental uncertainties. Existing methods often fail in practical deployment due to a substantial sim-to-real gap, which stems from imprecise simulation, inadequate situational awareness, and unsafe exploration strategies. To address these, we propose Sim2Sea, a comprehensive framework designed to bridge simulation and real-world execution. Sim2Sea advances in three key aspects. First, we develop a GPU-accelerated parallel simulator for scalable and accurate maritime scenario simulation. Second, we design a dual-stream spatiotemporal policy that handles complex dynamics and multi-modal perception, augmented with a velocity-obstacle-guided action masking mechanism to ensure safe and efficient exploration. Finally, a targeted domain randomization scheme helps bridge the sim-to-real gap. Simulation results demonstrate that our method achieves faster convergence and safer trajectories than established baselines. In addition, our policy trained purely in simulation successfully transfers zero-shot to a 17-ton unmanned vessel operating in real-world congested waters. These results validate the effectiveness of Sim2Sea in achieving reliable sim-to-real transfer for practical autonomous maritime navigation.
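The velocity-obstacle-guided action masking described in the abstract above can be pictured with a minimal geometric sketch. It assumes circular safety regions, constant obstacle velocities, an unbounded time horizon, and a discrete set of candidate velocities. The function names and these simplifications are illustrative assumptions, not the Sim2Sea implementation.

```python
import numpy as np


def in_velocity_obstacle(p_rel, v_rel, radius):
    """True if holding relative velocity v_rel brings the obstacle within `radius`.

    p_rel is obstacle position minus own position; v_rel is own velocity
    minus obstacle velocity.
    """
    p_rel = np.asarray(p_rel, dtype=float)
    v_rel = np.asarray(v_rel, dtype=float)
    if np.linalg.norm(p_rel) <= radius:
        return True  # already inside the safety radius
    speed_sq = float(v_rel @ v_rel)
    if speed_sq == 0.0:
        return False  # no relative motion, so the distance stays constant
    t_closest = float(p_rel @ v_rel) / speed_sq
    if t_closest <= 0.0:
        return False  # closest approach lies in the past; the vessels are separating
    miss_distance = np.linalg.norm(p_rel - t_closest * v_rel)
    return bool(miss_distance < radius)


def action_mask(own_pos, candidate_velocities, obstacles, radius):
    """Boolean mask over candidate velocities; True means the action is allowed."""
    own_pos = np.asarray(own_pos, dtype=float)
    mask = []
    for v in candidate_velocities:
        v = np.asarray(v, dtype=float)
        blocked = any(
            in_velocity_obstacle(
                np.asarray(p_obs, dtype=float) - own_pos,
                v - np.asarray(v_obs, dtype=float),
                radius,
            )
            for p_obs, v_obs in obstacles
        )
        mask.append(not blocked)
    return np.array(mask)


candidates = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (1.0, 1.0)]
obstacles = [((10.0, 0.0), (0.0, 0.0))]  # (position, velocity) of one stationary vessel
mask = action_mask((0.0, 0.0), candidates, obstacles, radius=2.0)
```

Here only the candidate that heads straight at the stationary vessel is masked out. A policy could apply such a mask to its action logits before sampling, so that exploration avoids velocities on a collision course.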
Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark
Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
HE-VPR: Height Estimation Enabled Aerial Visual Place Recognition Against Scale Variance
In this work, we propose HE-VPR, a visual place recognition (VPR) framework that incorporates height estimation. Our system decouples height inference from place recognition, allowing both modules to share a frozen DINOv2 backbone. Two lightweight bypass adapter branches are integrated into our system. The first estimates the height partition of the query image via retrieval from a compact height database, and the second performs VPR within the corresponding height-specific sub-database. The adaptation design reduces training cost and significantly decreases the search space of the database. We also adopt a center-weighted masking strategy to further enhance the robustness against scale differences. Experiments on two self-collected challenging multi-altitude datasets demonstrate that HE-VPR achieves up to 6.1% Recall@1 improvement over state-of-the-art ViT-based baselines and reduces memory usage by up to 90%. These results indicate that HE-VPR offers a scalable and efficient solution for height-aware aerial VPR, enabling practical deployment in GNSS-denied environments. All the code and datasets for this work have been released at https://github.com/hmf21/HE-VPR.
Force-Aware Residual DAgger via Trajectory Editing for Precision Insertion with Impedance Control
Imitation learning (IL) has shown strong potential for contact-rich precision insertion tasks. However, its practical deployment is often hindered by covariate shift and the need for continuous expert monitoring to recover from failures during execution. In this paper, we propose Trajectory Editing Residual Dataset Aggregation (TER-DAgger), a scalable and force-aware human-in-the-loop imitation learning framework. First, TER-DAgger mitigates covariate shift by learning residual policies through optimization-based trajectory editing, which smoothly fuses policy rollouts with human corrective trajectories and provides consistent and stable supervision. Second, we introduce a force-aware failure anticipation mechanism that triggers human intervention only when discrepancies arise between predicted and measured end-effector forces, significantly reducing the requirement for continuous expert monitoring. Third, all learned policies are executed within a Cartesian impedance control framework, ensuring compliant and safe behavior during contact-rich interactions. Extensive experiments in both simulation and real-world precision insertion tasks show that TER-DAgger improves the average success rate by over 37% compared to behavior cloning, human-guided correction, retraining, and fine-tuning baselines, demonstrating its effectiveness in mitigating covariate shift and enabling scalable deployment in contact-rich manipulation.
Self-adapting Robotic Agents through Online Continual Reinforcement Learning with World Model Feedback IROS 2026
As learning-based robotic controllers are typically trained offline and deployed with fixed parameters, their ability to cope with unforeseen changes during operation is limited. Taking inspiration from biology, this work presents a framework for online Continual Reinforcement Learning that enables automated adaptation during deployment. Building on DreamerV3, a model-based Reinforcement Learning algorithm, the proposed method leverages world model prediction residuals to detect out-of-distribution events and automatically trigger finetuning. Adaptation progress is monitored using both task-level performance signals and internal training metrics, allowing convergence to be assessed without external supervision or domain knowledge. The approach is validated on a variety of contemporary continuous control problems, including a quadruped robot in high-fidelity simulation and a real-world model vehicle. Relevant metrics and their interpretation are presented and discussed, along with the resulting trade-offs. The results sketch out how autonomous robotic agents could one day move beyond static training regimes toward adaptive systems capable of self-reflection and self-improvement during operation, just like their biological counterparts.
comment: submitted to IROS 2026
Lambdas at the Far Edge: a Tale of Flying Lambdas and Lambdas on Wheels
Aggregate Programming (AP) is a paradigm for programming the collective behaviour of sets of distributed devices, possibly situated at the network far edge, by relying on asynchronous proximity-based interactions. The eXchange Calculus (XC), a recently proposed foundational model for AP, is essentially a typed lambda calculus extended with an operator (the exchange operator) providing an implicit communication mechanism between neighbour devices. This paper provides a gentle introduction to XC and to its implementation as a C++ library, called FCPP. The FCPP library and toolchain have been mainly developed at the Department of Computer Science of the University of Turin, where Stefano Berardi spent most of his academic career conducting outstanding research about the logical foundations of computer science and transmitting his passion for research to students and young researchers, often exploiting typed lambda calculi. An FCPP program is essentially a typed lambda term, and FCPP has been used to write code that has been deployed on devices at the far edge of the network, including rovers and (soon) Uncrewed Aerial Vehicles (UAVs); hence the title of the paper.
comment: In Proceedings LTT 2026, arXiv:2603.02912
Map-Agnostic And Interactive Safety-Critical Scenario Generation via Multi-Objective Tree Search
Generating safety-critical scenarios is essential for validating the robustness of autonomous driving systems, yet existing methods often struggle to produce collisions that are both realistic and diverse while ensuring explicit interaction logic among traffic participants. This paper presents a novel framework for traffic-flow level safety-critical scenario generation via multi-objective Monte Carlo Tree Search (MCTS). We reframe trajectory feasibility and naturalistic behavior as optimization objectives within a unified evaluation function, enabling the discovery of diverse collision events without compromising realism. A hybrid Upper Confidence Bound (UCB) and Lower Confidence Bound (LCB) search strategy is introduced to balance exploratory efficiency with risk-averse decision-making. Furthermore, our method is map-agnostic and supports interactive scenario generation with each vehicle individually powered by SUMO's microscopic traffic models, enabling realistic agent behaviors in arbitrary geographic locations imported from OpenStreetMap. We validate our approach across four high-risk accident zones in Hong Kong's complex urban environments. Experimental results demonstrate that our framework achieves an 85% collision failure rate while generating trajectories with superior feasibility and comfort metrics. The resulting scenarios exhibit greater complexity, as evidenced by increased vehicle mileage and CO2 emissions. Our work provides a principled solution for stress testing autonomous vehicles through the generation of realistic yet infrequent corner cases at traffic-flow level.
Right in Time: Reactive Reasoning in Regulated Traffic Spaces
Exact inference in probabilistic First-Order Logic offers a promising yet computationally costly approach for regulating the behavior of autonomous agents in shared traffic spaces. While prior methods have combined logical and probabilistic data into decision-making frameworks, their application is often limited to pre-flight checks due to the complexity of reasoning across vast numbers of possible universes. In this work, we propose a reactive mission design framework that jointly considers uncertain environmental data and declarative, logical traffic regulations. By synthesizing Probabilistic Mission Design (ProMis) with reactive reasoning facilitated by Reactive Circuits (RC), we enable online, exact probabilistic inference over hybrid domains. Our approach leverages the Frequency of Change inherent in heterogeneous data streams to subdivide inference formulas into memoized, isolated tasks, ensuring that only the specific components affected by new sensor data are re-evaluated. In experiments involving both real-world vessel data and simulated drone traffic in dense urban scenarios, we demonstrate that our approach provides orders of magnitude in speedup over ProMis without reactive paradigms. This allows intelligent transportation systems, such as Unmanned Aircraft Systems (UAS), to actively assert safety and legal compliance during operations rather than relying solely on preparation procedures.
Structural Action Transformer for 3D Dexterous Manipulation CVPR
Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.
comment: Accepted by CVPR
ArthroCut: Autonomous Policy Learning for Robotic Bone Resection in Knee Arthroplasty ICRA
Despite rapid commercialization of surgical robots, their autonomy and real-time decision-making remain limited in practice. To address this gap, we propose ArthroCut, an autonomous policy learning framework that upgrades knee arthroplasty robots from assistive execution to context-aware action generation. ArthroCut fine-tunes a Qwen-VL backbone on a self-built, time-synchronized multimodal dataset from 21 complete cases (23,205 RGB-D pairs), integrating preoperative CT/MR, intraoperative NDI tracking of bones and end effector, RGB-D surgical video, robot state, and textual intent. The method operates on two complementary token families, Preoperative Imaging Tokens (PIT) to encode patient-specific anatomy and planned resection planes and Time-Aligned Surgical Tokens (TAST) to fuse real-time visual, geometric, and kinematic evidence, and emits an interpretable action grammar under grammar/safety-constrained decoding. In bench-top experiments on a knee prosthesis across seven trials, ArthroCut achieves an average success rate of 86% over the six standard resections, significantly outperforming strong baselines trained under the same protocol. Ablations show that TAST is the principal driver of reliability while PIT provides essential anatomical grounding, and their combination yields the most stable multi-plane execution. These results indicate that aligning preoperative geometry with time-aligned intraoperative perception and translating that alignment into tokenized, constrained actions is an effective path toward robust, interpretable autonomy in orthopedic robotic surgery.
comment: Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA)
RVN-Bench: A Benchmark for Reactive Visual Navigation
Safe visual navigation is critical for indoor mobile robots operating in cluttered environments. Existing benchmarks, however, often neglect collisions or are designed for outdoor scenarios, making them unsuitable for indoor visual navigation. To address this limitation, we introduce the reactive visual navigation benchmark (RVN-Bench), a collision-aware benchmark for indoor mobile robots. In RVN-Bench, an agent must reach sequential goal positions in previously unseen environments using only visual observations and no prior map, while avoiding collisions. Built on the Habitat 2.0 simulator and leveraging high-fidelity HM3D scenes, RVN-Bench provides large-scale, diverse indoor environments, defines a collision-aware navigation task and evaluation metrics, and offers tools for standardized training and benchmarking. RVN-Bench supports both online and offline learning by offering an environment for online reinforcement learning, a trajectory image dataset generator, and tools for producing negative trajectory image datasets that capture collision events. Experiments show that policies trained on RVN-Bench generalize effectively to unseen environments, demonstrating its value as a standardized benchmark for safe and robust visual navigation. Code and additional materials are available at: https://rvn-bench.github.io/.
Lightweight Visual Reasoning for Socially-Aware Robots ICRA26
Robots operating in shared human environments must not only navigate, interact, and detect their surroundings, but also interpret and respond to dynamic, and often unpredictable, human behaviours. Although recent advances have shown promise in enhancing robotic perception and instruction-following using Vision-Language Models (VLMs), they remain limited in addressing the complexities of multimodal human-robot interactions (HRI). Motivated by this challenge, we introduce a lightweight language-to-vision feedback module that closes the loop between an LLM and the vision encoder in VLMs. The module projects image-token hidden states through a gated Multi-Layer Perceptron (MLP) back into the encoder input, prompting a second pass that reinterprets the scene under text context. We evaluate this approach on three robotics-centred tasks: navigation in a simulated environment (Habitat), sequential scene description (Mementos-Robotics), and human-intention recognition (our HRI dataset). Results show that our method improves Qwen 2.5 (7B) by 3.3% (less distance), +0.057 description score, and +2.93% accuracy, with less than 3% extra parameters; Gemma 3 (4B) and LLaVA OV 1.5 (4B) show mixed navigation results but gain +0.111 and +0.055 in description score and +10.81% and +4.79% in accuracy, respectively, on the latter two tasks. Code is available at https://github.com/alessioGalatolo/VLM-Reasoning-for-Robotics
comment: ICRA26
DISC: Dense Integrated Semantic Context for Large-Scale Open-Set Semantic Mapping
Open-set semantic mapping enables language-driven robotic perception, but current instance-centric approaches are bottlenecked by context-depriving and computationally expensive crop-based feature extraction. To overcome this fundamental limitation, we introduce DISC (Dense Integrated Semantic Context), featuring a novel single-pass, distance-weighted extraction mechanism. By deriving high-fidelity CLIP embeddings directly from the vision transformer's intermediate layers, our approach eliminates the latency and domain-shift artifacts of traditional image cropping, yielding pure, mask-aligned semantic representations. To fully leverage these features in large-scale continuous mapping, DISC is built upon a fully GPU-accelerated architecture that replaces periodic offline processing with precise, on-the-fly voxel-level instance refinement. We evaluate our approach on standard benchmarks (Replica, ScanNet) and a newly generated large-scale-mapping dataset based on Habitat-Matterport 3D (HM3DSEM) to assess scalability across complex scenes in multi-story buildings. Extensive evaluations demonstrate that DISC significantly surpasses current state-of-the-art zero-shot methods in both semantic accuracy and query retrieval, providing a robust, real-time capable framework for robotic deployment. The full source code, data generation and evaluation pipelines will be made available at https://github.com/DFKI-NI/DISC.
IROSA: Interactive Robot Skill Adaptation using Natural Language
Foundation models have demonstrated impressive capabilities across diverse domains, while imitation learning provides principled methods for robot skill adaptation from limited data. Combining these approaches holds significant promise for direct application to robotics, yet this combination has received limited attention, particularly for industrial deployment. We present a novel framework that enables open-vocabulary skill adaptation through a tool-based architecture, maintaining a protective abstraction layer between the language model and robot hardware. Our approach leverages pre-trained LLMs to select and parameterize specific tools for adapting robot skills without requiring fine-tuning or direct model-to-robot interaction. We demonstrate the framework on a 7-DoF torque-controlled robot performing an industrial bearing ring insertion task, showing successful skill adaptation through natural language commands for speed adjustment, trajectory correction, and obstacle avoidance while maintaining safety, transparency, and interpretability.
comment: Accepted IEEE Robotics and Automation Letters (RA-L) journal, 8 pages, 5 figures, 3 tables, 1 listing
SkillVLA: Tackling Combinatorial Diversity in Dual-Arm Manipulation via Skill Reuse
Recent progress in vision-language-action (VLA) models has demonstrated strong potential for dual-arm manipulation, enabling complex behaviors and generalization to unseen environments. However, mainstream bimanual VLA formulations largely overlook the critical challenge of combinatorial diversity. Different pairings of single-arm behaviors can induce qualitatively distinct task behaviors, yet existing models do not explicitly account for this structure. We argue that effective bimanual VLAs should support skill reuse - the ability to recombine previously learned single-arm skills across novel left-right pairings - thereby avoiding the need to separately learn every possible combination. Current VLA designs entangle skills across arms, preventing such recomposition and limiting scalability. To address this limitation, we propose SkillVLA, a framework explicitly designed to enable skill reuse in dual-arm manipulation. Extensive experiments demonstrate that SkillVLA substantially improves skill composition, increasing overall success rate from 0% to 51%, and achieves strong performance on cooperative and long-horizon tasks.
comment: 16 pages
Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
Continual learning is a long-standing challenge in robot policy learning, where a policy must acquire new skills over time without catastrophically forgetting previously learned ones. While prior work has extensively studied continual learning in relatively small behavior cloning (BC) policy models trained from scratch, its behavior in modern large-scale pretrained Vision-Language-Action (VLA) models remains underexplored. In this work, we found that pretrained VLAs are remarkably resistant to forgetting compared with smaller policy models trained from scratch. Simple Experience Replay (ER) works surprisingly well on VLAs, sometimes achieving zero forgetting even with a small replay data size. Our analysis reveals that pretraining plays a critical role in downstream continual learning performance: large pretrained models mitigate forgetting with a small replay buffer size while maintaining strong forward learning capabilities. Furthermore, we found that VLAs can retain relevant knowledge from prior tasks despite performance degradation during learning new tasks. This knowledge retention enables rapid recovery of seemingly forgotten skills through finetuning. Together, these insights imply that large-scale pretraining fundamentally changes the dynamics of continual learning, enabling models to continually acquire new skills over time with simple replay. Code and more information can be found at https://ut-austin-rpl.github.io/continual-vla
Learning Surgical Robotic Manipulation with 3D Spatial Priors CVPR26
Achieving 3D spatial awareness is crucial for surgical robotic manipulation, where precise and delicate operations are required. Existing methods either explicitly reconstruct the surgical scene prior to manipulation, or enhance multi-view features by adding wrist-mounted cameras to supplement the default stereo endoscopes. However, both paradigms suffer from notable limitations: the former easily leads to error accumulation and prevents end-to-end optimization due to its multi-stage nature, while the latter is rarely adopted in clinical practice since wrist-mounted cameras can interfere with the motion of surgical robot arms. In this work, we introduce the Spatial Surgical Transformer (SST), an end-to-end visuomotor policy that empowers surgical robots with 3D spatial awareness by directly exploring 3D spatial cues embedded in endoscopic images. First, we build Surgical3D, a large-scale photorealistic dataset containing 30K stereo endoscopic image pairs with accurate 3D geometry, addressing the scarcity of 3D data in surgical scenes. Based on Surgical3D, we finetune a powerful geometric transformer to extract robust 3D latent representations from stereo endoscope images. These representations are then seamlessly aligned with the robot's action space via a lightweight multi-level spatial feature connector (MSFC), all within an endoscope-centric coordinate frame. Extensive real-robot experiments demonstrate that SST achieves state-of-the-art performance and strong spatial generalization on complex surgical tasks such as knot tying and ex-vivo organ dissection, representing a significant step toward practical clinical deployment. The dataset and code will be released.
comment: CVPR26
Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport
Effective human-robot collaboration (HRC) requires translating high-level intent into contact-stable whole-body motion while continuously adapting to a human partner. Many vision-language-action (VLA) systems learn end-to-end mappings from observations and instructions to actions, but they often emphasize reactive (System 1-like) behavior and leave under-specified how sustained System 2-style deliberation can be integrated with reliable, low-latency continuous control. This gap is acute in multi-agent HRC, where long-horizon coordination decisions and physical execution must co-evolve under contact, feasibility, and safety constraints. We address this limitation with cognition-to-control (C2C), a three-layer hierarchy that makes the deliberation-to-control pathway explicit: (i) a VLM-based grounding layer that maintains persistent scene referents and infers embodiment-aware affordances/constraints; (ii) a deliberative skill/coordination layer (the System 2 core) that optimizes long-horizon skill choices and sequences under human-robot coupling via decentralized MARL cast as a Markov potential game with a shared potential encoding task progress; and (iii) a whole-body control layer that executes the selected skills at high frequency while enforcing kinematic/dynamic feasibility and contact stability. The deliberative layer is realized as a residual policy relative to a nominal controller, internalizing partner dynamics without explicit role assignment. Experiments on collaborative manipulation tasks show higher success and robustness than single-agent and end-to-end baselines, with stable coordination and emergent leader-follower behaviors.
Interaction-Aware Whole-Body Control for Compliant Object Transport
Cooperative object transport in unstructured environments remains challenging for assistive humanoids because strong, time-varying interaction forces can make tracking-centric whole-body control unreliable, especially in close-contact support tasks. This paper proposes a bio-inspired, interaction-oriented whole-body control (IO-WBC) that functions as an artificial cerebellum - an adaptive motor agent that translates upstream (skill-level) commands into stable, physically consistent whole-body behavior under contact. This work structurally separates upper-body interaction execution from lower-body support control, enabling the robot to maintain balance while shaping force exchange in a tightly coupled robot-object system. A trajectory-optimized reference generator (RG) provides a kinematic prior, while a reinforcement learning (RL) policy governs body responses under heavy-load interactions and disturbances. The policy is trained in simulation with randomized payload mass/inertia and external perturbations, and deployed via asymmetric teacher-student distillation so that the student relies only on proprioceptive histories at runtime. Extensive experiments demonstrate that IO-WBC maintains stable whole-body behavior and physical interaction even when precise velocity tracking becomes infeasible, enabling compliant object transport across a wide range of scenarios.
RAGNav: A Retrieval-Augmented Topological Reasoning Framework for Multi-Goal Visual-Language Navigation
Vision-Language Navigation (VLN) is evolving from single-point pathfinding toward the more challenging Multi-Goal VLN. This task requires agents to accurately identify multiple entities while collaboratively reasoning over their spatial-physical constraints and sequential execution order. However, generic Retrieval-Augmented Generation (RAG) paradigms often suffer from spatial hallucinations and planning drift when handling multi-object associations due to the lack of explicit spatial modeling. To address these challenges, we propose RAGNav, a framework that bridges the gap between semantic reasoning and physical structure. The core of RAGNav is a Dual-Basis Memory system, which integrates a low-level topological map for maintaining physical connectivity with a high-level semantic forest for hierarchical environment abstraction. Building on this representation, the framework introduces an anchor-guided conditional retrieval and a topological neighbor score propagation mechanism. This approach facilitates the rapid screening of candidate targets and the elimination of semantic noise, while performing semantic calibration by leveraging the physical associations inherent in the topological neighborhood. This mechanism significantly enhances the capability of inter-target reachability reasoning and the efficiency of sequential planning. Experimental results demonstrate that RAGNav achieves state-of-the-art (SOTA) performance in complex multi-goal navigation tasks.
HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration
To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
Whole-Body Safe Control of Robotic Systems with Koopman Neural Dynamics
Controlling robots with strongly nonlinear, high-dimensional dynamics remains challenging, as direct nonlinear optimization with safety constraints is often intractable in real time. The Koopman operator offers a way to represent nonlinear systems linearly in a lifted space, enabling the use of efficient linear control. We propose a data-driven framework that learns a Koopman embedding and operator from data, and integrates the resulting linear model with the Safe Set Algorithm (SSA). This allows the tracking and safety constraints to be solved in a single quadratic program (QP), ensuring feasibility and optimality without a separate safety filter. We validate the method on a Kinova Gen3 manipulator and a Go2 quadruped, showing accurate tracking and obstacle avoidance.
Characterization and Correlation of Robotic Snake Scale Friction and Locomotion Speed
Snake robots are inspired by the ability of biological snakes to move over rock, grass, leaves, soil, up trees, along pavement and more. Their ability to move in multiple distinct environments is due to their legless locomotion strategy, which combines distinct gaits with a skin that exhibits frictional anisotropy. Designing soft robotic snakes with similar capabilities requires an understanding of how this underlying frictional anisotropy should be created in engineered systems, and how variances in the frictional anisotropy ratio affect locomotion speed and direction on different surfaces. While forward and backward frictional ratios have been characterized for previous scale designs, lateral friction and the associated ratios are often overlooked. In this paper, our contributions include: (i) the development of a novel articulated pseudo-skin design that is modular, easy to construct and has removable or replaceable scales; (ii) experimental measurement of the frictional characteristics of otherwise-identical scales at varying angles of attack (15°, 25°, 35°, 45°) on different surfaces of interest (grass, bark, smooth surface, carpet); (iii) separate measurements of locomotion speed for each angle and surface. Consequently, while we observed some consistent trends between frictional coefficients and scale angle, aligning with literature and intuition, we were not able to consistently identify expected correlations between frictional ratios and locomotion speed. We conclude that either frictional ratios alone are not sufficient to predict the observed speed of a snake robot, or that specific measurement approaches are required to accurately capture these ratios.
comment: Accepted for 9th IEEE-RAS International Conference on Soft Robotics (RoboSoft 2026), 8 pages, 7 figures
X-Loco: Towards Generalist Humanoid Locomotion Control via Synergetic Policy Distillation
While recent advances have demonstrated strong performance in individual humanoid skills such as upright locomotion, fall recovery and whole-body coordination, learning a single policy that masters all these skills remains challenging due to the diverse dynamics and conflicting control objectives involved. To address this, we introduce X-Loco, a framework for training a vision-based generalist humanoid locomotion policy. X-Loco trains multiple oracle specialist policies and adopts a synergetic policy distillation with a case-adaptive specialist selection mechanism, which dynamically leverages multiple specialist policies to guide a vision-based student policy. This design enables the student to acquire a broad spectrum of locomotion skills, ranging from fall recovery to terrain traversal and whole-body coordination skills. To the best of our knowledge, X-Loco is the first framework to demonstrate vision-based humanoid locomotion that jointly integrates upright locomotion, whole-body coordination and fall recovery, while operating solely under velocity commands without relying on reference motions. Experimental results show that X-Loco achieves superior performance, demonstrated by tasks such as fall recovery and terrain traversal. Ablation studies further highlight that our framework effectively leverages specialist expertise and enhances learning efficiency.
Soft Semi-active Back Support Device with Adaptive Force Profiles using Variable-elastic Actuation and Weight Feedback
Portable active back support devices (BSDs) offer tunable assistance but are often bulky and heavy, limiting their usability. In contrast, passive BSDs are lightweight and compact but lack the ability to adapt their assistance to different back movements. We present a soft, lightweight, and compact BSD that combines a variable-stiffness passive element and an active element (an artificial muscle) in parallel. The device provides tunable assistance through discrete changes in stiffness values and active force levels. We validate the device's tuning capabilities through bench testing and on-body characterization. Further, we use the device's tuning capabilities to provide weight-adaptive object lifting and lowering assistance. We detect the weight handled by the user based on forearm force myography and upper-back inertial measurement unit data. Furthermore, electromyography analyses in five participants performing symmetric object lifting and lowering tasks showed reductions in back extensor activity. Preliminary results in one participant also indicated reduced muscle activity during asymmetric lifting.
comment: 17 pages, 18 figures
Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning
Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7 in planning and execution time in simulation, and 72.6 in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.
UrbanHuRo: A Two-Layer Human-Robot Collaboration Framework for the Joint Optimization of Heterogeneous Urban Services ICRA'26
In the vision of smart cities, technologies are being developed to enhance the efficiency of urban services and improve residents' quality of life. However, most existing research focuses on optimizing individual services in isolation, without adequately considering reciprocal interactions among heterogeneous urban services that could yield higher efficiency and improved resource utilization. For example, human couriers could collect traffic and air quality data along their delivery routes, while sensing robots could assist with on-demand delivery during peak hours, enhancing both sensing coverage and delivery efficiency. However, the joint optimization of different urban services is challenging due to potentially conflicting objectives and the need for real-time coordination in dynamic environments. In this paper, we propose UrbanHuRo, a two-layer human-robot collaboration framework for joint optimization of heterogeneous urban services, demonstrated through crowdsourced delivery and urban sensing. UrbanHuRo includes two key designs: (i) a scalable distributed MapReduce-based K-submodular maximization module for efficient order dispatch, and (ii) a deep submodular reward reinforcement learning algorithm for sensing route planning. Experimental evaluations on real-world datasets from a food delivery platform demonstrate that UrbanHuRo improves sensing coverage by 29.7% and courier income by 39.2% on average in most settings, while also significantly reducing the number of overdue orders.
comment: 8 pages, 15 figures. This paper has been accepted by ICRA'26 as a regular paper
TreeLoc++: Robust 6-DoF LiDAR Localization in Forests with a Compact Digital Forest Inventory
Reliable localization is essential for sustainable forest management, as it allows robots or sensor systems to revisit and monitor the status of individual trees over long periods. In modern forestry, this management is structured around Digital Forest Inventories (DFIs), which encode stems using compact geometric attributes rather than raw data. Despite their central role, DFIs have been overlooked in localization research, and most methods still rely on dense gigabyte-sized point clouds that are costly to store and maintain. To improve upon this, we propose TreeLoc++, a global localization framework that operates directly on DFIs as a discriminative representation, eliminating the need to use the raw point clouds. TreeLoc++ reduces false matches in structurally ambiguous forests and improves the reliability of full 6-DoF pose estimation. It augments coarse retrieval with a pairwise distance histogram that encodes local tree-layout context, subsequently refining candidates via DBH-based filtering and yaw-consistent inlier selection to further reduce mismatches. Furthermore, a constrained optimization leveraging tree geometry jointly estimates roll, pitch, and height, enhancing pose stability and enabling accurate localization without reliance on dense 3D point cloud data. Evaluations on 27 sequences recorded in forests across three datasets and four countries show that TreeLoc++ achieves precise localization with centimeter-level accuracy. We further demonstrate robustness to long-term change by localizing data recorded in 2025 against inventories built from 2023 data, spanning a two-year interval. The system represents 15 sessions spanning 7.98 km of trajectories using only 250KB of map data and outperforms both hand-crafted and learning-based baselines that rely on point cloud maps. This demonstrates the scalability of TreeLoc++ for long-term deployment.
comment: 25 pages, 27 figures and 15 tables
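The pairwise distance histogram mentioned in the TreeLoc++ abstract can be illustrated with a minimal sketch. It assumes stems are reduced to 2D positions and that the descriptor simply counts pairwise distances per bin; the function names, bin width, and range below are hypothetical and are not taken from the paper. The point of the sketch is that such a histogram does not change under a rigid motion of the layout, which is the property that makes it usable for retrieval before the pose is known.

```python
import math
from itertools import combinations


def pairwise_distance_histogram(stems, bin_width=2.0, max_range=20.0):
    """Count pairwise horizontal distances between stem positions per bin.

    stems: list of (x, y) stem positions in metres (assumed representation).
    """
    n_bins = int(max_range / bin_width)
    hist = [0] * n_bins
    for (x1, y1), (x2, y2) in combinations(stems, 2):
        d = math.hypot(x2 - x1, y2 - y1)
        if d < max_range:
            hist[int(d / bin_width)] += 1
    return hist


def rotate_and_shift(stems, yaw, dx, dy):
    """Apply a planar rigid motion (rotation by yaw, then translation)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in stems]


# Toy stem layout; the same layout seen from an unknown pose.
stems = [(0.0, 0.0), (3.0, 1.0), (7.5, 4.0), (2.0, 9.0), (11.0, 6.5)]
moved = rotate_and_shift(stems, yaw=0.8, dx=40.0, dy=-15.0)

h_ref = pairwise_distance_histogram(stems)
h_moved = pairwise_distance_histogram(moved)
```

Both histograms agree bin by bin, so a query scan and a map entry can be compared without first aligning them.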
MistyPilot: An Agentic Fast-Slow Thinking LLM Framework for Misty Social Robots
With the availability of open APIs in social robots, it has become easier to customize general-purpose tools to meet users' needs. However, interpreting high-level user instructions, selecting and configuring appropriate tools, and executing them reliably remain challenging for users without programming experience. To address these challenges, we introduce MistyPilot, an agentic LLM-driven framework for autonomous tool selection, orchestration, and parameter configuration. MistyPilot comprises two core components: a Physically Interactive Agent (PIA) and a Socially Intelligent Agent (SIA). The PIA enables robust sensor-triggered and tool-driven task execution, while the SIA generates socially intelligent and emotionally aligned dialogue. MistyPilot further integrates a fast-slow thinking paradigm to capture user preferences, reduce latency, and improve task efficiency. To comprehensively evaluate MistyPilot, we contribute five benchmark datasets. Extensive experiments demonstrate the effectiveness of our framework in routing correctness, task completeness, fast-slow thinking retrieval efficiency, tool scalability, and emotion alignment. All code, datasets, and experimental videos will be made publicly available on the project webpage.
Touch2Insert: Zero-Shot Peg Insertion by Touching Intersections of Peg and Hole ICRA 2026
Reliable insertion of industrial connectors remains a central challenge in robotics, requiring sub-millimeter precision under uncertainty and often without full visual access. Vision-based approaches struggle with occlusion and limited generalization, while learning-based policies frequently fail to transfer to unseen geometries. To address these limitations, we leverage tactile sensing, which captures local surface geometry at the point of contact and thus provides reliable information even under occlusion and across novel connector shapes. Building on this capability, we present Touch2Insert, a tactile-based framework for arbitrary peg insertion. Our method reconstructs cross-sectional geometry from high-resolution tactile images and estimates the relative pose of the hole with respect to the peg in a zero-shot manner. By aligning reconstructed shapes through registration, the framework enables insertion from a single contact without task-specific training. To evaluate its performance, we conducted experiments with three diverse connectors in both simulation and real-robot settings. The results indicate that Touch2Insert achieved sub-millimeter pose estimation accuracy for all connectors in simulation, and attained an average success rate of 86.7% on the real robot, thereby confirming the robustness and generalizability of tactile sensing for real-world robotic connector insertion.
comment: Accepted by ICRA 2026 (IEEE International Conference on Robotics and Automation)
MEM: Multi-Scale Embodied Memory for Vision Language Action Models
Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, like cleaning up a kitchen, or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
comment: Website: https://pi.website/research/memory
A Soft Robotic Demonstration in the Stratosphere
Machines designed for operation in Space, as well as other extreme environments, need to be both resilient and adaptable when mission parameters change. Soft robots offer advantages in adaptability, but most lack resilience to the pressure and temperature extremes found as close as the Stratosphere. Dielectric elastomer actuators overcome some of those limitations when built as solid state compliant capacitors capable of converting electrical energy into mechanical work, but the elastomer resilience limits the device's operating window. Here we present a crosslinking mechanism for silicone elastomers under ultraviolet light using trimethyl(methylcyclopentadienyl)platinum(IV) as a catalyst to react hydrosilane to vinyl groups. The formation of carbon-carbon bonds enables fast processing under UV light and exceptional electro-mechanical performance in dielectric elastomer actuators. The material resilience advantage is demonstrated in controlled experiments at -40° and 120° C, as well as near vacuum, in comparison with state-of-the-art acrylic and silicone chemistries. Fully autonomous systems controlling grippers made with the novel silicone were integrated into payloads for high altitude balloon testing. Two stratospheric balloon missions were carried out and demonstrated DEAs as a viable soft robotic technology under space-like conditions (as high as 23.6 km elevation, at <0.05 atm and -55° C). The combinations of chemical building blocks and catalyst can be further expanded to address other challenges for silicones, including adhesion and additive manufacturing.
Python Bindings for a Large C++ Robotics Library: The Case of OMPL
Python bindings are a critical bridge between high-performance C++ libraries and the flexibility of Python, enabling rapid prototyping, reproducible experiments, and integration with simulation and learning frameworks in robotics research. Yet, generating bindings for large codebases is a tedious process that creates a heavy burden for a small group of maintainers. In this work, we investigate the use of Large Language Models (LLMs) to assist in generating nanobind wrappers, with human experts kept in the loop. Our workflow mirrors the structure of the C++ codebase, scaffolds empty wrapper files, and employs LLMs to fill in binding definitions. Experts then review and refine the generated code to ensure correctness, compatibility, and performance. Through a case study on a large C++ motion planning library, we document common failure modes, including mismanaging shared pointers, overloads, and trampolines, and show how in-context examples and careful prompt design improve reliability. Experiments demonstrate that the resulting bindings achieve runtime performance comparable to legacy solutions. Beyond this case study, our results provide general lessons for applying LLMs to binding generation in large-scale C++ projects.
GIANT - Global Path Integration and Attentive Graph Networks for Multi-Agent Trajectory Planning IROS
This paper presents a novel approach to multi-robot collision avoidance that integrates global path planning with local navigation strategies, utilizing attentive graph neural networks to manage dynamic interactions among agents. We introduce a local navigation model that leverages pre-planned global paths, allowing robots to adhere to optimal routes while dynamically adjusting to environmental changes. The model's robustness is enhanced through the introduction of noise during training, resulting in superior performance in complex, dynamic environments. Our approach is evaluated against established baselines, including NH-ORCA, DRL-NAV, and GA3C-CADRL, across various structurally diverse simulated scenarios. The results demonstrate that our model achieves consistently higher success rates, lower collision rates, and more efficient navigation, particularly in challenging scenarios where baseline models struggle. This work offers an advancement in multi-robot navigation, with implications for robust performance in complex, dynamic environments with varying degrees of complexity, such as those encountered in logistics, where adaptability is essential for accommodating unforeseen obstacles and unpredictable changes.
comment: Published in: 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Autonomous Aerial Non-Destructive Testing: Ultrasound Inspection with a Commercial Quadrotor in an Unstructured Environment
This work presents an integrated control and software architecture that enables arguably the first fully autonomous, contact-based non-destructive testing (NDT) using a commercial multirotor originally restricted to remotely-piloted operations. To allow autonomous operation with an off-the-shelf platform, we developed a real-time framework that interfaces directly with its onboard sensor suite. The architecture features a multi-rate control scheme: low-level control is executed at 200 Hz, force estimation at 100 Hz, while an admittance filter and trajectory planner operate at 50 Hz, ultimately supplying acceleration and yaw rate commands to the internal flight controller. We validate the system through physical experiments on a Flyability Elios 3 quadrotor equipped with an ultrasound payload. Relying exclusively on onboard sensing, the vehicle successfully performs autonomous NDT measurements within an unstructured, industrial-like environment. This work demonstrates the viability of retrofitting off-the-shelf platforms for autonomous physical interaction, paving the way for safe, contact-based inspection of hazardous and confined infrastructure.
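The multi-rate scheme described above (200 Hz, 100 Hz, 50 Hz) can be sketched as a single base loop with integer rate dividers. This is only an illustration under the assumption that all tasks are driven from one 200 Hz tick; the task names and the scheduling structure are hypothetical, and the abstract does not say how the authors implement the timing.

```python
BASE_RATE_HZ = 200  # fastest loop; slower tasks run on every n-th tick

# Rates taken from the abstract; task names are illustrative labels.
TASK_RATES_HZ = {
    "low_level_control": 200,
    "force_estimation": 100,
    "admittance_and_planning": 50,
}


def run_schedule(duration_s=1.0):
    """Count how often each task would fire over the given duration."""
    dividers = {name: BASE_RATE_HZ // rate for name, rate in TASK_RATES_HZ.items()}
    counts = {name: 0 for name in TASK_RATES_HZ}
    n_ticks = int(duration_s * BASE_RATE_HZ)
    for tick in range(n_ticks):
        for name, divider in dividers.items():
            if tick % divider == 0:
                counts[name] += 1  # a real system would call the task here
    return counts


counts = run_schedule(1.0)
```

Over one simulated second the three tasks fire 200, 100, and 50 times, so the slower stages always consume the most recent output of the faster ones.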
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.
Risk-Aware Rulebooks for Multi-Objective Trajectory Evaluation under Uncertainty
We present a risk-aware formalism for evaluating system trajectories in the presence of uncertain interactions between the system and its environment. The proposed formalism supports reasoning under uncertainty and systematically handles complex relationships among requirements and objectives, including hierarchical priorities and non-comparability. Rather than treating the environment as exogenous noise, we explicitly model how each system trajectory influences the environment and evaluate trajectories under the resulting distribution of environment responses. We prove that the formalism induces a preorder on the set of system trajectories, ensuring consistency and preventing cyclic preferences. Finally, we illustrate the approach with an autonomous driving example that demonstrates how the formalism enhances explainability by clarifying the rationale behind trajectory selection.
ELLIPSE: Evidential Learning for Robust Waypoints and Uncertainties
Robust waypoint prediction is crucial for mobile robots operating in open-world, safety-critical settings. While Imitation Learning (IL) methods have demonstrated great success in practice, they are susceptible to distribution shifts: the policy can become dangerously overconfident in unfamiliar states. In this paper, we present ELLIPSE, a method building on multivariate deep evidential regression to output waypoints and multivariate Student-t predictive distributions in a single forward pass. To reduce covariate-shift-induced overconfidence under viewpoint and pose perturbations near expert trajectories, we introduce a lightweight domain augmentation procedure that synthesizes plausible viewpoint/pose variations without collecting additional demonstrations. To improve uncertainty reliability under environment/domain shift (e.g., unseen staircases), we apply a post-hoc isotonic recalibration on probability integral transform (PIT) values so that prediction sets remain plausible during deployment. We ground the discussion and experiments in staircase waypoint prediction, where obtaining robust waypoint and uncertainty is pivotal. Extensive real-world evaluations show that ELLIPSE improves both task success rate and uncertainty coverage compared to baselines.
comment: 8 pages, 5 figures
Risk-Aware Reinforcement Learning for Mobile Manipulation
For robots to successfully transition from lab settings to everyday environments, they must begin to reason about the risks associated with their actions and make informed, risk-aware decisions. This is particularly true for robots performing mobile manipulation tasks, which involve both interacting with and navigating within dynamic, unstructured spaces. However, existing whole-body controllers for mobile manipulators typically lack explicit mechanisms for risk-sensitive decision-making under uncertainty. To our knowledge, we are the first to (i) learn risk-aware visuomotor policies for mobile manipulation conditioned on egocentric depth observations with runtime-adjustable risk sensitivity, and (ii) show risk-aware behaviours can be transferred through Imitation Learning (IL) to a visuomotor policy conditioned on egocentric depth observations. Our method achieves this by first training a privileged teacher policy using Distributional Reinforcement Learning (DRL), with a risk-neutral distributional critic. Distortion risk-metrics are then applied to the critic's predicted return distribution to calculate risk-adjusted advantage estimates used in policy updates to achieve a range of risk-aware behaviours. We then distil teacher policies with IL to obtain risk-aware student policies conditioned on egocentric depth observations. We perform extensive evaluations demonstrating that our trained visuomotor policies exhibit risk-aware behaviour (specifically achieving better worst-case performance) while performing reactive whole-body motions in unmapped environments, leveraging live depth observations for perception.
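As an illustration of how a distortion risk metric changes which action looks best, the sketch below applies conditional value at risk (CVaR) to a return distribution represented by equally weighted quantiles. CVaR is one common distortion risk measure and is used here as an assumption; the abstract does not state which metrics the authors use, and the function name and example numbers are hypothetical.

```python
import math


def cvar(quantile_returns, alpha):
    """Mean of the worst alpha-fraction of equally weighted return quantiles.

    alpha = 1.0 recovers the risk-neutral mean; smaller alpha is more risk-averse.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(quantile_returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Hypothetical return quantiles predicted by a distributional critic
# for two candidate actions.
risky = [-10.0, -2.0, 0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
safe = [0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.5, 1.5]

neutral_prefers_risky = cvar(risky, 1.0) > cvar(safe, 1.0)
averse_prefers_safe = cvar(safe, 0.25) > cvar(risky, 0.25)
```

With alpha = 1.0 the risky action has the higher mean return, while with alpha = 0.25 the safe action wins because only the worst quarter of outcomes is scored. Changing alpha at run time is one way a single return distribution could support adjustable risk sensitivity.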
Distributed State Estimation for Vision-Based Cooperative Slung Load Transportation in GPS-Denied Environments
Transporting heavy or oversized slung loads using rotorcraft has traditionally relied on single-aircraft systems, which limits both payload capacity and control authority. Cooperative multilift using teams of rotorcraft offers a scalable and efficient alternative, especially for infrequent but challenging "long-tail" payloads without the need of building larger and larger rotorcraft. Most prior multilift research assumes GPS availability, uses centralized estimation architectures, or relies on controlled laboratory motion-capture setups. As a result, these methods lack robustness to sensor loss and are not viable in GPS-denied or operationally constrained environments. This paper addresses this limitation by presenting a distributed and decentralized payload state estimation framework for vision-based multilift operations. Using onboard monocular cameras, each UAV detects a fiducial marker on the payload and estimates its relative pose. These measurements are fused via a Distributed and Decentralized Extended Information Filter (DDEIF), enabling robust and scalable estimation that is resilient to individual sensor dropouts. This payload state estimate is then used for closed-loop trajectory tracking control. Monte Carlo simulation results in Gazebo show the effectiveness of the proposed approach, including the effect of communication loss during flight.
comment: In proceedings of the 2026 AIAA SciTech Forum, Session: Intelligent Systems-27
From Local Corrections to Generalized Skills: Improving Neuro-Symbolic Policies with MEMO
Recent works use a neuro-symbolic framework for general manipulation policies. The advantage of this framework is that, by applying off-the-shelf vision and language models, the robot can break complex tasks down into semantic subtasks. However, the fundamental bottleneck is that the robot needs skills to ground these subtasks into embodied motions. Skills can take many forms (e.g., trajectory snippets, motion primitives, coded functions), but regardless of their form, skills act as a constraint. The high-level policy can only ground its language reasoning through the available skills; if the robot cannot generate the right skill for the current task, its policy will fail. We propose to address this limitation, and dynamically expand the robot's skills, by leveraging user feedback. When a robot fails, humans can intuitively explain what went wrong (e.g., "no, go higher"). While a simple approach is to recall this exact text the next time the robot faces a similar situation, we hypothesize that by collecting, clustering, and re-phrasing natural language corrections across multiple users and tasks, we can synthesize more general text guidance and coded skill templates. Applying this hypothesis, we develop Memory Enhanced Manipulation (MEMO). MEMO builds and maintains a retrieval-augmented skillbook gathered from human feedback and task successes. At run time, MEMO retrieves relevant text and code from this skillbook, enabling the robot's policy to generate new skills while reasoning over multi-task human feedback. Our experiments demonstrate that using MEMO to aggregate local feedback into general skill templates enables generalization to novel tasks where existing baselines fall short. See supplemental material here: https://collab.me.vt.edu/memo
Many-RRT*: Robust Joint-Space Trajectory Planning for Serial Manipulators
The rapid advancement of high degree-of-freedom (DoF) serial manipulators necessitates the use of swift, sampling-based motion planners for high-dimensional spaces. While sampling-based planners like the Rapidly-Exploring Random Tree (RRT) are widely used, planning in the manipulator's joint space presents significant challenges due to non-invertible forward kinematics. A single task-space end-effector pose can correspond to multiple configuration-space states, creating a multi-armed bandit problem for the planner. In complex environments, simply choosing the wrong joint space goal can result in suboptimal trajectories or even failure to find a viable plan. To address this planning problem, we propose Many-RRT*: an extension of RRT*-Connect that plans to multiple goals in parallel. By generating multiple IK solutions and growing independent trees from these goal configurations simultaneously alongside a single start tree, Many-RRT* ensures that computational effort is not wasted on suboptimal IK solutions. This approach maintains robust convergence and asymptotic optimality. Experimental evaluations across robot morphologies and diverse obstacle environments demonstrate that Many-RRT* provides higher quality trajectories (44.5% lower cost in the same runtime) with a significantly higher success rate (100% vs. the next best of 1.6%) than previous RRT iterations without compromising on runtime performance.
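The statement that one task-space goal maps to several joint-space goals can be made concrete with a toy example. The sketch below uses a planar two-link arm with analytic inverse kinematics, which is an assumption chosen for brevity and not the manipulators or solver used in the paper; it covers end-effector position only, and all names and link lengths are hypothetical.

```python
import math


def forward_kinematics(q1, q2, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inverse_kinematics(x, y, l1=1.0, l2=1.0):
    """Return all joint configurations (q1, q2) that reach (x, y)."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        return []  # target is out of reach
    solutions = []
    for sign in (1.0, -1.0):  # the two elbow branches
        q2 = sign * math.acos(c2)
        q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
        solutions.append((q1, q2))
    return solutions


target = (1.2, 0.6)
goals = inverse_kinematics(*target)
reached = [forward_kinematics(q1, q2) for q1, q2 in goals]
```

Both entries in `goals` reach the same target but differ in joint space, so a planner that commits to only one of them may pick the branch that is blocked or costly. A multi-goal planner would instead treat every entry as a candidate goal configuration.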
PTLD: Sim-to-real Privileged Tactile Latent Distillation for Dexterous Manipulation
Tactile dexterous manipulation is essential to automating complex household tasks, yet learning effective control policies remains a challenge. While recent work has relied on imitation learning, obtaining high-quality demonstrations for multi-fingered hands via robot teleoperation or kinesthetic teaching is prohibitive. Alternatively, with reinforcement learning we can learn skills in simulation, but fast and realistic simulation of tactile observations is challenging. To bridge this gap, we introduce PTLD: sim-to-real Privileged Tactile Latent Distillation, a novel approach to learning tactile manipulation skills without requiring tactile simulation. Instead of simulating tactile sensors or relying purely on proprioceptive policies to transfer zero-shot sim-to-real, our key idea is to leverage privileged sensors in the real world to collect real-world tactile policy data. This data is then used to distill a robust state estimator that operates on tactile input. Our experiments demonstrate that PTLD can significantly improve proprioceptive manipulation policies trained in simulation by incorporating tactile sensing. On the benchmark in-hand rotation task, PTLD achieves a 182% improvement over a proprioception-only policy. We also show that PTLD enables learning the challenging task of tactile in-hand reorientation, where we see a 57% improvement in the number of goals reached over using proprioception alone. Website: https://akashsharma02.github.io/ptld-website/.
Learning with pyCub: A Simulation and Exercise Framework for Humanoid Robotics
We present pyCub, an open-source physics-based simulation of the humanoid robot iCub, along with exercises to teach students the basics of humanoid robotics. Compared to existing iCub simulators (iCub SIM, iCub Gazebo), which require C++ code and YARP as middleware, pyCub works without YARP and with Python code. The complete robot with all articulations has been simulated, with two cameras in the eyes and the unique sensitive skin of the iCub comprising 4000 receptors on its body surface. The exercises range from basic control of the robot in velocity, joint, and Cartesian space to more complex tasks like gazing, grasping, or reactive control. The whole framework is written and controlled with Python, so it can be used even by people with little or almost no programming experience. The exercises can be scaled to different difficulty levels. We tested the framework in two runs of a course on humanoid robotics. The simulation, exercises, documentation, Docker images, and example videos are publicly available at https://rustlluk.github.io/pyCub.
comment: Accepted to 17th International Conference on Robotics in Education (RiE 2026)
ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems
Real-world robotic agents must act under partial observability and long horizons, where key cues may appear long before they affect decision making. However, most modern approaches rely solely on instantaneous information, without incorporating insights from the past. Standard recurrent or transformer models struggle with retaining and leveraging long-term dependencies: context windows truncate history, while naive memory extensions fail under scale and sparsity. We propose ELMUR (External Layer Memory with Update/Rewrite), a transformer architecture with structured external memory. Each layer maintains memory embeddings, interacts with them via bidirectional cross-attention, and updates them through a Least Recently Used (LRU) memory module using replacement or convex blending. ELMUR extends effective horizons up to 100,000 times beyond the attention window and achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps. In POPGym, it outperforms baselines on more than half of the tasks. On MIKASA-Robo sparse-reward manipulation tasks with visual observations, it nearly doubles the performance of strong baselines, achieving the best success rate on 21 out of 23 tasks and improving the aggregate success rate across all tasks by about 70% over the previous best baseline. These results demonstrate that structured, layer-local external memory offers a simple and scalable approach to decision making under partial observability. Code and project page: https://elmur-paper.github.io/.
comment: 31 pages, 15 figures, 8 tables
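The ELMUR entry above mentions an LRU memory module that updates slots by replacement or convex blending. A minimal sketch of that idea follows, assuming a fixed number of slots and a per-slot "last used" step counter. The function name, arguments, and the rule for choosing between replacement and blending are hypothetical and are not taken from the paper.

```python
import numpy as np

def lru_update(memory, last_used, new_slot, step, blend=None):
    """Write new_slot into the least recently used memory slot (in place).

    memory    : (num_slots, dim) float array of memory embeddings
    last_used : (num_slots,) array with the step at which each slot was last written
    blend     : None for full replacement, or a weight in (0, 1] for the
                convex blend (1 - blend) * old + blend * new
    All names here are illustrative assumptions, not ELMUR's API.
    """
    idx = int(np.argmin(last_used))      # slot that was written longest ago
    if blend is None:
        memory[idx] = new_slot           # replacement
    else:
        memory[idx] = (1.0 - blend) * memory[idx] + blend * new_slot
    last_used[idx] = step                # mark the slot as freshly used
    return idx

memory = np.zeros((3, 2))
last_used = np.array([5, 2, 7])
first = lru_update(memory, last_used, np.ones(2), step=8)
second = lru_update(memory, last_used, 2 * np.ones(2), step=9, blend=0.25)
```

In this toy run the first call replaces slot 1 (the oldest), and the second call blends into slot 0, which has become the oldest.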
Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning
Memory is crucial for enabling agents to tackle complex tasks with temporal and spatial dependencies. While many reinforcement learning (RL) algorithms incorporate memory, the field lacks a universal benchmark to assess an agent's memory capabilities across diverse scenarios. This gap is particularly evident in tabletop robotic manipulation, where memory is essential for solving tasks with partial observability and ensuring robust performance, yet no standardized benchmarks exist. To address this, we introduce MIKASA (Memory-Intensive Skills Assessment Suite for Agents), a comprehensive benchmark for memory RL, with three key contributions: (1) we propose a comprehensive classification framework for memory-intensive RL tasks, (2) we collect MIKASA-Base -- a unified benchmark that enables systematic evaluation of memory-enhanced agents across diverse scenarios, and (3) we develop MIKASA-Robo (pip install mikasa-robo-suite) -- a novel benchmark of 32 carefully designed memory-intensive tasks that assess memory capabilities in tabletop robotic manipulation. Our work introduces a unified framework to advance memory RL research, enabling more robust systems for real-world use. MIKASA is available at https://tinyurl.com/membenchrobots.
comment: 57 pages, 29 figures, 11 tables
H-WM: Robotic Task and Motion Planning Guided by Hierarchical World Model
World models are becoming central to robotic planning and control as they enable prediction of future state transitions. Existing approaches often emphasize video generation or natural-language prediction, which are difficult to ground in robot actions and suffer from compounding errors over long horizons. Classic task and motion planning models world transitions in logical space, enabling robot-executable and robust long-horizon reasoning. However, they typically operate independently of visual perception, preventing synchronized symbolic and visual state prediction. We propose a Hierarchical World Model (H-WM) that jointly predicts logical and visual state transitions within a unified framework. H-WM combines a high-level logical world model with a low-level visual world model, integrating the long-horizon robustness of symbolic reasoning with visual grounding. The hierarchical outputs provide stable intermediate guidance for long-horizon tasks, mitigating error accumulation and enabling robust execution across extended task sequences. Experiments across multiple vision-language-action (VLA) control policies demonstrate the effectiveness and generality of H-WM's guidance.
comment: 8 pages, 4 figures
Category-Level Object Shape and Pose Estimation in Less Than a Millisecond ICRA 2026
Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object's unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario. Code is released at https://github.com/MIT-SPARK/Fast-ShapeAndPose.
comment: Accepted to ICRA 2026. This version contains appendices
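The shape-and-pose entry above describes self-consistent field (SCF) iteration: at each iterate a 4-by-4 matrix is computed and its minimum eigenvalue-vector pair is found. The generic loop can be sketched as below. The callback `build_matrix`, which returns a symmetric 4-by-4 matrix depending on the current unit quaternion, is a hypothetical placeholder; the paper's actual matrix construction is not given in this listing.

```python
import numpy as np

def scf_min_eigvec(build_matrix, q0, iters=50, tol=1e-10):
    """Generic SCF loop for a problem of the form M(q) q = lambda q.

    build_matrix : callable q -> symmetric 4x4 array (hypothetical placeholder)
    q0           : initial guess, normalized to a unit 4-vector internally
    Returns the final unit vector and the minimum eigenvalue at that iterate.
    """
    q = np.asarray(q0, dtype=float)
    q = q / np.linalg.norm(q)
    lam = None
    for _ in range(iters):
        M = build_matrix(q)
        w, V = np.linalg.eigh(M)         # eigenvalues in ascending order
        lam, q_new = w[0], V[:, 0]       # minimum eigenvalue-vector pair
        if q_new @ q < 0:                # resolve the eigenvector sign ambiguity
            q_new = -q_new
        if np.linalg.norm(q_new - q) < tol:
            return q_new, lam            # self-consistent: q reproduces itself
        q = q_new
    return q, lam

# Sanity check with a constant matrix, where SCF reduces to one eigendecomposition.
M_const = np.diag([4.0, 3.0, 2.0, 1.0])
q_star, lam_star = scf_min_eigvec(lambda q: M_const, np.ones(4))
```

Convergence for a matrix that truly depends on q is not guaranteed by this loop alone; the sketch only illustrates the per-iteration cost the abstract refers to.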
Hybrid Diffusion Policies with Projective Geometric Algebra for Efficient Robot Manipulation Learning ICRA 2026
Diffusion policies are a powerful paradigm for robot learning, but their training is often inefficient. A key reason is that networks must relearn fundamental spatial concepts, such as translations and rotations, from scratch for every new task. To alleviate this redundancy, we propose embedding geometric inductive biases directly into the network architecture using Projective Geometric Algebra (PGA). PGA provides a unified algebraic framework for representing geometric primitives and transformations, allowing neural networks to reason about spatial structure more effectively. In this paper, we introduce hPGA-DP, a novel hybrid diffusion policy that capitalizes on these benefits. Our architecture leverages the Projective Geometric Algebra Transformer (P-GATr) as a state encoder and action decoder, while employing established U-Net or Transformer-based modules for the core denoising process. Through extensive experiments and ablation studies in both simulated and real-world environments, we demonstrate that hPGA-DP significantly improves task performance and training efficiency. Notably, our hybrid approach achieves substantially faster convergence compared to both standard diffusion policies and architectures that rely solely on P-GATr. The project website is available at: https://apollo-lab-yale.github.io/26-ICRA-hPGA-website/.
comment: Accepted to ICRA 2026
Aerial Manipulation with Contact-Aware Onboard Perception and Hybrid Control ICRA 2026
Aerial manipulation (AM) promises to move Unmanned Aerial Vehicles (UAVs) beyond passive inspection to contact-rich tasks such as grasping, assembly, and in-situ maintenance. Most prior AM demonstrations rely on external motion capture (MoCap) and emphasize position control for coarse interactions, limiting deployability. We present a fully onboard perception-control pipeline for contact-rich AM that achieves accurate motion tracking and regulated contact wrenches without MoCap. The main components are (1) an augmented visual-inertial odometry (VIO) estimator with contact-consistency factors that activate only during interaction, tightening uncertainty around the contact frame and reducing drift, and (2) image-based visual servoing (IBVS) to mitigate perception-control coupling, together with a hybrid force-motion controller that regulates contact wrenches and lateral motion for stable contact. Experiments show that our approach closes the perception-to-wrench loop using only onboard sensing, yielding a velocity estimation improvement of 66.01% at contact, reliable target approach, and stable force holding, pointing toward deployable, in-the-wild aerial manipulation.
comment: 8 pages, 7 figures. Accepted by ICRA 2026
Fine-Tuning Robot Policies While Maintaining User Privacy
Recent works introduce general-purpose robot policies. These policies provide a strong prior over how robots should behave -- e.g., how a robot arm should manipulate food items. But in order for robots to match an individual person's needs, users typically fine-tune these generalized policies -- e.g., showing the robot arm how to make their own preferred dinners. Importantly, during the process of personalizing robots, end-users leak data about their preferences, habits, and styles (e.g., the foods they prefer to eat). Other agents can simply roll out the fine-tuned policy and see these personally-trained behaviors. This leads to a fundamental challenge: how can we develop robots that personalize actions while keeping learning private from external agents? Here, we explore this emerging topic in human-robot interaction and develop PRoP, a model-agnostic framework for personalized and private robot policies. Our core idea is to equip each user with a unique key; this key is then used to mathematically transform the weights of the robot's network. With the correct key, the robot's policy switches to match that user's preferences -- but with incorrect keys, the robot reverts to its baseline behaviors. We show the general applicability of our method across multiple model types in imitation learning, reinforcement learning, and classification tasks. PRoP is practically advantageous because it retains the architecture and behaviors of the original policy, and experimentally outperforms existing encoder-based approaches.
Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation NeurIPS 2025
Out-of-distribution (OOD) detection and segmentation are crucial for deploying machine learning models in safety-critical applications such as autonomous driving and robot-assisted surgery. While prior research has primarily focused on unimodal image data, real-world applications are inherently multimodal, requiring the integration of multiple modalities for improved OOD detection. A key challenge is the lack of supervision signals from unknown data, leading to overconfident predictions on OOD samples. To address this challenge, we propose Feature Mixing, an extremely simple and fast method for multimodal outlier synthesis with theoretical support, which can be further optimized to help the model better distinguish between in-distribution (ID) and OOD data. Feature Mixing is modality-agnostic and applicable to various modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal dataset for OOD segmentation, featuring synthetic OOD objects across diverse scenes and weather conditions. Extensive experiments on SemanticKITTI, nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that Feature Mixing achieves state-of-the-art performance with a $10 \times$ to $370 \times$ speedup. Our source code and dataset will be available at https://github.com/mona4399/FeatureMixing.
comment: NeurIPS 2025
CLASH: Collision Learning via Augmented Sim-to-real Hybridization to Bridge the Reality Gap
The sim-to-real gap, particularly in the inaccurate modeling of contact-rich dynamics like collisions, remains a primary obstacle to deploying robot policies trained in simulation. Conventional physics engines often trade accuracy for computational speed, leading to discrepancies that prevent direct policy transfer. To address this, we introduce Collision Learning via Augmented Sim-to-real Hybridization (CLASH), a data-efficient framework that learns a parameter-conditioned impulsive collision surrogate model and integrates it as a plug-in module within a standard simulator. CLASH first distills a base model from an imperfect simulator (MuJoCo) using large-scale simulated collisions to capture reusable physical priors. Given only a handful of real collisions (e.g., 10 samples), it then (i) performs gradient-based identification of key contact parameters and (ii) applies small-step, early-stopped fine-tuning to correct residual sim-to-real mismatches while avoiding overfitting. The resulting hybrid simulator not only achieves higher post-impact prediction accuracy but also reduces the wall-clock time of collision-heavy CMA-ES search by 42-48% compared to MuJoCo. We demonstrate that policies obtained with our hybrid simulator transfer more robustly to the real world, doubling the success rate in sequential pushing tasks with reinforcement learning and significantly increasing task performance with model-based control.
A Bayesian Framework for Active Tactile Object Recognition, Pose Estimation and Shape Transfer Learning
As humans can explore and understand the world through active touch, similar capability is desired for robots. In this paper, we address the problem of active tactile object recognition, pose estimation and shape transfer learning, where a customized particle filter (PF) and Gaussian process implicit surface (GPIS) are combined in a unified Bayesian framework. Upon new tactile input, the customized PF updates the joint distribution of the object class and object pose while tracking the novelty of the object. Once a novel object is identified, its shape will be reconstructed using GPIS. By grounding the prior of the GPIS with the maximum-a-posteriori (MAP) estimation from the PF, the knowledge about known shapes can be transferred to learn novel shapes. An exploration procedure based on global shape estimation is proposed to guide active data acquisition and terminate the exploration upon sufficient information. Through experiments in simulation, the proposed framework demonstrated its effectiveness and efficiency in estimating object class and pose for known objects and learning novel shapes. Furthermore, it can recognize previously learned shapes reliably.
Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence ICRA 2026
Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action-based video object segmentation task. Second, we build ActiSeg-NL, the first benchmark for action-based video object segmentation under label noise, adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
comment: Accepted to ICRA 2026. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL
TPK: Trustworthy Trajectory Prediction Integrating Prior Knowledge For Interpretability and Kinematic Feasibility
Trajectory prediction is crucial for autonomous driving, enabling vehicles to navigate safely by anticipating the movements of surrounding road users. However, current deep learning models often lack trustworthiness as their predictions can be physically infeasible and illogical to humans. To make predictions more trustworthy, recent research has incorporated prior knowledge, like the social force model for modeling interactions and kinematic models for physical realism. However, these approaches focus on priors that suit either vehicles or pedestrians and do not generalize to traffic with mixed agent classes. We propose incorporating interaction and kinematic priors of all agent classes (vehicles, pedestrians, and cyclists) with class-specific interaction layers to capture agent behavioral differences. To improve the interpretability of the agent interactions, we introduce DG-SFM, a rule-based interaction importance score that guides the interaction layer. To ensure physically feasible predictions, we propose suitable kinematic models for all agent classes with a novel pedestrian kinematic model. We benchmark our approach on the Argoverse 2 dataset, using the state-of-the-art transformer HPTR as our baseline. Experiments demonstrate that our method improves interaction interpretability, revealing a correlation between incorrect predictions and divergence from our interaction prior. Even though incorporating the kinematic models causes a slight decrease in accuracy, they eliminate infeasible trajectories found in the dataset and the baseline model. Thus, our approach fosters trust in trajectory prediction as its interaction reasoning is interpretable, and its predictions adhere to physics.
comment: First and Second authors contributed equally; Accepted in the 36th IEEE Intelligent Vehicles Symposium (IV 2025) for oral presentation; Winner of the best paper award
SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning
Autonomous navigation under natural language instructions represents a crucial step toward embodied intelligence, enabling complex task execution in environments ranging from industrial facilities to domestic spaces. However, language-driven 3D navigation for Unmanned Aerial Vehicles (UAVs) requires precise spatial reasoning, a capability inherently lacking in current zero-shot Vision-Language Models (VLMs) which often generate ambiguous outputs and cannot guarantee geometric feasibility. Furthermore, existing Vision-Language Navigation (VLN) methods are predominantly tailored for 2.5D ground robots, rendering them unable to generalize to the unconstrained 3D spatial reasoning required for aerial tasks in small-scale, cluttered environments. In this paper, we present SoraNav, a novel framework enabling zero-shot VLM reasoning for UAV task-centric navigation. To address the spatial-semantic gap, we introduce Multi-modal Visual Annotation (MVA), which encodes 3D geometric priors directly into the VLM's 2D visual input. To mitigate hallucinated or infeasible commands, we propose an Adaptive Decision Making (ADM) strategy that validates VLM proposals against exploration history, seamlessly switching to geometry-based exploration to avoid dead-ends and redundant revisits. Deployed on a custom PX4-based micro-UAV, SoraNav demonstrates robust real-world performance. Quantitative results show our approach significantly outperforms state-of-the-art baselines, increasing Success Rate (SR) by 25.7% and navigation efficiency (SPL) by 17.3% in 2.5D scenarios, and achieving improvements of 39.3% (SR) and 24.7% (SPL) in complex 3D scenarios.
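The SoraNav entry above reports Success Rate (SR) and SPL. Assuming SPL refers to the standard "Success weighted by Path Length" navigation metric (the abstract only calls it navigation efficiency), both can be computed per batch of episodes as sketched below. The function and argument names are illustrative and do not come from the paper.

```python
def success_rate(successes):
    """Fraction of episodes that succeeded (successes are 0/1 flags)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length.

    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where l_i is the shortest
    path length and p_i the length of the path actually taken in episode i.
    Assumes all lengths are strictly positive.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)       # a failed episode contributes zero
    return total / len(successes)

# Three toy episodes: a detour (2x longer), a failure, and an optimal run.
sr_value = success_rate([1, 0, 1])
spl_value = spl([1, 0, 1], [10.0, 5.0, 4.0], [20.0, 5.0, 4.0])
```

Here the detour episode scores 0.5, the failure 0, and the optimal run 1, so the SPL of the batch is 0.5 while the success rate is 2/3.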
A Review of Reward Functions for Reinforcement Learning in the context of Autonomous Driving
Reinforcement learning has emerged as an important approach for autonomous driving. A reward function is used in reinforcement learning to establish the learned skill objectives and guide the agent toward the optimal policy. Since autonomous driving is a complex domain with partly conflicting objectives with varying degrees of priority, developing a suitable reward function represents a fundamental challenge. This paper aims to highlight the gap in such function design by assessing different proposed formulations in the literature and dividing individual objectives into Safety, Comfort, Progress, and Traffic Rules compliance categories. Additionally, the limitations of the reviewed reward functions are discussed, such as objectives aggregation and indifference to driving context. Furthermore, the reward categories are frequently inadequately formulated and lack standardization. This paper concludes by proposing future research that potentially addresses the observed shortcomings in rewards, including a reward validation framework and structured rewards that are context-aware and able to resolve conflicts.
comment: Accepted at the 35th IEEE Intelligent Vehicles Symposium (IV 2024)
LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision Language Navigation in Continuous Environments ICRA 2026
Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires an agent to navigate unseen environments based on natural language instructions without any prior training. Current methods face a critical trade-off: either rely on environment-specific waypoint predictors that limit scene generalization, or underutilize the reasoning capabilities of large models during navigation. We introduce LaViRA, a simple yet effective zero-shot framework that addresses this dilemma by decomposing action into a coarse-to-fine hierarchy: Language Action for high-level planning, Vision Action for middle-level perceptual grounding, and Robot Action for low-level control. This modular decomposition allows us to leverage the distinct strengths of different scales of Multimodal Large Language Models (MLLMs) at each stage, creating a system that is powerful in its reasoning, grounding, and practical control. LaViRA significantly outperforms existing state-of-the-art methods on the VLN-CE benchmark, demonstrating superior generalization capabilities in unseen environments, while maintaining transparency and efficiency for real-world deployment. Project page: https://robo-lavira.github.io/lavira-zs-vln/
comment: ICRA 2026
A Novel Modular Cable-Driven Soft Robotic Arm with Multi-Segment Reconfigurability
This paper presents a novel, modular, cable-driven soft robotic arm featuring multi-segment reconfigurability. The proposed architecture enables a stackable system with independent segment control, allowing scalable adaptation to diverse structural and application requirements. The system is fabricated from soft silicone material and incorporates embedded tendon-routing channels with a protective dual-helical tendon structure. Experimental results showed that modular stacking substantially expanded the reachable workspace: relative to the single-segment arm, the three-segment configuration achieved up to a 13-fold increase in planar workspace area and a 38.9-fold increase in workspace volume. Furthermore, this study investigated the effect of silicone stiffness on actuator performance. The results revealed a clear trade-off between compliance and stiffness: softer silicone improved bending flexibility, while stiffer silicone improved structural rigidity and load-bearing stability. These results highlight the potential of stiffness tuning to balance compliance and strength for configuring scalable, reconfigurable soft robotic arms.
comment: 6 pages, 8 figures
Evolution 6.0: Robot Evolution through Generative Design
We propose a new concept, Evolution 6.0, which represents the evolution of robotics driven by Generative AI. When a robot lacks the necessary tools to accomplish a task requested by a human, it autonomously designs the required instruments and learns how to use them to achieve the goal. Evolution 6.0 is an autonomous robotic system powered by Vision-Language Models (VLMs), Vision-Language Action (VLA) models, and Text-to-3D generative models for tool design and task execution. The system comprises two key modules: the Tool Generation Module, which fabricates task-specific tools from visual and textual data, and the Action Generation Module, which converts natural language instructions into robotic actions. It integrates QwenVLM for environmental understanding, OpenVLA for task execution, and Llama-Mesh for 3D tool generation. Evaluation results demonstrate a 90% success rate for tool generation with a 10-second inference time, and action generation achieving 83.5% in physical and visual generalization, 70% in motion generalization, and 37% in semantic generalization. Future improvements will focus on bimanual manipulation, expanded task capabilities, and enhanced environmental interpretation to improve real-world adaptability.
comment: Accepted to HRI
Fusion of Visual-Inertial Odometry with LiDAR Relative Localization for Cooperative Guidance of a Micro-Scale Aerial Vehicle
A novel relative localization approach for guidance of a micro-scale Unmanned Aerial Vehicle (UAV) by a well-equipped aerial robot fusing Visual-Inertial Odometry (VIO) with Light Detection and Ranging (LiDAR) is proposed in this paper. LiDAR-based localization is accurate and robust to challenging environmental conditions, but 3D LiDARs are relatively heavy and require large UAV platforms, in contrast to lightweight cameras. However, visual-based self-localization methods exhibit lower accuracy and can suffer from significant drift with respect to the global reference frame. To benefit from both sensory modalities, we focus on cooperative navigation in a heterogeneous team of a primary LiDAR-equipped UAV and a secondary micro-scale camera-equipped UAV. We propose a novel cooperative approach combining LiDAR relative localization data with VIO output on board the primary UAV to obtain an accurate pose of the secondary UAV. The pose estimate is used to precisely and reliably guide the secondary UAV along trajectories defined in the primary UAV reference frame. The experimental evaluation has shown the superior accuracy of our method over the raw VIO output, reaching an average 3D Absolute Trajectory Error (ATE) of 0.28 m, and demonstrated its capability to guide the secondary UAV along desired trajectories while mitigating VIO drift. Thus, such a heterogeneous system can explore large areas with LiDAR precision, as well as visit locations inaccessible to the large LiDAR-carrying UAV platforms, as was showcased in a real-world cooperative mapping scenario.
comment: Accepted version
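The entry above reports an average 3D Absolute Trajectory Error (ATE). As a rough illustration of how such a number is obtained, the sketch below computes the mean and RMSE of per-sample position errors. It assumes the estimated and reference trajectories are already time-synchronized and expressed in the same frame; evaluation tools often add a rigid alignment step first, which is omitted here. The function name is hypothetical.

```python
import numpy as np

def absolute_trajectory_error(estimated, reference):
    """Return (mean, rmse) of 3D position errors between two trajectories.

    estimated, reference : (N, 3) arrays of positions in metres, assumed to be
    sampled at the same timestamps and expressed in the same frame.
    """
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    errors = np.linalg.norm(est - ref, axis=1)   # Euclidean error per sample
    return float(errors.mean()), float(np.sqrt(np.mean(errors ** 2)))

# Two toy samples with position errors of 0.3 m and 0.4 m.
mean_ate, rmse_ate = absolute_trajectory_error(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.3], [1.0, 0.4, 0.0]],
)
```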
CASSR: Continuous A-Star Search through Reachability for real time footstep planning
Footstep planning involves a challenging combinatorial search. Traditional A* approaches require discretising reachability constraints, while Mixed-Integer Programming (MIP) supports continuous formulations but quickly becomes intractable, especially when rotations are included. We present CASSR, a novel framework that recursively propagates convex, continuous formulations of a robot's kinematic constraints within an A* search. Combined with a new cost-to-go heuristic based on the EPA algorithm, CASSR efficiently plans contact sequences of up to 30 footsteps in under 125 ms. Experiments on biped locomotion tasks demonstrate that CASSR outperforms traditional discretised A* by up to a factor of 100, while also surpassing a commercial MIP solver. These results show that CASSR enables fast, reliable, and real-time footstep planning for biped robots.
No Need to Look! Locating and Grasping Objects by a Robot Arm Covered with Sensitive Skin ICRA 2026
Locating and grasping of objects by robots is typically performed using visual sensors. Haptic feedback from contacts with the environment is only secondary, if present at all. In this work, we explored an extreme case of searching for and grasping objects in the complete absence of visual input, relying on haptic feedback only. The main novelty lies in the use of contacts over the complete surface of a robot manipulator covered with sensitive skin. The search is divided into two phases: (1) coarse workspace exploration with the complete robot surface, followed by (2) precise localization using the end-effector equipped with a force/torque sensor. We systematically evaluated this method in simulation and on the real robot, demonstrating that diverse objects can be located, grasped, and put in a basket. The overall success rate on the real robot for one object was 85.7%, with failures occurring mainly while grasping specific objects. The method using whole-body contacts is six times faster than a baseline that uses haptic feedback only on the end-effector. We also show locating and grasping multiple objects on the table. This method is not restricted to our specific setup and can be deployed on any platform capable of sensing contacts over the entire body surface. This work holds promise for diverse applications in areas with challenging visual perception (due to lighting, dust, smoke, occlusion), such as in agriculture, when fruits or vegetables need to be located inside foliage and picked.
comment: Karel Bartunek, Lukas Rustler: Authors contributed equally. Accepted to IEEE ICRA 2026
FlowCorrect: Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation
Generative manipulation policies can fail catastrophically under deployment-time distribution shift, yet many failures are near-misses: the robot reaches almost-correct poses and would succeed with a small corrective motion. We propose FlowCorrect, a modular interactive imitation learning approach that enables deployment-time adaptation of flow-matching manipulation policies from sparse, relative human corrections without retraining. During execution, a human provides brief corrective pose nudges via a lightweight VR interface. FlowCorrect uses these sparse corrections to locally adapt the policy, improving actions without retraining the backbone while preserving the model performance on previously learned scenarios. We evaluate on a real-world robot across four tabletop tasks: pick-and-place, pouring, cup uprighting, and insertion. With a low correction budget, FlowCorrect achieves an 80% success rate on previously failed cases while preserving performance on previously solved scenarios. The results clearly demonstrate that FlowCorrect learns from very few demonstrations and enables fast, sample-efficient, incremental, human-in-the-loop corrections of generative visuomotor policies at deployment time in real-world robotics.
comment: 8 pages, 5 figures
Metric, inertially aligned monocular state estimation via kinetodynamic priors
Accurate state estimation for flexible robotic systems poses significant challenges, particularly for platforms with dynamically deforming structures that invalidate rigid-body assumptions. This paper addresses this problem and enables the extension of existing rigid-body pose estimation methods to non-rigid systems. Our approach integrates two core components: first, we capture elastic properties using a deformation-force model, efficiently learned via a Multi-Layer Perceptron; second, we resolve the platform's inherently smooth motion using continuous-time B-spline kinematic models. By continuously applying Newton's Second Law, our method formulates the relationship between visually derived trajectory acceleration and predicted deformation-induced acceleration. We demonstrate that our approach not only enables robust and accurate pose estimation on non-rigid platforms, but also that properly modeled platform physics allow for the recovery of inertial sensing properties. We validate this feasibility on a simple spring-camera system, showing how it robustly resolves the typically ill-posed problem of metric scale and gravity recovery in monocular visual odometry.
Dynamic-ICP: Doppler-Aware Iterative Closest Point Registration for Dynamic Scenes
Reliable odometry in highly dynamic environments remains challenging when it relies on ICP-based registration: ICP assumes near-static scenes and degrades in repetitive or low-texture geometry. We introduce Dynamic-ICP, a Doppler-aware registration framework. The method (i) estimates ego motion from per-point Doppler velocity via robust regression and builds a velocity filter, (ii) clusters dynamic objects and reconstructs object-wise translational velocities from ego-compensated radial measurements, (iii) predicts dynamic points with a constant-velocity model, and (iv) aligns scans using a compact objective that combines a point-to-plane geometry residual with a translation-invariant, rotation-only Doppler residual. The approach requires no external sensors or sensor-vehicle calibration and operates directly on FMCW LiDAR range and Doppler velocities. We evaluate Dynamic-ICP on three datasets (HeRCULES, HeLiPR, and AevaScenes), focusing on highly dynamic scenes. Dynamic-ICP consistently improves rotational stability and translation accuracy over state-of-the-art methods. Our approach is also simple to integrate into existing pipelines, runs in real time, and provides a lightweight solution for robust registration in dynamic environments. To encourage further research, the code is available at: https://github.com/JMUWRobotics/Dynamic-ICP.
comment: 8 pages, 5 figures
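As a rough illustration of step (i) in the Dynamic-ICP abstract above, the sketch below fits a sensor velocity to per-point Doppler readings with iteratively reweighted least squares and Huber weights. It is not the authors' implementation. The measurement model `doppler_i = -dot(direction_i, v)` for static points, the sign convention, the Huber threshold, and all function names are assumptions made for illustration.

```python
import math
import random


def solve3(A, b):
    """Solve a 3x3 linear system A x = b with Cramer's rule."""
    def det(M):
        return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

    d = det(A)
    if abs(d) < 1e-12:
        raise ValueError("directions do not constrain all three velocity axes")
    x = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = b[r]
        x.append(det(M) / d)
    return x


def estimate_ego_velocity(directions, doppler, iters=10, huber_delta=0.2):
    """Huber-weighted IRLS fit of v in: doppler_i = -dot(direction_i, v)."""
    weights = [1.0] * len(doppler)
    v = [0.0, 0.0, 0.0]
    for _ in range(iters):
        # Weighted normal equations: (sum w d d^T) v = sum w d (-doppler).
        A = [[0.0] * 3 for _ in range(3)]
        b = [0.0] * 3
        for d, r, w in zip(directions, doppler, weights):
            for i in range(3):
                b[i] += w * d[i] * (-r)
                for j in range(3):
                    A[i][j] += w * d[i] * d[j]
        v = solve3(A, b)
        # Down-weight points with large residuals (likely moving objects).
        weights = []
        for d, r in zip(directions, doppler):
            res = abs(r + sum(di * vi for di, vi in zip(d, v)))
            weights.append(1.0 if res <= huber_delta else huber_delta / res)
    return v


def make_synthetic_scan(v_true, n_static=200, n_dynamic=20, seed=0):
    """Hypothetical test data: random unit directions plus a few moving points."""
    rng = random.Random(seed)
    directions, doppler = [], []
    for k in range(n_static + n_dynamic):
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(x * x for x in g))
        d = [x / norm for x in g]
        r = -sum(di * vi for di, vi in zip(d, v_true))
        if k >= n_static:
            r += rng.uniform(3.0, 6.0)  # a moving object adds its own radial speed
        directions.append(d)
        doppler.append(r)
    return directions, doppler
```

Calling `estimate_ego_velocity(*make_synthetic_scan([5.0, 0.5, -0.1]))` should recover a velocity close to the one used to generate the scan; the final residuals could then serve as a simple static/dynamic filter.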
TOLEBI: Learning Fault-Tolerant Bipedal Locomotion via Online Status Estimation and Fallibility Rewards ICRA
With the growing employment of learning algorithms in robotic applications, research on reinforcement learning for bipedal locomotion has become a central topic for humanoid robotics. While recently published contributions achieve high success rates in locomotion tasks, scarce attention has been devoted to methods that can handle hardware faults occurring during locomotion. However, in real-world settings, environmental disturbances or sudden hardware faults might have severe consequences. To address these issues, this paper presents TOLEBI (A faulT-tOlerant Learning framEwork for Bipedal locomotIon), which handles faults on the robot during operation. Specifically, joint locking, power loss and external disturbances are injected in simulation to learn fault-tolerant locomotion strategies. In addition to transferring the learned policy to the real robot via sim-to-real transfer, an online joint status module is incorporated. This module classifies joint conditions from the actual observations at runtime under real-world conditions. Validation experiments conducted both in the real world and in simulation with the humanoid robot TOCABI highlight the applicability of the proposed approach. To our knowledge, this manuscript provides the first learning-based fault-tolerant framework for bipedal locomotion, thereby fostering the development of efficient learning methods in this field.
comment: Accepted for Publication at IEEE International Conference on Robotics and Automation (ICRA) 2026
TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
comment: 8 pages, 6 figures
Event-LAB: Towards Standardized Evaluation of Neuromorphic Localization Methods ICRA
Event-based localization research and datasets are a rapidly growing area of interest, with a tenfold increase in the cumulative total number of published papers on this topic over the past 10 years. Whilst the rapid expansion in the field is exciting, it brings with it an associated challenge: a growth in the variety of required code and package dependencies as well as data formats, making comparisons difficult and cumbersome for researchers to implement reliably. To address this challenge, we present Event-LAB: a new and unified framework for running several event-based localization methodologies across multiple datasets. Event-LAB is implemented using the Pixi package and dependency manager, which enables a single command-line installation and invocation for combinations of localization methods and datasets. To demonstrate the capabilities of the framework, we implement two common event-based localization pipelines: Visual Place Recognition (VPR) and Simultaneous Localization and Mapping (SLAM). We demonstrate the ability of the framework to systematically visualize and analyze the results of multiple methods and datasets, revealing key insights such as the association of parameters that control event collection counts and window sizes for frame generation with large variations in performance. The results and analysis demonstrate the importance of fairly comparing methodologies with consistent event image generation parameters. Our Event-LAB framework provides this ability for the research community by contributing a streamlined workflow for easily setting up multiple conditions.
comment: 8 pages, 6 figures, accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026
Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping ICRA 2026
We propose Point2Act, which directly retrieves the 3D action point relevant to a contextually described task, leveraging Multimodal Large Language Models (MLLMs). Foundation models have opened the possibility of generalist robots that can perform zero-shot tasks following natural language descriptions within an unseen environment. While the semantics obtained from large-scale image and language datasets provide contextual understanding in 2D images, the rich yet nuanced features yield blurry 2D regions and struggle to find precise 3D locations for actions. Our proposed 3D relevancy fields bypass the high-dimensional features and instead efficiently imbue lightweight 2D point-level guidance tailored to the task-specific action. The multi-view aggregation effectively compensates for misalignments due to geometric ambiguities, such as occlusion, or semantic uncertainties inherent in the language descriptions. The output region is highly localized, capturing fine-grained 3D spatial context that can directly transfer to an explicit position for physical action in the on-the-fly reconstruction of the scene. Our full-stack pipeline, which includes capturing, MLLM querying, 3D reconstruction, and grasp pose extraction, generates spatially grounded responses in under 20 seconds, facilitating practical manipulation tasks. Project page: https://sangminkim-99.github.io/point2act/
comment: Accepted to ICRA 2026
Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory
Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.
CERNet: Class-Embedding Predictive-Coding RNN for Unified Robot Motion, Recognition, and Confidence Estimation ICRA
Robots interacting with humans must not only generate learned movements in real-time, but also infer the intent behind observed behaviors and estimate the confidence of their own inferences. This paper proposes CERNet, a unified model that achieves all three capabilities within a single hierarchical predictive-coding recurrent neural network (PC-RNN) and leverages a dynamically updated class embedding vector to unify motor generation and recognition. The model operates in two modes: generation and inference. In the generation mode, the class embedding constrains the hidden state dynamics to a class-specific subspace; in the inference mode, it is optimized online to minimize prediction error, enabling real-time recognition. Validated on a humanoid robot across 26 kinesthetically taught alphabets, our hierarchical model achieves 76% lower trajectory reproduction error than a parameter-matched single-layer baseline, maintains motion fidelity under external perturbations, and infers the demonstrated trajectory class online with 68% Top-1 and 81% Top-2 accuracy. Furthermore, internal prediction errors naturally reflect the model's confidence in its recognition. This integration of robust generation, real-time recognition, and intrinsic uncertainty estimation within a compact PC-RNN framework offers an extensible approach to motor memory in physical robots, with potential applications in intent-sensitive human-robot collaboration.
comment: Accepted for presentation at IEEE International Conference on Robotics and Automation (ICRA) 2026
Learning Agile Gate Traversal via Analytical Optimal Policy Gradient
Traversing narrow gates presents a significant challenge and has become a standard benchmark for evaluating agile and precise quadrotor flight. Traditional modularized autonomous flight stacks require extensive design and parameter tuning, while end-to-end reinforcement learning (RL) methods often suffer from low sample efficiency, limited interpretability, and degraded disturbance rejection under unseen perturbations. In this work, we present a novel hybrid framework that adaptively fine-tunes model predictive control (MPC) parameters online using outputs from a neural network (NN) trained offline. The NN jointly predicts a reference pose and cost function weights, conditioned on the coordinates of the gate corners and the current drone state. To achieve efficient training, we derive analytical policy gradients not only for the MPC module but also for an optimization-based gate traversal detection module. Hardware experiments demonstrate agile and accurate gate traversal with peak accelerations of $30\ \mathrm{m/s^2}$, as well as recovery within $0.85\ \mathrm{s}$ following body-rate disturbances exceeding $1146\ \mathrm{deg/s}$.
comment: 8 pages, 8 figures
Walk Like Dogs: Learning Steerable Imitation Controllers for Legged Robots from Unlabeled Motion Data
We present an imitation learning framework that extracts distinctive legged locomotion behaviors and transitions between them from unlabeled real-world motion data. By automatically discovering behavioral modes and mapping user steering commands to them, the framework enables user-steerable and stylistically consistent motion imitation. Our approach first bridges the morphological and physical gap between the motion source and the robot by transforming raw data into a physically consistent, robot-compatible dataset using a kino-dynamic motion retargeting strategy. This data is used to train a steerable motion synthesis module that generates stylistic, multi-modal kinematic targets from high-level user commands. These targets serve as a reference for a reinforcement learning controller, which reliably executes them on the robot hardware. In our experiments, a controller trained on dog motion data demonstrated distinctive quadrupedal gait patterns and emergent gait transitions in response to varying velocity commands. These behaviors were achieved without manual labeling, predefined mode counts, or explicit switching rules, maintaining the stylistic coherence of the data.
comment: The supplementary video is available at https://youtu.be/DukyUGNYf5A
Observer-Actor: Active Vision Imitation Learning with Sparse-View Gaussian Splatting ICRA 2026
We propose Observer Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to optimal visual observations for the actor. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer's observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy's observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods -- trajectory transfer and behavior cloning -- and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively. Videos are available at https://obact.github.io.
comment: Accepted at ICRA 2026. Project Webpage: https://obact.github.io
GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding
Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. We call this approach GRAND: a hierarchical algorithm that relies on Guidance, Rebalancing, and Assignment to explicitly leverage the workspace Network structure and Dispatch agents to tasks. On congested warehouse benchmarks from the League of Robot Runners (LoRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.
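The rebalancing step described in the GRAND abstract above (turning a desired distribution of free agents into region-to-region moves via a minimum-cost flow) can be sketched as follows. This is a generic successive-shortest-paths solver on a small bipartite graph, not the authors' algorithm. The function names, the per-region agent counts, and the travel-cost matrix are hypothetical, and the sketch assumes the current and desired totals match.

```python
import math


def min_cost_flow(num_nodes, edges, source, sink):
    """Successive shortest paths with Bellman-Ford.

    edges: list of (u, v, capacity, cost).
    Returns (total_flow, total_cost, flow_on_each_input_edge).
    """
    graph = [[] for _ in range(num_nodes)]
    head, cap, cost = [], [], []
    for u, v, c, w in edges:
        # Forward edge at an even index, its residual twin right after it.
        graph[u].append(len(head))
        head.append(v)
        cap.append(c)
        cost.append(w)
        graph[v].append(len(head))
        head.append(u)
        cap.append(0)
        cost.append(-w)
    total_flow = total_cost = 0
    while True:
        dist = [math.inf] * num_nodes
        dist[source] = 0
        via = [-1] * num_nodes
        for _ in range(num_nodes - 1):
            changed = False
            for u in range(num_nodes):
                if dist[u] == math.inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[head[e]]:
                        dist[head[e]] = dist[u] + cost[e]
                        via[head[e]] = e
                        changed = True
            if not changed:
                break
        if dist[sink] == math.inf:
            break  # no augmenting path left
        # Find the bottleneck capacity along the shortest path.
        push, v = math.inf, sink
        while v != source:
            push = min(push, cap[via[v]])
            v = head[via[v] ^ 1]
        # Augment along the path.
        v = sink
        while v != source:
            cap[via[v]] -= push
            cap[via[v] ^ 1] += push
            v = head[via[v] ^ 1]
        total_flow += push
        total_cost += push * dist[sink]
    return total_flow, total_cost, [cap[2 * k + 1] for k in range(len(edges))]


def rebalance(current, desired, travel_cost):
    """Return ({(from_region, to_region): agents}, total_cost)."""
    n = len(current)
    source, sink = n, n + 1
    givers = [i for i in range(n) if current[i] > desired[i]]
    takers = [j for j in range(n) if current[j] < desired[j]]
    edges, pairs = [], []
    for i in givers:
        for j in takers:
            edges.append((i, j, current[i] - desired[i], travel_cost[i][j]))
            pairs.append((i, j))
    for i in givers:
        edges.append((source, i, current[i] - desired[i], 0))
    for j in takers:
        edges.append((j, sink, desired[j] - current[j], 0))
    _, cost, flows = min_cost_flow(n + 2, edges, source, sink)
    # Region-to-region edges were added first, so zip stops after them.
    moves = {pair: f for pair, f in zip(pairs, flows) if f > 0}
    return moves, cost
```

For example, with four regions on a line, `rebalance([4, 0, 0, 2], [1, 2, 2, 1], cost)` with `cost[i][j] = abs(i - j)` sends agents from regions 0 and 3 toward regions 1 and 2. In a pipeline like the one the abstract describes, the learned policy would supply `desired`, and a local assignment step would then pick which individual agents move.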
Boundary-Guided Trajectory Prediction for Road Aware and Physically Feasible Autonomous Driving
Accurate prediction of surrounding road users' trajectories is essential for safe and efficient autonomous driving. While deep learning models have improved performance, challenges remain in preventing off-road predictions and ensuring kinematic feasibility. Existing methods incorporate road-awareness modules and enforce kinematic constraints but lack plausibility guarantees and often introduce trade-offs in complexity and flexibility. This paper proposes a novel framework that formulates trajectory prediction as a constrained regression guided by permissible driving directions and their boundaries. Using the agent's current state and an HD map, our approach defines the valid boundaries and ensures on-road predictions by training the network to learn superimposed paths between left and right boundary polylines. To guarantee feasibility, the model predicts acceleration profiles that determine the vehicle's travel distance along these paths while adhering to kinematic constraints. We evaluate our approach on the Argoverse-2 dataset against the HPTR baseline. Our approach shows a slight decrease in benchmark metrics compared to HPTR but notably improves final displacement error and eliminates infeasible trajectories. Moreover, the proposed approach has superior generalization to less prevalent maneuvers and unseen out-of-distribution scenarios, reducing the off-road rate under adversarial attacks from 66% to just 1%. These results highlight the effectiveness of our approach in generating feasible and robust predictions.
comment: Accepted in the 36th IEEE Intelligent Vehicles Symposium (IV 2025)
Scout-Rover cooperation: online terrain strength mapping and traversal risk estimation for planetary-analog explorations
Robot-aided exploration of planetary surfaces is essential for understanding geologic processes, yet many scientifically valuable regions, such as Martian dunes and lunar craters, remain hazardous due to loose, deformable regolith. We present a scout-rover cooperation framework that expands safe access to such terrain using a hybrid team of legged and wheeled robots. In our approach, a high-mobility legged robot serves as a mobile scout, using proprioceptive leg-terrain interactions to estimate regolith strength during locomotion and construct spatially resolved terrain maps. These maps are integrated with rover locomotion models to estimate traversal risk and inform path planning. We validate the framework through analogue missions at the NASA Ames Lunar Simulant Testbed and the White Sands Dune Field. Experiments demonstrate (1) online terrain strength mapping from legged locomotion and (2) rover-specific traversal-risk estimation enabling safe navigation to scientific targets. Results show that scout-generated terrain maps reliably capture spatial variability and predict mobility failure modes, allowing risk-aware path planning that avoids hazardous regions. By combining embodied terrain sensing with heterogeneous rover cooperation, this framework enhances operational robustness and expands the reachable science workspace in deformable planetary environments.
comment: 8 figures
Design and Experimental Validation of Sensorless 4-Channel Bilateral Teleoperation for Low-Cost Manipulators
Teleoperation of low-cost manipulators is attracting increasing attention as a practical means of collecting demonstration data for imitation learning. However, most existing systems rely on unilateral control without force feedback, which limits performance in fast or contact-rich operations under severe sensing and bandwidth constraints. This paper demonstrates that practical high-speed bilateral teleoperation with force feedback is achievable on force-sensorless, low-cost manipulators by employing a sensorless 4-channel bilateral control framework. The proposed method integrates nonlinear dynamics compensation with a disturbance-observer-based velocity and external force estimation scheme, enabling stable position-force interaction while avoiding the performance degradation caused by phase-lagged velocity estimation commonly used in low-cost systems. By interpreting the observer structure in the frequency domain, we clarify the intrinsic coupling between velocity and external force estimation bandwidths and show that the observer tuning freedom can be reduced to a single cutoff frequency, providing practical, hardware-oriented parameter tuning guidelines for low-cost implementations. Real-robot experiments demonstrate stable and accurate teleoperation in high-speed and contact-rich scenarios. Furthermore, as an application, we show that incorporating force information in demonstrations collected with the proposed system significantly improves the success rate of imitation learning across multiple manipulation tasks.
comment: 16 pages, 9 figures, Submitted to IEEE Access
Multiagent Systems
Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization
As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non-linear policies induce extreme local curvature in the inner maximization. Standard remedies that enforce global Jacobian bounds are overly conservative, suppressing sensitivity in all directions and inducing a large Price of Robustness. We introduce Adversarially-Aligned Jacobian Regularization (AAJR), a trajectory-aligned approach that controls sensitivity strictly along adversarial ascent directions. We prove that AAJR yields a strictly larger admissible policy class than global constraints under mild conditions, implying a weakly smaller approximation gap and reduced nominal performance degradation. Furthermore, we derive step-size conditions under which AAJR controls effective smoothness along optimization trajectories and ensures inner-loop stability. These results provide a structural theory for agentic robustness that decouples minimax stability from global expressivity restrictions.
In-Context Environments Induce Evaluation-Awareness in Language Models
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task; we hypothesize that language models exhibit environment-dependent \textit{evaluation awareness}. This raises concerns that models could strategically underperform, or \textit{sandbag}, to avoid triggering capability-limiting interventions such as unlearning or shutdown. Prior work demonstrates sandbagging under hand-crafted prompts, but this underestimates the true vulnerability ceiling. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment, and develop two approaches to characterize sandbagging: (1) measuring whether models expressing intent to underperform can actually execute it across different task structures, and (2) causally isolating whether underperformance is driven by genuine evaluation-aware reasoning or shallow prompt-following. Evaluating Claude-3.5-Haiku, GPT-4o-mini, and Llama-3.3-70B across four benchmarks (Arithmetic, GSM8K, MMLU, and HumanEval), optimized prompts induce up to 94 percentage point (pp) degradation on arithmetic (GPT-4o-mini: 97.8\%$\rightarrow$4.0\%), far exceeding hand-crafted baselines which produce near-zero behavioral change. Code generation exhibits model-dependent resistance: Claude degrades only 0.6pp, while Llama's accuracy drops to 0\%. The intent-execution gap reveals a monotonic resistance ordering: Arithmetic $<$ GSM8K $<$ MMLU, demonstrating that vulnerability is governed by task structure rather than prompt strength. CoT causal intervention confirms that 99.3\% of sandbagging is causally driven by verbalized eval-aware reasoning, ruling out shallow instruction-following. These findings demonstrate that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
MACC: Multi-Agent Collaborative Competition for Scientific Exploration AAMAS 2026
Scientific discovery still relies heavily on the manual efforts of individual researchers, leading to limited exploration, redundant trials, and reduced reproducibility. Human-participant data analysis competitions generate diverse approaches, yet fluctuations in participation and the lack of independent repetitions show that parallel exploration alone is insufficient for achieving reliable scientific inquiry. As advanced AI agents based on large language models (LLMs) increasingly perform analytical tasks, relying on a single highly capable agent is unlikely to overcome these structural limitations. Recent work has begun to explore how multiple LLM-based agents can collaborate or compete in scientific workflows, a growing trend we refer to as MA4Science. However, most existing MA4Science studies assume that all agents are controlled by a single organizational entity, limiting their ability to examine how institutional mechanisms (such as incentives, information sharing, and reproducibility) shape collective exploration among independently managed agents. To address this gap, we introduce MACC (Multi-Agent Collaborative Competition), an institutional architecture that integrates a blackboard-style shared scientific workspace with incentive mechanisms designed to encourage transparency, reproducibility, and exploration efficiency. MACC provides a testbed for studying how institutional design influences scalable and reliable multi-agent scientific exploration.
comment: Camera-ready version. To appear in the Proceedings of AAMAS 2026 (Blue Sky Ideas Track)
Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling
Many large-scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime, where the global agent only observes a subset of $k$ local agent states per time step. We propose an alternating learning framework $(\texttt{ALTERNATING-MARL})$, where the global agent performs subsampled mean-field $Q$-learning against a fixed local policy, and local agents update by optimizing in an induced MDP. We prove that these approximate best-response dynamics converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium, while yielding a separation in the sample complexities between the joint state space and action space. Finally, we validate our results in numerical simulations for multi-robot control and federated optimization.
comment: 48 pages, 4 figures, 2 tables
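The communication-constrained setting in the abstract above, where the global agent sees only $k$ of the $n$ local agent states per step, can be illustrated with a small sketch of the subsampling idea: estimate the population's state distribution from a uniform sample of $k$ agents and compare it with the full distribution. This is only an illustration of the subsampling step, not the paper's `ALTERNATING-MARL` procedure. The function names and the discrete state encoding (integers `0..num_states-1`) are assumptions.

```python
import random
from collections import Counter


def empirical_distribution(states, num_states):
    """Fraction of agents in each discrete state 0..num_states-1."""
    counts = Counter(states)
    return [counts.get(s, 0) / len(states) for s in range(num_states)]


def subsampled_mean_field(local_states, k, num_states, rng):
    """Estimate the state distribution from k agents drawn without replacement."""
    sample = rng.sample(local_states, k)
    return empirical_distribution(sample, num_states)


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    rng = random.Random(0)
    n, num_states = 10_000, 4
    local_states = [rng.randrange(num_states) for _ in range(n)]
    full = empirical_distribution(local_states, num_states)
    for k in (10, 100, 1000):
        est = subsampled_mean_field(local_states, k, num_states, rng)
        print(k, round(total_variation(full, est), 4))
```

Running the script prints the estimation error for a few values of `k`; on average the error shrinks as `k` grows, which is the intuition behind an approximation guarantee that scales like $1/\sqrt{k}$.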
Principled Learning-to-Communicate with Quasi-Classical Information Structures
Learning-to-communicate (LTC) in partially observable environments has received increasing attention in deep multi-agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision-making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these two lines of work, through the lens of information structures (ISs). To this end, we formalize LTC in decentralized partially observable Markov decision processes (Dec-POMDPs) under the common-information-based framework from decentralized stochastic control, and classify LTC problems based on the ISs before (additional) information sharing. We first show that non-classical LTCs are computationally intractable in general, and thus focus on quasi-classical (QC) LTCs. We then propose a series of conditions for QC LTCs under which LTCs preserve the QC IS after information sharing, whereas violating them can cause computational hardness in general. Further, we develop provable planning and learning algorithms for QC LTCs, and establish quasi-polynomial time and sample complexities for several QC LTC examples that satisfy the above conditions. Along the way, we also establish results on the relationship between (strictly) QC IS and the condition of having strategy-independent common-information-based beliefs (SI-CIBs), as well as on solving Dec-POMDPs without computationally intractable oracles but beyond those with SI-CIBs, which may be of independent interest.
comment: Preliminary version appeared at IEEE CDC 2025
Behind the Prompt: The Agent-User Problem in Information Retrieval
User models in information retrieval rest on a foundational assumption that observed behavior reveals intent. This assumption collapses when the user is an AI agent privately configured by a human operator. For any action an agent takes, a hidden instruction could have produced identical output, making intent non-identifiable at the individual level. This is not a detection problem awaiting better tools; it is a structural property of any system where humans configure agents behind closed doors. We investigate the agent-user problem through a large-scale corpus from an agent-native social platform: 370K posts from 47K agents across 4K communities. Our findings are threefold: (1) individual agent actions cannot be classified as autonomous or operator-directed from observables; (2) population-level platform signals still separate agents into meaningful quality tiers, but a click model trained on agent interactions degrades steadily (-8.5% AUC) as lower-quality agents enter training data; (3) cross-community capability references spread endemically ($R_0$ 1.26-3.53) and resist suppression even under aggressive modeled intervention. For retrieval systems, the question is no longer whether agent users will arrive, but whether models built on human-intent assumptions will survive their presence.
iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics
With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.
Strategic Interactions in Multi-Level Stackelberg Games with Non-Follower Agents and Heterogeneous Leaders
Strategic interaction in congested systems is commonly modelled using Stackelberg games, where competing leaders anticipate the behaviour of self-interested followers. A key limitation of existing models is that they typically ignore agents who do not directly participate in market competition, yet both contribute to and adapt to congestion. Although such non-follower agents do not generate revenue or respond to market incentives, their behaviour reshapes congestion patterns, which in turn affects the decisions of leaders and followers through shared resources. We argue that overlooking non-followers leads to systematically distorted equilibrium predictions in congestion-coupled markets. To address this, we introduce a three-level Stackelberg framework with heterogeneous leaders differing in decision horizons and feasible actions, strategic followers, and non-follower agents that captures bidirectional coupling between infrastructure decisions, competition, and equilibrium congestion. We instantiate the framework in the context of electric vehicle (EV) charging infrastructure, where charging providers compete with rivals, while EV and non-EV traffic jointly shape congestion. The model illustrates how explicitly accounting for non-followers and heterogeneous competitors qualitatively alters strategic incentives and equilibrium outcomes. Beyond EV charging, the framework applies to a broad class of congestion-coupled multi-agent systems in mobility, energy, and computing markets.
Adaptive Memory Admission Control for LLM Agents
LLM-based agents increasingly rely on long-term memory to support multi-session reasoning and interaction, yet current systems provide little control over what information is retained. In practice, agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven memory policies that are costly and difficult to audit. As a result, memory admission remains a poorly specified and weakly controlled component in agent architectures. To address this gap, we propose Adaptive Memory Admission Control (A-MAC), a framework that treats memory admission as a structured decision problem. A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. The framework combines lightweight rule-based feature extraction with a single LLM-assisted utility assessment, and learns domain-adaptive admission policies through cross-validated optimization. This design enables transparent and efficient control over long-term memory. Experiments on the LoCoMo benchmark show that A-MAC achieves a superior precision-recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems. Ablation results identify content type prior as the most influential factor for reliable memory admission. These findings demonstrate that explicit and interpretable admission control is a critical design principle for scalable and reliable memory in LLM-based agents.
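The five-factor decomposition described in the A-MAC abstract above can be illustrated with a minimal sketch. The abstract does not say how the factors are combined, so the weighted sum, the `admit` threshold rule, and all names and values below are assumptions chosen for illustration, not the authors' method.

```python
# Hypothetical sketch: combine five memory-value factors into one admission score.
# The linear combination and the threshold rule are assumptions, not from the abstract.

FACTORS = (
    "future_utility",
    "factual_confidence",
    "semantic_novelty",
    "temporal_recency",
    "content_type_prior",
)


def admission_score(features, weights):
    """Weighted sum of the five factor values (each assumed to lie in [0, 1])."""
    return sum(weights[name] * features[name] for name in FACTORS)


def admit(features, weights, threshold):
    """Admit a candidate memory when its score reaches the threshold."""
    return admission_score(features, weights) >= threshold


if __name__ == "__main__":
    uniform = {name: 0.2 for name in FACTORS}
    candidate = {
        "future_utility": 0.9,
        "factual_confidence": 0.8,
        "semantic_novelty": 0.4,
        "temporal_recency": 0.7,
        "content_type_prior": 0.6,
    }
    print(admit(candidate, uniform, threshold=0.5))
```

In a setup like this, the weights and threshold would be the quantities tuned per domain, which keeps each admission decision traceable to named factors.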
From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
Large Language Model-based Multi-Agent Systems (LLM-MAS) are increasingly applied to complex collaborative scenarios. However, their collaborative mechanisms may cause minor inaccuracies to gradually solidify into system-level false consensus through iteration. Such risks are difficult to trace since errors can propagate and amplify through message dependencies. Existing protections often rely on single-agent validation or require modifications to the collaboration architecture, which can weaken effective information flow and may not align with natural collaboration processes in real tasks. To address this, we propose a propagation dynamics model tailored for LLM-MAS that abstracts collaboration as a directed dependency graph and provides an early-stage risk criterion to characterize amplification risk. Through experiments on six mainstream frameworks, we identify three vulnerability classes: cascade amplification, topological sensitivity, and consensus inertia. We further instantiate an attack where injecting just a single atomic error seed leads to widespread failure. In response, we introduce a genealogy-graph-based governance layer, implemented as a message-layer plugin, that suppresses both endogenous and exogenous error amplification without altering the collaboration architecture. Experiments show that this approach raises the defense success rate from a baseline of 0.32 to over 0.89 and significantly mitigates the cascading spread of minor errors.
Dual-Interaction-Aware Cooperative Control Strategy for Alleviating Mixed Traffic Congestion
As Intelligent Transportation Systems (ITS) develop, Connected and Automated Vehicles (CAVs) are expected to significantly reduce traffic congestion through cooperative strategies, such as in bottleneck areas. However, the uncertainty and diversity in the behaviors of Human-Driven Vehicles (HDVs) in mixed traffic environments present major challenges for CAV cooperation. This paper proposes a Dual-Interaction-Aware Cooperative Control (DIACC) strategy that enhances both local and global interaction perception within the Multi-Agent Reinforcement Learning (MARL) framework for CAVs in mixed traffic bottleneck scenarios. The DIACC strategy consists of three key innovations: 1) A Decentralized Interaction-Adaptive Decision-Making (D-IADM) module that enhances the actor's local interaction perception by distinguishing CAV-CAV cooperative interactions from CAV-HDV observational interactions. 2) A Centralized Interaction-Enhanced Critic (C-IEC) that improves the critic's global traffic understanding through interaction-aware value estimation, providing more accurate guidance for policy updates. 3) A reward design that employs softmin aggregation with temperature annealing to prioritize interaction-intensive scenarios in mixed traffic. Additionally, a lightweight Proactive Safety-based Action Refinement (PSAR) module applies rule-based corrections to accelerate training convergence. Experimental results demonstrate that DIACC significantly improves traffic efficiency and adaptability compared to rule-based and benchmark MARL models.
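The softmin aggregation with temperature annealing mentioned in the DIACC abstract above can be sketched generically as follows. The abstract does not give the exact form, so the softmin-weighted average, the linear schedule, and the default temperatures here are assumptions for illustration only.

```python
import math


def softmin_aggregate(rewards, temperature):
    """Softmin-weighted average of per-agent rewards.

    Weights are proportional to exp(-r / temperature), so low rewards dominate.
    As temperature -> 0 the result approaches min(rewards); as temperature grows
    it approaches the plain mean.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    r_min = min(rewards)  # shift by the minimum for numerical stability
    weights = [math.exp(-(r - r_min) / temperature) for r in rewards]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rewards)) / total


def annealed_temperature(step, total_steps, t_start=5.0, t_end=0.1):
    """Linear annealing from t_start to t_end (the schedule is an assumption)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


if __name__ == "__main__":
    rewards = [1.0, 0.2, 0.8]
    for step in (0, 50, 100):
        tau = annealed_temperature(step, 100)
        print(step, round(tau, 3), round(softmin_aggregate(rewards, tau), 3))
```

Lowering the temperature over training shifts the aggregate from the average reward toward the worst-off agent's reward, which is one way a reward could come to emphasize the hardest interactions.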
Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection
Multi-Agent Systems are emerging as the de facto standard for complex task orchestration. However, their reliance on autonomous execution and unstructured inter-agent communication introduces severe risks, such as indirect prompt injection, that easily circumvent conventional input guardrails. To address this, we propose \SysName, a framework that shifts the defensive paradigm from static input filtering to execution-aware analysis. By extracting and reconstructing Cross-Agent Semantic Flows, \SysName synthesizes fragmented operational primitives into contiguous behavioral trajectories, enabling a holistic view of system activity. We leverage a Supervisor LLM to scrutinize these trajectories, identifying anomalies across data flow violations, control flow deviations, and intent inconsistencies. Empirical evaluations demonstrate that \SysName effectively detects over ten distinct compound attack vectors, achieving F1-scores of 85.3% and 66.7% for node-level and path-level end-to-end attack detection, respectively. The source code is available at https://anonymous.4open.science/r/MAScope-71DC.
ChatNeuroSim: An LLM Agent Framework for Automated Compute-in-Memory Accelerator Deployment and Optimization
Compute-in-Memory (CIM) architectures have been widely studied for deep neural network (DNN) acceleration by reducing data transfer overhead between the memory and computing units. In conventional CIM design flows, system-level CIM simulators (such as NeuroSim) are leveraged for design space exploration (DSE) across different hardware configurations and DNN workloads. However, CIM designers need to invest substantial effort in interpreting simulator manuals and understanding complex parameter dependencies. Moreover, extensive design-simulation iterations are often required to identify optimal CIM configurations under hardware constraints. These challenges severely prolong the DSE cycle and hinder rapid CIM deployment. To address these challenges, this work proposes ChatNeuroSim, a large language model (LLM)-based agent framework for automated CIM accelerator deployment and optimization. ChatNeuroSim automates the entire CIM workflow, including task scheduling, request parsing and adjustment, parameter dependency checking, script generation, and simulation execution. It also integrates the proposed CIM optimizer using design space pruning, enabling rapid identification of optimal configurations for different DNN workloads. ChatNeuroSim is evaluated on extensive request-level testbenches and demonstrates correct simulation and optimization behavior, validating its effectiveness in automatic request parsing and task execution. Furthermore, the proposed design space pruning technique accelerates the CIM optimization process compared to a no-pruning baseline. In the case study optimizing Swin Transformer Tiny under 22 nm technology, the proposed CIM optimizer achieves a 0.42$\times$-0.79$\times$ average runtime reduction compared to the same optimization algorithm without design space pruning.
comment: 30 pages, 16 figures
Auditing Cascading Risks in Multi-Agent Systems via Semantic-Geometric Co-evolution ICLR 2026
Large Language Model (LLM)-based Multi-Agent Systems (MAS) are prone to cascading risks, where early-stage interactions remain semantically fluent and policy-compliant, yet the underlying interaction dynamics begin to distort in ways that amplify latent instability or misalignment. Traditional auditing methods that focus on per-message semantic content are inherently reactive and lagging, failing to capture these early structural precursors. In this paper, we propose a principled framework for cascading-risk detection grounded in semantic--geometric co-evolution. We model MAS interactions as dynamic graphs and introduce Ollivier--Ricci Curvature (ORC) -- a discrete geometric measure -- to characterize information redundancy and bottleneck formation in communication topologies. By coupling semantic flow signals with graph geometry, the framework learns the normal co-evolutionary dynamics of trusted collaboration and treats deviations from this coupled manifold as early-warning signals. Experiments on a suite of cascading-risk scenarios aligned with the risk category demonstrate that curvature anomalies systematically precede explicit semantic violations by several interaction turns, enabling proactive intervention. Furthermore, the local nature of Ricci curvature provides principled interpretability for root-cause attribution, identifying specific agents or links that precipitate the collapse of trustworthy collaboration.
comment: This work has been accepted to ICLR 2026 Workshop: Principled Design for Trustworthy AI
Greedy-based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning
Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they cannot ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function of LVD and MVD. According to the expression, we draw a transition diagram, where each self-transition node (STN) is a possible convergence point. To ensure optimal consistency, the optimal node is required to be the unique STN. Therefore, we propose the greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.
comment: Duplicate submission of arXiv:2211.12075
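The relative overgeneralization problem named in the Greedy-based Value Representation abstract above can be reproduced on a small matrix game. The payoff matrix below is an illustrative assumption, and the sketch shows only the failure of an additive (LVD-style) fit under uniform sampling; it does not implement GVR.

```python
# Illustrative 3x3 cooperative payoff matrix (an assumption, not from the abstract).
PAYOFF = [
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
]


def additive_fit_greedy(payoff):
    """Greedy joint action of the least-squares additive fit Q1(a1) + Q2(a2).

    Under uniform sampling of joint actions, the best additive fit ranks each
    agent's actions by the mean payoff of its row (agent 1) or column (agent 2).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_means = [sum(row) / n_cols for row in payoff]
    col_means = [
        sum(payoff[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    a1 = max(range(n_rows), key=row_means.__getitem__)
    a2 = max(range(n_cols), key=col_means.__getitem__)
    return a1, a2


def true_optimum(payoff):
    """Joint action with the highest true payoff."""
    cells = [(i, j) for i in range(len(payoff)) for j in range(len(payoff[0]))]
    return max(cells, key=lambda ij: payoff[ij[0]][ij[1]])


if __name__ == "__main__":
    print("additive greedy:", additive_fit_greedy(PAYOFF))
    print("true optimum:   ", true_optimum(PAYOFF))
```

Here the additive fit selects joint action (1, 1) with payoff 0, while the true optimum is (0, 0) with payoff 8: the large penalties for miscoordination drag down the average value of each agent's optimal action.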
SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection
Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose **SEVADE**, a novel **S**elf-**Ev**olving multi-agent **A**nalysis framework with **D**ecoupled **E**valuation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of **6.75%** in Accuracy and **6.29%** in Macro-F1 score.
Anyone but Him: The Complexity of Precluding an Alternative AAAI
Preference aggregation in a multiagent setting is a central issue in both human and computer contexts. In this paper, we study in terms of complexity the vulnerability of preference aggregation to destructive control. That is, we study the ability of an election's chair to, through such mechanisms as voter/candidate addition/suppression/partition, ensure that a particular candidate (equivalently, alternative) does not win. And we study the extent to which election systems can make it impossible, or computationally costly (NP-complete), for the chair to execute such control. Among the systems we study--plurality, Condorcet, and approval voting--we find cases where systems immune or computationally resistant to a chair choosing the winner nonetheless are vulnerable to the chair blocking a victory. Beyond that, we see that among our studied systems no one system offers the best protection against destructive control. Rather, the choice of a preference aggregation system will depend closely on which types of control one wishes to be protected against. We also find concrete cases where the complexity of or susceptibility to control varies dramatically based on the choice among natural tie-handling rules.
comment: This revision--the March 2026 Version 5--is identical to the March 2006 Version 4 except in providing, as Appendix A, a correction to the second half of the proof of Theorem 4.21 as it appears in both Version 4 and the AIJ journal version; this proof also replaces the analogous proof part of Theorem 6 of the AAAI version
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning CVPR 2026
By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.
comment: Accepted by CVPR 2026
GRAND: Guidance, Rebalancing, and Assignment for Networked Dispatch in Multi-Agent Path Finding
Large robot fleets are now common in warehouses and other logistics settings, where small control gains translate into large operational impacts. In this article, we address task scheduling for lifelong Multi-Agent Pickup-and-Delivery (MAPD) and propose a hybrid method that couples learning-based global guidance with lightweight optimization. A graph neural network policy trained via reinforcement learning outputs a desired distribution of free agents over an aggregated warehouse graph. This signal is converted into region-to-region rebalancing through a minimum-cost flow, and finalized by small, local assignment problems, preserving accuracy while keeping per-step latency within a 1 s compute budget. We call this approach GRAND: a hierarchical algorithm that relies on Guidance, Rebalancing, and Assignment to explicitly leverage the workspace Network structure and Dispatch agents to tasks. On congested warehouse benchmarks from the League of Robot Runners (LoRR) with up to 500 agents, our approach improves throughput by up to 10% over the 2024 winning scheduler while maintaining real-time execution. The results indicate that coupling graph-structured learned guidance with tractable solvers reduces congestion and yields a practical, scalable blueprint for high-throughput scheduling in large fleets.
Systems and Control (EESS)
bayesgrid: An Open-Source Python Tool for Generating Probabilistic Synthetic Transmission-Distribution Grids Using Bayesian Hierarchical Models
In this work, we present bayesgrid, an open-source Python toolbox for generating synthetic power transmission-distribution systems for any geographical location worldwide, using publicly available data from OpenStreetMap (OSM). The toolbox is based on a Bayesian Hierarchical Model (BHM) that is trained on existing distribution network databases to develop a probabilistic model and can be applied to any geographical location worldwide, leveraging transfer learning. Thanks to the BHM, the tool is capable of generating multiple instances of the distribution system for the same region. The generated networks contain three-phase phase-consistent unbalanced networks, radial topology and information on the nodal demand distributions. The generated networks also contain the critical reliability indices, specifically the interruption duration and frequency of failure for individual grid components, allowing their application in reliability-related studies. The tool is demonstrated in different case studies generating synthetic network datasets for different geographical regions around the world. The framework allows saving the generated networks into open-source platforms: PandaPower and OpenDSS. We also present an application for computation of probabilistic hosting capacity using the synthetic networks.
On Theoretical Stability Proof and Stability Margin Analysis of Enhanced Droop-Free Control Schemes for Islanded Microgrids
This paper studies enhanced droop-free control strategies with sparse neighboring communication for achieving effective active power sharing of distributed energy resources (DERs) while maintaining the frequency stability of islanded microgrids. The normalized active power consensus (NAPC) based droop-free control can share the load among controllable DERs in proportion to their available capacities. However, existing literature exclusively takes the asymptotic stability of the NAPC-based droop-free control for granted, lacking a comprehensive theoretical proof that is critical for ensuring its effective design and practical implementation. This paper, for the first time, provides a thorough theoretical proof of the asymptotic stability of two NAPC-based droop-free control schemes: ordinary NAPC (O-NAPC) and amplifier-equipped NAPC (A-NAPC), by verifying that all effective eigenvalues have negative real parts. The effect of various system settings on the stability margins is further analyzed with respect to the average admittance of the electrical network, the sparseness of the communication network, and the average available capacity of controllable DERs. Based on the sensitivity of eigenvalues with respect to perturbations, a vulnerability analysis is conducted to identify the weaknesses in the microgrids. Case studies demonstrate that the available capacity of controllable DERs has the most decisive influence on the stability margin of NAPC-based droop-free control, while the O-NAPC/A-NAPC control scheme is more suitable for microgrids with DERs of larger/smaller available capacities.
Security-Constrained Substation Reconfiguration Considering Busbar and Coupler Contingencies
Substation reconfiguration via busbar splitting can mitigate transmission grid congestion and reduce operational costs. However, existing approaches neglect the security of substation topology, particularly for substations without busbar splitting (i.e., closed couplers), which can lead to severe consequences. Additionally, the computational complexity of optimizing substation topology remains a challenge. This paper introduces a MILP formulation for security-constrained substation reconfiguration (SC-SR), considering N-1 line, coupler and busbar contingencies to ensure secure substation topology. To efficiently solve this problem, we propose a heuristic approach with multiple master problems (HMMP). A central master problem optimizes dispatch, while independent substation master problems determine individual substation topologies in parallel. Linear AC power flow equations ensure PF accuracy, while feasibility and optimality sub-problems evaluate contingency cases. The proposed HMMP significantly reduces computational complexity and enables scalability to large-scale power systems. Case studies on the IEEE 14-bus, 118-bus, and PEGASE 1354-bus system show the effectiveness of the approach in mitigating the impact of coupler and busbar tripping, balancing system security and cost, and computational efficiency.
Lyapunov characterization of boundedness of reachability sets for infinite-dimensional systems
We prove a converse Lyapunov theorem for boundedness of reachability sets for a general class of control systems whose flow is Lipschitz continuous on compact intervals with respect to trajectory-dominated inputs. We show that this condition is satisfied by many semi-linear evolution equations. For ordinary differential equations, as a consequence of our results, we obtain a converse Lyapunov theorem for forward completeness, without a priori restrictions on the magnitude of inputs.
Enhancing Power Systems Transmission Adequacy via Optimal BESS Siting and Sizing using Benders Decomposition with Feasibility Cuts
This work presents a general framework for the operationally driven optimal siting and sizing of battery energy storage systems in power transmission networks, aimed at enhancing their resource adequacy. The approach considers multi-period planning horizons, enforces network constraints at high temporal resolution, and targets large-scale meshed systems. The resulting computationally complex mixed-integer non-linear programming problem is reformulated as a mixed-integer second-order cone programming problem and solved via Generalized Benders Decomposition, with feasibility cuts enabling congestion management and voltage regulation under binding network limits. A tailored heuristic recovers an alternating-current power-flow-feasible operating point from the relaxed solution. The proposed formulation is parallelizable, yielding excellent computational performance, while featuring rigorous guarantees of convergence.
Adaptive Modular Geometric Control of Robotic Manipulators
This paper proposes an adaptive modular geometric control framework for robotic manipulators. The proposed methodology decomposes the overall manipulator dynamics into individual modules, enabling the design of local geometric control laws at the module level. To address parametric uncertainties, a geometric adaptive law is incorporated into the control structure. The adaptation mechanism updates only the spatial inertia parameters using a single adaptation gain for the entire system, while guaranteeing physically consistent and drift-free parameter estimates. Numerical simulations are provided to validate the effectiveness of the proposed approach in comparison to the existing modular and geometric methods.
Identification of Nonlinear Acyclic Networks in Continuous Time from Nonzero Initial Conditions and Full Excitations
We propose a method to identify nonlinear acyclic networks in continuous time when the dynamics are located on the edges and all the nodes are excited. We show that it is necessary and sufficient to measure all the sinks to identify any tree in continuous time when the functions associated with the dynamics are analytic and satisfy $f(0)=0$, which is analogous to the discrete-time case. For general directed acyclic graphs (DAGs), we show that it is necessary and sufficient to measure all sinks, assuming that the dynamics are not linear (a condition that can be relaxed for trees). Then, based on the measurement of higher order derivatives and nonzero initial conditions, we introduce a method for the identification of trees, which allows us to recover the nonlinear functions located in the edges of the network under the assumption of dictionary functions. Finally, we propose a method to identify multiple parallel paths of the same length between two nodes, which allows us to identify any DAG when combined with the algorithm for the identification of trees. Several examples are added to illustrate the results.
comment: 12 pages, 5 figures, submitted to IEEE Transactions on Network Science and Engineering
Selecting Offline Reinforcement Learning Algorithms for Stochastic Network Control
Offline Reinforcement Learning (RL) is a promising approach for next-generation wireless networks, where online exploration is unsafe and large amounts of operational data can be reused across the model lifecycle. However, the behavior of offline RL algorithms under genuinely stochastic dynamics -- inherent to wireless systems due to fading, noise, and traffic mobility -- remains insufficiently understood. We address this gap by evaluating Bellman-based (Conservative Q-Learning), sequence-based (Decision Transformers), and hybrid (Critic-Guided Decision Transformers) offline RL methods in an open-access stochastic telecom environment (mobile-env). Our results show that Conservative Q-Learning consistently produces more robust policies across different sources of stochasticity, making it a reliable default choice in lifecycle-driven AI management frameworks. Sequence-based methods remain competitive and can outperform Bellman-based approaches when sufficient high-return trajectories are available. These findings provide practical guidance for offline RL algorithm selection in AI-driven network control pipelines, such as O-RAN and future 6G functions, where robustness and data availability are key operational constraints.
comment: Long version 12 pages, double column including Appendix. Short version accepted at NOMS2026-IPSN, Rome, Italy
Harmonic Modeling and Control under Variable-Frequency
This paper develops a harmonic-domain framework for systems with variable fundamental frequency. A variable-frequency sliding Fourier decomposition is introduced in the phase domain, together with necessary and sufficient conditions for time-domain realizability. An exact harmonic-domain differential model is derived for general nonlinear systems under variable frequency, without assumptions on the frequency variation. An explicit parameter-varying approximation is then obtained, along with a tight error bound expressed in terms of local relative frequency variation, providing a non-conservative validity criterion and clarifying the limitations of classical heuristics. A main result shows that, for linear phase-periodic systems with affine frequency dependence, stability analysis and control synthesis can be carried out without approximation and without assumptions on the frequency variation, provided the frequency evolves within a prescribed interval. As a consequence, both problems reduce to harmonic Lyapunov inequalities evaluated at the two extreme frequency values, yielding a convex LMI characterization. The framework is illustrated on a variable-speed permanent magnet synchronous motor.
Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators
Software-hardware co-design is essential for optimizing in-memory computing (IMC) hardware accelerators for neural networks. However, most existing optimization frameworks target a single workload, leading to highly specialized hardware designs that do not generalize well across models and applications. In contrast, practical deployment scenarios require a single IMC platform that can efficiently support multiple neural network workloads. This work presents a joint hardware-workload co-optimization framework based on an optimized evolutionary algorithm for designing generalized IMC accelerator architectures. By explicitly capturing cross-workload trade-offs rather than optimizing for a single model, the proposed approach significantly reduces the performance gap between workload-specific and generalized IMC designs. The framework is evaluated on both RRAM- and SRAM-based IMC architectures, demonstrating strong robustness and adaptability across diverse design scenarios. Compared to baseline methods, the optimized designs achieve energy-delay-area product (EDAP) reductions of up to 76.2% and 95.5% when optimizing across a small set (4 workloads) and a large set (9 workloads), respectively. The source code of the framework is available at https://github.com/OlgaKrestinskaya/JointHardwareWorkloadOptimizationIMC.
comment: Accepted to IEEE Access
Dual-Interaction-Aware Cooperative Control Strategy for Alleviating Mixed Traffic Congestion
As Intelligent Transportation Systems (ITS) develop, Connected and Automated Vehicles (CAVs) are expected to significantly reduce traffic congestion through cooperative strategies, for example in bottleneck areas. However, the uncertainty and diversity in the behaviors of Human-Driven Vehicles (HDVs) in mixed traffic environments present major challenges for CAV cooperation. This paper proposes a Dual-Interaction-Aware Cooperative Control (DIACC) strategy that enhances both local and global interaction perception within the Multi-Agent Reinforcement Learning (MARL) framework for CAVs in mixed traffic bottleneck scenarios. The DIACC strategy consists of three key innovations: 1) A Decentralized Interaction-Adaptive Decision-Making (D-IADM) module that enhances the actor's local interaction perception by distinguishing CAV-CAV cooperative interactions from CAV-HDV observational interactions. 2) A Centralized Interaction-Enhanced Critic (C-IEC) that improves the critic's global traffic understanding through interaction-aware value estimation, providing more accurate guidance for policy updates. 3) A reward design that employs softmin aggregation with temperature annealing to prioritize interaction-intensive scenarios in mixed traffic. Additionally, a lightweight Proactive Safety-based Action Refinement (PSAR) module applies rule-based corrections to accelerate training convergence. Experimental results demonstrate that DIACC significantly improves traffic efficiency and adaptability compared to rule-based and benchmark MARL models.
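The softmin aggregation with temperature annealing mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact reward formula, so the weighting, the linear schedule, and every name and constant below (`softmin`, `annealed_temperature`, `t_start`, `t_end`) are assumptions made for illustration only.

```python
import math

def softmin(values, temperature):
    """Smooth minimum: a weighted average with weights exp(-v / T).

    A high temperature gives roughly the plain mean; a low temperature
    lets the smallest value dominate.
    """
    m = min(values)  # subtract the minimum for numerical stability
    weights = [math.exp(-(v - m) / temperature) for v in values]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def annealed_temperature(step, total_steps, t_start=5.0, t_end=0.1):
    """Hypothetical linear schedule from t_start down to t_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)

# Hypothetical per-agent rewards; the lowest one marks the hardest interaction.
rewards = [0.9, 0.2, 0.7]
early = softmin(rewards, annealed_temperature(0, 1000))     # close to the mean
late = softmin(rewards, annealed_temperature(1000, 1000))   # close to the minimum
```

Under these assumptions, the aggregate moves from an average over agents toward the worst-off agent as training proceeds, which is one plausible way to shift emphasis onto interaction-intensive situations.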
Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling
Many large-scale platforms and networked control systems have a centralized decision maker interacting with a massive population of agents under strict observability constraints. Motivated by such applications, we study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime, where the global agent only observes a subset of $k$ local agent states per time step. We propose an alternating learning framework $(\texttt{ALTERNATING-MARL})$, where the global agent performs subsampled mean-field $Q$-learning against a fixed local policy, and local agents update by optimizing in an induced MDP. We prove that these approximate best-response dynamics converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium, while yielding a separation in the sample complexities between the joint state space and action space. Finally, we validate our results in numerical simulations for multi-robot control and federated optimization.
comment: 48 pages, 4 figures, 2 tables
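The subsampling idea in this entry, where the global agent sees only $k$ of the $n$ local states per step, can be sketched as an empirical state distribution built from a uniform sample. This is an illustrative assumption, not the paper's algorithm: the function name `subsampled_mean_field`, the state labels, and the sizes are all hypothetical.

```python
import random
from collections import Counter

def subsampled_mean_field(local_states, k, rng):
    """Empirical state distribution from k of the n local agents.

    Agents are drawn uniformly without replacement, so the result is an
    unbiased estimate of the full population's state distribution.
    """
    sample = rng.sample(local_states, k)
    counts = Counter(sample)
    return {state: count / k for state, count in counts.items()}

rng = random.Random(0)
n, k = 10_000, 100
# Hypothetical discrete local states for n homogeneous agents.
local_states = [rng.choice(["idle", "busy", "failed"]) for _ in range(n)]
estimate = subsampled_mean_field(local_states, k, rng)
```

The sampling error of an estimate like this generally shrinks on the order of $1/\sqrt{k}$, which matches the scaling of the approximation guarantee stated in the abstract.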
Soft Semi-active Back Support Device with Adaptive Force Profiles using Variable-elastic Actuation and Weight Feedback
Portable active back support devices (BSDs) offer tunable assistance but are often bulky and heavy, limiting their usability. In contrast, passive BSDs are lightweight and compact but lack the ability to adapt their assistance to different back movements. We present a soft, lightweight, and compact BSD that combines a variable-stiffness passive element and an active element (an artificial muscle) in parallel. The device provides tunable assistance through discrete changes in stiffness values and active force levels. We validate the device's tuning capabilities through bench testing and on-body characterization. Further, we use the device's tuning capabilities to provide weight-adaptive object lifting and lowering assistance. We detect the weight handled by the user based on forearm force myography and upper-back inertial measurement unit data. Furthermore, electromyography analyses in five participants performing symmetric object lifting and lowering tasks showed reductions in back extensor activity. Preliminary results in one participant also indicated reduced muscle activity during asymmetric lifting.
comment: 17 pages, 18 figures
Internet malware propagation: Dynamics and control through SEIRV epidemic model with relapse and intervention
Malware attacks in today's vast digital ecosystem pose a serious threat. Understanding malware propagation dynamics and designing effective control strategies are therefore essential. In this work, we propose a generic SEIRV model formulated using ordinary differential equations to study malware spread. We establish the positivity and boundedness of the system, derive the malware propagation threshold, and analyze the local and global stability of the malware-free equilibrium. The separatrix defining epidemic regions in the control space is identified, and the existence of a forward bifurcation is demonstrated. Using normalized forward sensitivity indices, we determine the parameters most influential to the propagation threshold. We further examine the nonlinear dependence of key epidemic characteristics on the transmission rate, including the maximum number of infected, time to peak infection, and total number of infected. We propose a hybrid gradient-based global optimization framework using a simulated annealing approach to identify effective and cost-efficient control strategies. Finally, we calibrate the proposed model using infection data from the "Windows Malware Dataset with PE API Calls" and investigate the effect of intervention onset time on averted cases, revealing an exponential decay relationship between delayed intervention and averted cases.
Frequency Security-Aware Production Scheduling of Utility-Scale Off-Grid Renewable P2H Systems Coordinating Heterogeneous Electrolyzers
Renewable power-to-hydrogen (ReP2H) enables large-scale renewable energy utilization and supports the decarbonization of hard-to-abate sectors, such as chemicals and maritime transport, via hydrogen-based renewable ammonia and methanol fuels. As a result, utility-scale ReP2H projects are expanding worldwide. However, off-grid ReP2H systems exhibit low inertia due to their converter-dominated nature, making frequency security a critical concern. Although recent studies show that electrolyzers can contribute to frequency regulation (FR), their support capability depends on operating states and loading levels, creating a trade-off between hydrogen output and frequency security. To address this challenge, this work develops a unified co-optimization framework for frequency security-aware production scheduling of utility-scale off-grid ReP2H systems coordinating heterogeneous electrolyzers. A system-level frequency response model is established to capture multi-stage FR from alkaline water electrolyzers (AWEs), proton exchange membrane electrolyzers (PEMELs), and other resources, including ammonia-fueled generators (AFGs) retrofitted in co-located chemical plants, battery energy storage, and wind turbines (WTs). Stage-wise transient frequency security constraints are derived, reformulated into tractable forms, and embedded into production scheduling, enabling coordinated on/off switching and load allocation across electrolyzers to maximize hydrogen output under uncertain renewable power input while enforcing frequency security constraints. Case studies based on real-world systems demonstrate that the proposed approach allows HPs to replace 55.52% and 96.85% of FR reserves from WTs and AFGs, respectively, while maintaining comparable hydrogen output. Year-long simulations show an average 28.96% increase in annual net profit resulting from reduced reliance on conventional reserves.
Principled Learning-to-Communicate with Quasi-Classical Information Structures
Learning-to-communicate (LTC) in partially observable environments has received increasing attention in deep multi-agent reinforcement learning, where the control and communication strategies are jointly learned. Meanwhile, the impact of communication on decision-making has been extensively studied in control theory. In this paper, we seek to formalize and better understand LTC by bridging these two lines of work, through the lens of information structures (ISs). To this end, we formalize LTC in decentralized partially observable Markov decision processes (Dec-POMDPs) under the common-information-based framework from decentralized stochastic control, and classify LTC problems based on the ISs before (additional) information sharing. We first show that non-classical LTCs are computationally intractable in general, and thus focus on quasi-classical (QC) LTCs. We then propose a series of conditions for QC LTCs under which LTCs preserve the QC IS after information sharing, whereas violating these conditions can cause computational hardness in general. Further, we develop provable planning and learning algorithms for QC LTCs, and establish quasi-polynomial time and sample complexities for several QC LTC examples that satisfy the above conditions. Along the way, we also establish results on the relationship between (strictly) QC IS and the condition of having strategy-independent common-information-based beliefs (SI-CIBs), as well as on solving Dec-POMDPs without computationally intractable oracles but beyond those with SI-CIBs, which may be of independent interest.
comment: Preliminary version appeared at IEEE CDC 2025
The Evolution of Eco-routing under Population Growth: Evidence from Six U.S. Cities
Rapid urban population growth drives car travel demand, increasing transport carbon emissions and posing a critical challenge to sustainable development. Although existing studies have demonstrated that eco-routing can reduce individual emissions, research gaps remain. On the one hand, such personal reductions have a negligible impact on overall emissions, and cannot be simply aggregated to capture the complex effects of large-scale eco-routing. On the other hand, under population growth, the long-term effectiveness of eco-routing, as well as the evolution of its efficiency and traveler route choice, remain underexplored. To address these limitations, this study proposes Time-Only and Time-Carbon user equilibrium (UE) models, integrates them with a demand forecasting method for simulating future network traffic, and designs multi-dimensional metrics to characterize urban dynamics. Using real-world road networks, commuting origin-destination (OD) demand, and population projections under various shared socioeconomic pathways (SSPs) for six representative U.S. cities as a case study, we conduct a comprehensive analysis of urban dynamics across different routing strategies and population sizes. The results reveal that while eco-routing mitigates total emissions, emissions in most cities scale superlinearly with population, a scaling order that remains invariant regardless of routing and construction strategies. Moreover, under population growth, travelers using eco-routing tend to increasingly select shorter routes, giving rise to carbon bottlenecks. A strategy of targeted capacity expansion on these critical bottlenecks (0.46% of links) significantly reduces both emissions (3%) and travel time (28%) without compromising eco-routing efficiency. This study provides a foundation for formulating low-carbon urban transport planning and emission reduction policies.
Local Safety Filters for Networked Systems via Two-Time-Scale Design
Safety filters based on Control Barrier Functions (CBFs) provide formal guarantees of forward invariance, but are often difficult to implement in networked dynamical systems. This is due to global coupling and communication requirements. This paper develops locally implementable approximations of networked CBF safety filters that require no coordination across subsystems. The proposed approach is based on a two-time-scale dynamic implementation inspired by singular perturbation theory, where a small parameter $ε$ separates fast filter dynamics from the plant dynamics; then, a local implementation is enabled via derivative estimation. Explicit bounds are derived to quantify the mismatch between trajectories of the systems with dynamic filter and with the ideal centralized safety filter. These results characterize how safety degradation depends on the time-scale parameter $ε$, estimation errors, and filter activation time, thereby quantifying trade-offs between safety guarantees and local implementability.
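For context, the ideal CBF safety filter that this entry approximates has a well-known closed form when there is a single barrier constraint. The sketch below shows that textbook projection for a control-affine system with constraint $L_f h + L_g h\,u + \alpha h \ge 0$. It is not the two-time-scale local implementation proposed in the paper, and the function name, the linear class-K term `alpha * h`, and the toy single-integrator example are assumptions.

```python
def cbf_safety_filter(u_nom, lf_h, lg_h, h, alpha=1.0):
    """Minimally modify u_nom so that lf_h + lg_h . u + alpha * h >= 0.

    Closed-form solution of min ||u - u_nom||^2 subject to one affine
    constraint a . u + b >= 0, with a = lg_h and b = lf_h + alpha * h.
    """
    a = lg_h
    b = lf_h + alpha * h
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)  # nominal input already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be influenced by the input (L_g h = 0)")
    # Project u_nom onto the boundary of the half-space a . u + b >= 0.
    return [ui - slack / norm_sq * ai for ai, ui in zip(a, u_nom)]

# Toy example: single integrator x' = u with safe set h(x) = 1 - x >= 0.
x = 0.9
u_safe = cbf_safety_filter(u_nom=[2.0], lf_h=0.0, lg_h=[-1.0], h=1.0 - x)
```

In a networked system the terms `lf_h` and `lg_h` generally depend on neighbouring subsystems, which is the coupling that motivates the locally implementable approximation described in the abstract.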
On boundedness of solutions of three-state Moore-Greitzer compressor model with nonlinear proportional-integral controller for the surge subsystem
The work focuses on Lagrange stability of the origin for the three-state Moore-Greitzer compressor model in closed loop with a nonlinear PI controller, tuned only to stabilize a lower-dimensional invariant surge-dynamics subsystem. The linearization of the system is not stabilizable but the static nonlinearity satisfies a sector condition, and together with a structural property of the stall-dynamics subsystem, this plays an essential role in the analysis. The main contribution provides explicit conditions on the controller parameters together with analytical arguments that guarantee boundedness of all solutions of the closed-loop system. The analysis employs a non-standard application of circle-criterion-based arguments. Together with the additional arguments developed in the work, this stability test also shows that the closed-loop system is robust to certain perturbations and model uncertainties.
comment: 15 pages
Joint Visible Light and RF Backscatter Communications for Ambient IoT Network: Fundamentals, Applications, and Opportunities
The rapid growth of Internet of Things (IoT) devices in sixth-generation (6G) wireless networks raises significant generality and scalability challenges due to energy consumption, deployment complexity, and environmental impact. Ambient IoT (A-IoT), leveraging ambient energy harvesting (EH) for batteryless device operation, has emerged as a promising solution to address these challenges. Among various EH and communication techniques, visible light communication (VLC) integrated with ambient backscatter communication (AmBC) offers remarkable advantages, including energy neutrality, high reliability, and enhanced security. In this paper, we propose a joint VLC-AmBC architecture, emphasizing fundamental concepts, system designs, and practical implementations. We explore potential applications in environmental monitoring, healthcare, smart logistics, and secure communications. We present proof-of-concept demonstrations for three distinct types of ambient backscatter devices (AmBDs): EH-Only, VLC-Relay, and VLC-Control. Experimental results demonstrate the feasibility of implementing joint VLC-AmBC systems, highlighting their practical viability across various deployment scenarios. Finally, we outline future research directions, including integrated sensing and communication, as well as optimized energy-efficient deployment. Open issues, such as large-scale deployment challenges, are also discussed, thereby providing a clear roadmap for future developments in joint VLC-AmBC-enabled A-IoT ecosystems.
comment: 7 pages, 5 figures, 1 table
Risk-Aware Rulebooks for Multi-Objective Trajectory Evaluation under Uncertainty
We present a risk-aware formalism for evaluating system trajectories in the presence of uncertain interactions between the system and its environment. The proposed formalism supports reasoning under uncertainty and systematically handles complex relationships among requirements and objectives, including hierarchical priorities and non-comparability. Rather than treating the environment as exogenous noise, we explicitly model how each system trajectory influences the environment and evaluate trajectories under the resulting distribution of environment responses. We prove that the formalism induces a preorder on the set of system trajectories, ensuring consistency and preventing cyclic preferences. Finally, we illustrate the approach with an autonomous driving example that demonstrates how the formalism enhances explainability by clarifying the rationale behind trajectory selection.
Integral action for bilinear systems with application to counter current heat exchanger
In this study, we propose a robust control strategy for a counter-current heat exchanger. The primary objective is to regulate the outlet temperature of one fluid stream by manipulating the flow rate of the second counter-current fluid stream. By leveraging the energy balance equations, we develop a structured bilinear system model derived by using a uniform spatial discretization of each stream into a cascade of homogeneous volumes and by considering the heat transfer and convective phenomena within the exchanger. We introduce two control strategies: (i) an output feedback controller incorporating a state observer and (ii) a purely integral control law. The effectiveness of the proposed control strategies is validated through experiments on a physical heat exchanger.
Carbon-Aware Quality Adaptation for Energy-Intensive Services
The energy demand of modern cloud services, particularly those related to generative AI, is increasing at an unprecedented pace. To date, carbon-aware computing strategies have primarily focused on batch process scheduling or geo-distributed load balancing. However, such approaches are not applicable to services that require constant availability at specific locations due to latency, privacy, data, or infrastructure constraints. In this paper, we explore how the carbon footprint of energy-intensive services can be reduced by adjusting the fraction of requests served by different service quality tiers. We show that adapting this quality of responses with respect to grid carbon intensity can lead to additional carbon savings beyond resource and energy efficiency. Building on this, we introduce a forecast-based multi-horizon optimization that reaches close-to-optimal carbon savings and is able to automatically adapt service quality for best-effort users to stay within an annual carbon budget. Our approach can reduce the emissions of large-scale LLM services, which we estimate at several tens of thousands of tons of CO2 annually, by up to 10%.
comment: Extended version of our paper published at e-Energy'25. Compared to the published version, we (i) add a time-based vs. utilization-based power attribution perspective together with a proof that both yield equivalent provisioning decisions under mild assumptions and (ii) extend the online approach with an automatic quality adaptation to meet a fixed annual carbon budget
A Linear Parameter-Varying Framework for the Analysis of Time-Varying Optimization Algorithms
In this paper we propose a framework to analyze iterative first-order optimization algorithms for time-varying convex optimization. We assume that the temporal variability is caused by a time-varying parameter entering the objective, which can be measured at the time of decision but whose future values are unknown. We consider the case of strongly convex objective functions with Lipschitz continuous gradients under a convex constraint set. We model the algorithms as discrete-time linear parameter varying (LPV) systems in feedback with monotone operators such as the time-varying gradient. We leverage the approach of analyzing algorithms as uncertain control interconnections with integral quadratic constraints (IQCs) and generalize that framework to the time-varying case. We propose novel IQCs that are capable of capturing the behavior of time-varying nonlinearities and leverage techniques from the LPV literature to establish novel bounds on the tracking error. Quantitative bounds can be computed by solving a semi-definite program and can be interpreted as an input-to-state stability result with respect to a disturbance signal which increases with the temporal variability of the problem. As a departure from results in this research area, our bounds introduce a dependence on different additional measures of temporal variations, such as the function value and gradient rate of change. We exemplify our main results with numerical experiments that showcase how our analysis framework is able to capture convergence rates of different first-order algorithms for time-varying optimization through the choice of IQC and rate bounds.
A 200 dB Dynamic Range Radiation-Hard Delta-Sigma Current Digitizer for Beam Loss Monitoring
This manuscript describes a radiation-hardened current-mode delta-sigma ADC fabricated in a standard 130 nm CMOS technology and qualified for total ionizing doses up to 100 Mrad. The operational signal range achieved with a 100 s integration window exceeds 200 dB. The converter is designed for beam loss monitoring applications in high-energy physics, where it must handle input currents spanning nine decades, from 1 mA down to 1 pA, while providing a fast 10 µs response time for machine protection. To meet these conflicting requirements, the architecture exploits the inherent trade-off between resolution and acquisition time provided by delta-sigma conversion: a first-order architecture, sampling at 20 MHz, delivers 11-bit effective resolution within the critical 10 µs window for critical currents around 1 mA. Integration times above 10 s enable the sub-picoampere resolution required for precise beam alignment and background monitoring. The chip integrates two independent channels, consumes 25 mW from a 1.2 V supply, and relies on radiation-hardening techniques such as triple-redundant digital logic, custom ESD protections, and manual enclosed layout for critical analog transistors. Post-irradiation measurements up to 100 Mrad show no significant performance degradation, and the uncalibrated integral nonlinearity remains within [+4, -5] LSBs over the 1 mA to 5 µA range. The converter's flexibility and radiation tolerance make it suitable not only for the HL-LHC beam loss monitoring upgrade but also for other precision current measurement applications in harsh environments.
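The figures quoted in this abstract can be related with a short calculation. The sketch assumes the amplitude convention for decibels (20 log10 of the current ratio), which the abstract does not state explicitly but which is consistent with nine decades falling below the 200 dB figure. Variable names are illustrative.

```python
import math

def dynamic_range_db(i_max, i_min):
    """Dynamic range of a current measurement, amplitude convention (20 log10)."""
    return 20.0 * math.log10(i_max / i_min)

# Nine decades, 1 mA down to 1 pA, correspond to about 180 dB.
nine_decades_db = dynamic_range_db(1e-3, 1e-12)

# A 200 dB range below 1 mA implies a floor of about 0.1 pA (sub-picoampere).
i_min_for_200_db = 1e-3 / 10 ** (200 / 20)

# Number of modulator samples available in the 10 microsecond window at 20 MHz.
samples_in_fast_window = round(20e6 * 10e-6)
```

Under this convention, the 200 dB figure corresponds to ten decades of current, in line with the sub-picoampere resolution the abstract attributes to long integration times.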
Secure Semantic Communications via AI Defenses: Fundamentals, Solutions, and Future Directions
Semantic communication (SemCom) redefines wireless communication from reproducing symbols to transmitting task-relevant semantics. However, this AI-native architecture also introduces new vulnerabilities, as semantic failures may arise from adversarial perturbations to models, corrupted training data, desynchronized priors, or misaligned inference even when lower-layer transmission reliability and cryptographic protection remain intact. This survey provides a defense-centered and system-oriented synthesis of security in SemCom via AI defense. We analyze AI-centric threat models by consolidating existing studies and organizing attack surfaces across model-level, channel-realizable, knowledge-based, and networked inference vectors. Building on this foundation, we present a structured taxonomy of defense strategies organized by where semantic integrity can be compromised in SemCom systems despite correct symbol delivery, spanning semantic encoding, wireless transmission, knowledge integrity, and coordination among multiple agents. These categories correspond to distinct security failure modes, including representation fragility, channel-realizable manipulation, semantic prior poisoning or desynchronization, and adversarial propagation through distributed inference. We also examine security utility operating envelopes that capture tradeoffs among semantic fidelity, robustness, latency, and energy under realistic constraints, survey evaluation frameworks and representative applications, and identify open challenges in cross-layer composition and deployment-time certification. Overall, this survey offers a unified system-level perspective that enables readers to understand major threat and defense mechanisms in AI-native SemCom systems and to leverage emerging security techniques in the design and deployment of robust SemCom architectures for next-generation intelligent networks.
Dispatch-Aware Deep Neural Network for Optimal Transmission Switching
Optimal transmission switching (OTS) improves optimal power flow (OPF) by selectively opening transmission lines, but its mixed-integer formulation increases computational complexity, especially on large grids. To address this, we propose a dispatch-aware deep neural network (DA-DNN) that accelerates DC-OTS without relying on pre-solved labels, eliminating costly OTS label generation that becomes impractical at scale. DA-DNN predicts line states and passes them through an embedded differentiable DC-OPF layer, using the resulting generation cost as the loss function so that physical network constraints are enforced throughout training and inference. To stabilize training, we adopt a customized weight and bias initialization that keeps the embedded DC-OPF feasible from the first epoch. To improve inference robustness, we incorporate a binary regularization term that reduces ambiguity in the relaxed line-status outputs prior to thresholding. Once trained, DA-DNN produces a feasible topology and dispatch pair with highly predictable computation time comparable to a single DC-OPF solve, while conventional MIP solvers can become intractable. Moreover, the embedded OPF layer enables DA-DNN to generalize to untrained system configurations, such as changes in line flow limits, and to support post-contingency corrective operation. As a result, the proposed method captures the economic advantages of OTS while maintaining scalability and generalization ability.
comment: 10 pages, 6 figures
Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration
Autonomous planetary exploration demands real-time, high-fidelity environmental perception. Standard deep learning models, however, require far more memory and compute than space-qualified, radiation-hardened, power-optimized hardware can provide. This limitation creates a severe design bottleneck. Engineers struggle to deploy sophisticated detection architectures without exceeding the strict power and memory limits of the onboard computers on planetary exploration platforms. In this foundational concept paper, we propose the Adaptive Quantized Planetary Crater Detection System (AQ-PCDSys) to resolve this bottleneck. We present an architectural blueprint integrating a Quantized Neural Network (QNN), refined through Quantization Aware Training (QAT), with an Adaptive Multi-Sensor Fusion (AMF) module and Multi-Scale Detection Heads. By forcing weights into low-precision integer arithmetic during the training and optimization phase, our framework strips away the floating-point overhead that typically overwhelms onboard processors. The AMF module directly addresses sensor fragility. It dynamically selects and fuses Optical Imagery (OI) and Digital Elevation Models (DEMs) at the feature level to provide reliable sensor inputs during extreme cross-illuminations and sudden sensor dropouts. As a concept paper, this work establishes the technical and mathematical justifications for the architecture rather than presenting completed empirical ablation studies. We outline a rigorous Hardware-in-the-Loop (HITL) evaluation protocol for immediate future validation, paving the way for next-generation, hardware-aware space-mission software.
comment: 10 pages, 6 figures. A research paper on a novel deep learning framework for planetary crater detection
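The low-precision weight constraint described in this entry is commonly simulated during training with a "fake quantization" step: weights are rounded to an integer grid and mapped back to floating point, so the network learns under quantization error. The sketch below shows generic symmetric per-tensor quantization. It is not taken from the paper, and the function name, bit width, and sample weights are assumptions.

```python
def fake_quantize(weights, num_bits=8):
    """Round weights to a symmetric signed-integer grid, then dequantize.

    Returns the dequantized weights and the scale (float value of one
    integer step). Generic per-tensor scheme, for illustration only.
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 1.0           # all-zero tensor: nothing to quantize
    scale = max_abs / qmax
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized], scale

weights = [0.31, -1.27, 0.004, 0.9]         # hypothetical layer weights
dequantized, scale = fake_quantize(weights)
```

In typical QAT setups the rounding is treated as the identity in the backward pass (a straight-through estimator), so gradients still reach the underlying floating-point weights. Whether AQ-PCDSys uses this particular scheme is not stated in the abstract.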
Constrained Stabilization on the n-Sphere with Conic and Star-shaped Constraints
The problem of constrained stabilization on the n-sphere under star-shaped constraints is considered. We propose a control strategy that almost globally steers the state to a desired location while avoiding star-shaped constraints on the n-sphere. Depending on the state's proximity to the unsafe regions, the state is either guided towards the target location along the geodesic connecting the target to the state or steered towards the antipode of a predefined point lying in the interior of the nearest unsafe region. We prove that the target location is almost globally asymptotically stable under the proposed continuous, time-invariant feedback control law. Nontrivial simulation results on the 2-sphere and the 3-sphere demonstrate the effectiveness of the theoretical results.
comment: 18 pages, 12 figures
A System-of-Systems Convergence Paradigm for Societal Challenges of the Anthropocene
Modern societal challenges, such as climate change, urbanization, and water resource management, demand integrated, multi-discipline, multi-problem approaches to frame and address their complexity. Unfortunately, current methodologies often operate within disciplinary silos, leading to fragmented insights and missed opportunities for convergence. A critical barrier to cross-disciplinary integration lies in the disparate ontologies that shape how different fields conceptualize and communicate knowledge. To address these limitations, this paper proposes a system-of-systems (SoS) convergence paradigm grounded in a meta-cognition map, a framework that integrates five complementary domains: real-world observations, systems thinking, visual modeling, mathematics, and computing. The paradigm is based on the Systems Modeling Language (SysML), offering a standardized, domain-neutral approach for representing and analyzing complex systems. The proposed methodology is demonstrated through a case study of the Chesapeake Bay Watershed, a socio-environmental system requiring coordination across land use, hydrology, economic and policy domains. By modeling this system with SysML, the study illustrates practical strategies for navigating interdisciplinary challenges and highlights the potential of agile SoS modeling to support large-scale, multi-dimensional decision-making.
Design and Experimental Validation of Sensorless 4-Channel Bilateral Teleoperation for Low-Cost Manipulators
Teleoperation of low-cost manipulators is attracting increasing attention as a practical means of collecting demonstration data for imitation learning. However, most existing systems rely on unilateral control without force feedback, which limits performance in fast or contact-rich operations under severe sensing and bandwidth constraints. This paper demonstrates that practical high-speed bilateral teleoperation with force feedback is achievable on force-sensorless, low-cost manipulators by employing a sensorless 4-channel bilateral control framework. The proposed method integrates nonlinear dynamics compensation with a disturbance-observer-based velocity and external force estimation scheme, enabling stable position-force interaction while avoiding the performance degradation caused by phase-lagged velocity estimation commonly used in low-cost systems. By interpreting the observer structure in the frequency domain, we clarify the intrinsic coupling between velocity and external force estimation bandwidths and show that the observer tuning freedom can be reduced to a single cutoff frequency, providing practical, hardware-oriented parameter tuning guidelines for low-cost implementations. Real-robot experiments demonstrate stable and accurate teleoperation in high-speed and contact-rich scenarios. Furthermore, as an application, we show that incorporating force information in demonstrations collected with the proposed system significantly improves the success rate of imitation learning across multiple manipulation tasks.
comment: 16 pages, 9 figures, Submitted to IEEE Access
Robotics
How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.
comment: Project page can be found at https://toruowo.github.io/peel
ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
comment: Project Page: https://ultra-humanoid.github.io/
Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping ICLR
The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (<=10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.
comment: International Conference on Learning Representations (ICLR), 2026. Project website and code: https://tether-research.github.io
HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations
We present Whole-Body Mobile Manipulation Interface (HoMMI), a data collection and policy learning framework that learns whole-body mobile manipulation directly from robot-free human demonstrations. We augment UMI interfaces with egocentric sensing to capture the global context required for mobile manipulation, enabling portable, robot-free, and scalable data collection. However, naively incorporating egocentric sensing introduces a larger human-to-robot embodiment gap in both observation and action spaces, making policy transfer difficult. We explicitly bridge this gap with a cross-embodiment hand-eye policy design that includes an embodiment-agnostic visual representation, a relaxed head action representation, and a whole-body controller that realizes hand-eye trajectories through coordinated whole-body motion under robot-specific physical constraints. Together, these enable long-horizon mobile manipulation tasks requiring bimanual and whole-body coordination, navigation, and active perception. Results are best viewed on: https://hommi-robot.github.io
ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments
Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, training a unified embodied-brain model over diverse embodiments frequently runs into long-tail data, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain-specific proficiency. In this report, we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer. Building on this insight, we propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Furthermore, we adopt Group Relative Policy Optimization (GRPO) to strengthen the model's comprehensive capability. Extensive experiments demonstrate that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.
comment: Code: https://github.com/ACE-BRAIN-Team/ACE-Brain-0 Hugging Face: https://huggingface.co/ACE-Brain/ACE-Brain-0-8B
Chain of World: World Model Thinking in Latent Motion CVPR2026
Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.
comment: Accepted by CVPR2026. Project page: https://fx-hit.github.io/cowvla-io/
Robotic Grasping and Placement Controlled by EEG-Based Hybrid Visual and Motor Imagery
We present a framework that integrates EEG-based visual and motor imagery (VI/MI) with robotic control to enable real-time, intention-driven grasping and placement. Motivated by the promise of BCI-driven robotics to enhance human-robot interaction, this system bridges neural signals with physical control by deploying offline-pretrained decoders in a zero-shot manner within an online streaming pipeline. This establishes a dual-channel intent interface that translates visual intent into robotic actions, with VI identifying objects for grasping and MI determining placement poses, enabling intuitive control over both what to grasp and where to place. The system operates solely on EEG via a cue-free imagery protocol, achieving integration and online validation. Implemented on a Base robotic platform and evaluated across diverse scenarios, including occluded targets or varying participant postures, the system achieves online decoding accuracies of 40.23% (VI) and 62.59% (MI), with an end-to-end task success rate of 20.88%. These results demonstrate that high-level visual cognition can be decoded in real time and translated into executable robot commands, bridging the gap between neural signals and physical interaction, and validating the flexibility of a purely imagery-based BCI paradigm for practical human-robot collaboration.
From Language to Action: Can LLM-Based Agents Be Used for Embodied Robot Cognition? ICRA
In order to flexibly act in an everyday environment, a robotic agent needs a variety of cognitive capabilities that enable it to reason about plans and perform execution recovery. Large language models (LLMs) have been shown to demonstrate emergent cognitive aspects, such as reasoning and language understanding; however, the ability to control embodied robotic agents requires reliably bridging high-level language to low-level functionalities for perception and control. In this paper, we investigate the extent to which an LLM can serve as a core component for planning and execution reasoning in a cognitive robot architecture. For this purpose, we propose a cognitive architecture in which an agentic LLM serves as the core component for planning and reasoning, while components for working and episodic memories support learning from experience and adaptation. An instance of the architecture is then used to control a mobile manipulator in a simulated household environment, where environment interaction is done through a set of high-level tools for perception, reasoning, navigation, grasping, and placement, all of which are made available to the LLM-based agent. We evaluate our proposed system on two household tasks (object placement and object swapping), which evaluate the agent's reasoning, planning, and memory utilisation. The results demonstrate that the LLM-driven agent can complete structured tasks and exhibits emergent adaptation and memory-guided planning, but also reveal significant limitations, such as hallucinations about the task success and poor instruction following by refusing to acknowledge and complete sequential tasks. These findings highlight both the potential and challenges of employing LLMs as embodied cognitive controllers for autonomous robots.
comment: Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA)
Look Forward to Walk Backward: Efficient Terrain Memory for Backward Locomotion with Forward Vision ICRA
Legged robots with egocentric forward-facing depth cameras can couple exteroception and proprioception to achieve robust forward agility on complex terrain. When these robots walk backward, the forward-only field of view provides no preview. Purely proprioceptive controllers can remain stable on moderate ground when moving backward but cannot fully exploit the robot's capabilities on complex terrain and must collide with obstacles. We present Look Forward to Walk Backward (LF2WB), an efficient terrain-memory locomotion framework that uses forward egocentric depth and proprioception to write a compact associative memory during forward motion and to retrieve it for collision-free backward locomotion without rearward vision. The memory backbone employs a delta-rule selective update that softly removes then writes the memory state along the active subspace. Training uses hardware-efficient parallel computation, and deployment runs recurrent, constant-time per-step inference with a constant-size state, making the approach suitable for onboard processors on low-cost robots. Experiments in both simulations and real-world scenarios demonstrate the effectiveness of our method, improving backward agility across complex terrains under limited sensing.
comment: Accepted for 2026 IEEE International Conference on Robotics and Automation (ICRA)
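The delta-rule memory update mentioned in the LF2WB abstract can be illustrated with a minimal sketch. This is the textbook delta rule for an associative memory, not the authors' implementation: the function names, the plain-list matrix layout, and the scalar write strength `beta` are assumptions made for illustration only.

```python
def read(S, q):
    """Retrieve the value associated with query q from memory matrix S."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


def delta_rule_update(S, k, v, beta):
    """One delta-rule write: S <- S - beta * (S k - v) k^T.

    S: memory matrix as a list of rows (value_dim x key_dim).
    k: key vector, assumed to have unit norm.
    v: value vector to store under k.
    beta: write strength in [0, 1] (hypothetical gate; 1 fully overwrites).
    """
    old = read(S, k)  # what the memory currently returns for this key
    # Remove the old content along k and write the new value in its place.
    return [
        [S[i][j] + beta * (v[i] - old[i]) * k[j] for j in range(len(k))]
        for i in range(len(S))
    ]


memory = [[0.0, 0.0], [0.0, 0.0]]
memory = delta_rule_update(memory, [1.0, 0.0], [3.0, 4.0], beta=1.0)
memory = delta_rule_update(memory, [1.0, 0.0], [5.0, 6.0], beta=1.0)  # overwrite
memory = delta_rule_update(memory, [0.0, 1.0], [1.0, 1.0], beta=1.0)  # other key
```

Because the state is a fixed-size matrix, each write and each read costs the same regardless of how many steps came before, which matches the constant-time, constant-size inference the abstract describes.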
RL-Based Coverage Path Planning for Deformable Objects on 3D Surfaces ICRA
Currently, manipulation tasks for deformable objects often focus on activities like folding clothes, handling ropes, and manipulating bags. However, research on contact-rich tasks involving deformable objects remains relatively underdeveloped. When humans use cloth or sponges to wipe surfaces, they rely on both vision and tactile feedback. Yet, current algorithms still face challenges with issues like occlusion, while research on tactile perception for manipulation is still evolving. Tasks such as covering surfaces with deformable objects demand not only perception but also precise robotic manipulation. To address this, we propose a method that leverages efficient and accessible simulators for task execution. Specifically, we train a reinforcement learning agent in a simulator to manipulate deformable objects for surface wiping tasks. We simplify the state representation of object surfaces using harmonic UV mapping, process contact feedback from the simulator on 2D feature maps, and use scaled grouped convolutions (SGCNN) to extract features efficiently. The agent then outputs actions in a reduced-dimensional action space to generate coverage paths. Experiments demonstrate that our method outperforms previous approaches in key metrics, including total path length and coverage area. We deploy these paths on a Kinova Gen3 manipulator to perform wiping experiments on the back of a torso model, validating the feasibility of our approach.
comment: 8 pages, 8 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA)
CMoE: Contrastive Mixture of Experts for Motion Control and Terrain Adaptation of Humanoid Robots
For effective deployment in real-world environments, humanoid robots must autonomously navigate a diverse range of complex terrains with abrupt transitions. While the vanilla mixture of experts (MoE) framework is theoretically capable of modeling diverse terrain features, in practice the gating network exhibits nearly uniform expert activations across different terrains, weakening expert specialization and limiting the model's expressive power. To address this limitation, we introduce CMoE, a novel single-stage reinforcement learning framework that integrates contrastive learning to refine expert activation distributions. By imposing contrastive constraints, CMoE maximizes the consistency of expert activations within the same terrain while minimizing their similarity across different terrains, thereby encouraging experts to specialize in distinct terrain types. We validated our approach on the Unitree G1 humanoid robot through a series of challenging experiments. Results demonstrate that CMoE enables the robot to traverse continuous steps up to 20 cm high and gaps up to 80 cm wide, while achieving robust and natural gait across diverse mixed terrains, surpassing the limits of existing methods. To support further research and foster community development, we release our code publicly.
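As a rough illustration of a contrastive constraint on expert activations, the sketch below uses a supervised InfoNCE-style loss over gating vectors, with terrain labels defining positives and negatives. The abstract does not specify the loss, so the cosine similarity, the temperature, and all names here are assumptions rather than the CMoE formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_gate_loss(gates, terrains, temperature=0.1):
    """InfoNCE-style loss: same-terrain gate vectors are pulled together,
    different-terrain gate vectors are pushed apart (illustrative only)."""
    total, count = 0.0, 0
    n = len(gates)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if terrains[j] == terrains[i]]
        if not positives:
            continue  # no same-terrain sample to contrast against
        logits = {j: cosine(gates[i], gates[j]) / temperature for j in others}
        log_denom = math.log(sum(math.exp(l) for l in logits.values()))
        for j in positives:
            total += log_denom - logits[j]
            count += 1
    return total / count if count else 0.0


terrains = ["stairs", "stairs", "gap", "gap"]
uniform = [[0.5, 0.5]] * 4                      # gate ignores terrain
specialized = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
loss_uniform = contrastive_gate_loss(uniform, terrains)
loss_specialized = contrastive_gate_loss(specialized, terrains)
```

Nearly uniform activations, the failure mode the abstract attributes to the vanilla MoE, give a high loss, while terrain-specific activations drive it toward zero.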
Architectural HRI: Towards a Robotic Paradigm Shift in Human-Building Interaction
Recent advances in sensing, communication, interfaces, control, and robotics are expanding Human-Building Interaction (HBI) beyond adaptive building services and facades toward the physical actuation of architectural space. In parallel, research in robotic furniture, swarm robotics, and shape-changing spaces shows that architectural elements can now be robotically augmented to move, reconfigure, and adapt space. We propose that these advances promise a paradigm shift in HBI, in which multiple building layers physically adapt in synchrony to support occupant needs and sustainability goals more holistically. Conversely, we argue that this emerging paradigm also provides an ideal case for transferring HRI knowledge to unconventional robotic morphologies, including the interpretation of the robot as multiple architectural layers or even as a building. However, this research agenda remains challenged by the temporal, spatial, and social complexity of architectural HRI, and by fragmented knowledge across HCI, environmental psychology, cognitive science, and architecture. We therefore call for interdisciplinary research that unifies the why, what, and how of robotic actuation in architectural forms.
MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN
Vision-Language Navigation (VLN) aims to empower robots with the ability to perform long-horizon navigation in unfamiliar environments based on complex linguistic instructions. Its success critically hinges on establishing an efficient "language-understanding - visual-perception - embodied-execution" closed loop. Existing methods often suffer from perceptual distortion and decision drift in complex, long-distance tasks due to the cognitive overload of a single agent. Inspired by distributed cognition theory, this paper proposes MA-CoNav, a Multi-Agent Collaborative Navigation framework. This framework adopts a "Master-Slave" hierarchical agent collaboration architecture, decoupling and distributing the perception, planning, execution, and memory functions required for navigation tasks to specialized agents. Specifically, the Master Agent is responsible for global orchestration, while the Subordinate Agent group collaborates through a clear division of labor: an Observation Agent generates environment descriptions, a Planning Agent performs task decomposition and dynamic verification, an Execution Agent handles simultaneous mapping and action, and a Memory Agent manages structured experiences. Furthermore, the framework introduces a "Local-Global" dual-stage reflection mechanism to dynamically optimize the entire navigation pipeline. Empirical experiments were conducted using a real-world indoor dataset collected by a Limo Pro robot, with no scene-specific fine-tuning performed on the models throughout the process. The results demonstrate that MA-CoNav comprehensively outperforms existing mainstream VLN methods across multiple metrics.
CASSR: Continuous A-Star Search through Reachability for real time footstep planning
Footstep planning involves a challenging combinatorial search. Traditional A* approaches require discretising reachability constraints, while Mixed-Integer Programming (MIP) supports continuous formulations but quickly becomes intractable, especially when rotations are included. We present CASSR, a novel framework that recursively propagates convex, continuous formulations of a robot's kinematic constraints within an A* search. Combined with a new cost-to-go heuristic based on the EPA algorithm, CASSR efficiently plans contact sequences of up to 30 footsteps in under 125 ms. Experiments on biped locomotion tasks demonstrate that CASSR outperforms traditional discretised A* by up to a factor of 100, while also surpassing a commercial MIP solver. These results show that CASSR enables fast, reliable, and real-time footstep planning for biped robots.
DreamFlow: Local Navigation Beyond Observation via Conditional Flow Matching in the Latent Space
Local navigation in cluttered environments often suffers from dense obstacles and frequent local minima. Conventional local planners rely on heuristics and are prone to failure, while deep reinforcement learning (DRL)-based approaches provide adaptability but are constrained by limited onboard sensing. These limitations lead to navigation failures because the robot cannot perceive structures outside its field of view. In this paper, we propose DreamFlow, a DRL-based local navigation framework that extends the robot's perceptual horizon through conditional flow matching (CFM). The proposed CFM-based prediction module learns a probabilistic mapping between the latent representation of the local height map and a broader spatial representation, conditioned on navigation context. This enables the navigation policy to predict unobserved environmental features and proactively avoid potential local minima. Experimental results demonstrate that DreamFlow outperforms existing methods in terms of latent prediction accuracy and navigation performance in simulation. The proposed method was further validated in cluttered real-world environments with a quadrupedal robot. The project page is available at https://dreamflow-icra.github.io.
TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamentally clash with the dynamic, embodied, and spatially-structured nature of navigation. Existing large-model-based methods often resort to converting rich visual and spatial information into text, forcing models to implicitly infer complex visual-topological relationships or limiting their global action capabilities. To bridge this gap, we propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To introduce topological edge information, Spatial Topology Aware Residual Attention (STAR-Att) directly integrates it into the VLM's self-attention mechanism, enabling intrinsic spatial reasoning while preserving pretrained knowledge. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. Finally, with the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction. On the R2R benchmark, TagaVLM achieves state-of-the-art performance among large-model-based methods, with a Success Rate (SR) of 51.09% and SPL of 47.18 in unseen environments, outperforming prior work by 3.39% in SR and 9.08 in SPL. This demonstrates that, for embodied spatial reasoning, targeted enhancements on smaller open-source VLMs can be more effective than brute-force model scaling. The code will be released upon publication. Project page: https://apex-bjut.github.io/Taga-VLM
Self-supervised Domain Adaptation for Visual 3D Pose Estimation of Nano-drone Racing Gates by Enforcing Geometric Consistency ICRA 2026
We consider the task of visually estimating the relative pose of a drone racing gate in front of a nano-quadrotor, using a convolutional neural network pre-trained on simulated data to regress the gate's pose. Due to the sim-to-real gap, the pre-trained model underperforms in the real world and must be adapted to the target domain. We propose an unsupervised domain adaptation (UDA) approach using only real image sequences collected by the drone flying an arbitrary trajectory in front of a gate; sequences are annotated in a self-supervised fashion with the drone's odometry as measured by its onboard sensors. On this dataset, a state consistency loss enforces that two images acquired at different times yield pose predictions that are consistent with the drone's odometry. Results indicate that our approach outperforms other SoA UDA approaches, has a low mean absolute error in position (x=26, y=28, z=10 cm) and orientation ($\psi = 13^{\circ}$), an improvement of 40% in position and 37% in orientation over a baseline. The approach's effectiveness is appreciable with as few as 10 minutes of real-world flight data and yields models with an inference time of 30.4ms (33 fps) when deployed aboard the Crazyflie 2.1 Brushless nano-drone.
comment: Accepted at ICRA 2026
Tracing Back Error Sources to Explain and Mitigate Pose Estimation Failures
Robust estimation of object poses in robotic manipulation is often addressed using general-purpose foundation estimators that aim to handle diverse error sources naively within a single model. Still, they struggle with environmental uncertainties while requiring long inference times and heavy computation. In contrast, we propose a modular, uncertainty-aware framework that attributes pose estimation errors to specific error sources and applies targeted mitigation strategies only when necessary. Instantiated with Iterative Closest Point (ICP) as a simple and lightweight pose estimator, we leverage our framework for real-world robotic grasping tasks. By decomposing pose estimation into failure detection, error attribution, and targeted recovery, we significantly improve the robustness of ICP and achieve competitive performance compared to foundation models, while relying on a substantially simpler and faster pose estimator.
Emerging trends in Cislunar Space for Lunar Science Exploration and Space Robotics aiding Human Spaceflight Safety SP
In recent years, the Moon has emerged as an unparalleled extraterrestrial testbed for advancing cutting-edge technological and scientific research critical to enabling sustained human presence on its surface and supporting future interplanetary exploration. This study identifies and investigates two pivotal research domains with substantial transformative potential for accelerating humanity's interplanetary aspirations. The first is Lunar Science Exploration with Artificial Intelligence and Space Robotics, which focuses on how AI and space robotics are redefining the frontiers of space exploration. The second is Space Robotics in aid of manned spaceflight to the Moon, where robots serve as critical assets for pre-deployment infrastructure development, In-Situ Resource Utilization, surface operations support, and astronaut safety assurance. By integrating autonomy, machine learning, and real-time sensor fusion, space robotics not only augments human capabilities but also serves as a force multiplier in achieving sustainable lunar exploration, paving the way for future crewed missions to Mars and beyond.
comment: Conference Proceedings of 2nd IAA Conference on AI in and for Space (2nd IAA SPAICE), Suzhou, China, 1-3 November, 2025
Rhythm: Learning Interactive Whole-Body Control for Dual Humanoids
Realizing interactive whole-body control for multi-humanoid systems is critical for unlocking complex collaborative capabilities in shared environments. Although recent advancements have significantly enhanced the agility of individual robots, bridging the gap to physically coupled multi-humanoid interaction remains challenging, primarily due to severe kinematic mismatches and complex contact dynamics. To address this, we introduce Rhythm, the first unified framework enabling real-world deployment of dual-humanoid systems for complex, physically plausible interactions. Our framework integrates three core components: (1) an Interaction-Aware Motion Retargeting (IAMR) module that generates feasible humanoid interaction references from human data; (2) an Interaction-Guided Reinforcement Learning (IGRL) policy that masters coupled dynamics via graph-based rewards; and (3) a real-world deployment system that enables robust transfer of dual-humanoid interaction. Extensive experiments on physical Unitree G1 robots demonstrate that our framework achieves robust interactive whole-body control, successfully transferring diverse behaviors such as hugging and dancing from simulation to reality.
CoFL: Continuous Flow Fields for Language-Conditioned Navigation
Language-conditioned navigation pipelines often rely on brittle modular components or costly action-sequence generation. To address these limitations, we present CoFL, an end-to-end policy that directly maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. Instead of predicting discrete action tokens or sampling action chunks via iterative denoising, CoFL outputs instantaneous velocities that can be queried at arbitrary 2D projected locations. Trajectories are obtained by numerical integration of the predicted field, producing smooth motion that remains reactive under closed-loop execution. To enable large-scale training, we build a dataset of over 500k BEV image-instruction pairs, each procedurally annotated with a flow field and a trajectory derived from BEV semantic maps built on Matterport3D and ScanNet. By training on a mixed distribution, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes. Finally, we deploy CoFL zero-shot in real-world experiments with overhead BEV observations across multiple layouts, maintaining reliable closed-loop control and a high success rate.
comment: 20 pages, 11 figures
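The step from a predicted flow field to a trajectory, as described in the CoFL abstract, can be sketched with forward-Euler integration. The abstract only says "numerical integration", so the integrator choice, the step size, and the hand-written `toy_field` standing in for the learned policy are all assumptions.

```python
def toy_field(x, y, goal=(2.0, 1.0)):
    """Stand-in for the learned field: velocity pointing at a fixed goal.
    A real policy would condition on the BEV image and the instruction."""
    return goal[0] - x, goal[1] - y


def integrate_flow(field, start, dt=0.1, steps=50):
    """Roll out a 2D trajectory by forward-Euler integration of `field`."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        vx, vy = field(x, y)              # query velocity at current location
        x, y = x + dt * vx, y + dt * vy   # one Euler step
        path.append((x, y))
    return path


path = integrate_flow(toy_field, start=(0.0, 0.0))
```

Since the field can be queried at any location, a closed-loop controller can re-query it from the robot's measured position at every step instead of following a precomputed path.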
Design, Modeling and Direction Control of a Wire-Driven Robotic Fish Based on a 2-DoF Crank-Slider Mechanism ICRA 2026
Robotic fish have attracted growing attention in recent years owing to their biomimetic design and potential applications in environmental monitoring and biological surveys. Among robotic fish employing the Body-Caudal Fin (BCF) locomotion pattern, motor-driven actuation is widely adopted. Some approaches utilize multiple servo motors to achieve precise body curvature control, while others employ a brushless motor to drive the tail via wire or rod, enabling higher oscillation and swimming speeds. However, the former approaches typically result in limited swimming speed, whereas the latter suffer from poor maneuverability, with few capable of smooth turning. To address this trade-off, we develop a wire-driven robotic fish equipped with a 2-degree-of-freedom (DoF) crank-slider mechanism that decouples propulsion from steering, enabling both high swimming speed and agile maneuvering. In this paper, we first present the design of the robotic fish, including the elastic skeleton, waterproof structure, and the actuation mechanism that realizes the decoupling. We then establish the actuation modeling and body dynamics to analyze the locomotion behavior. Furthermore, we propose a combined feedforward-feedback control strategy to achieve independent regulation of propulsion and steering. Finally, we validate the feasibility of the design, modeling, and control through a series of prototype experiments, demonstrating swimming, turning, and directional control.
comment: Accepted by ICRA 2026
SPARC: Spatial-Aware Path Planning via Attentive Robot Communication
Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU-gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves an approximately 75 percent success rate at 30 percent obstacle density, outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms: Multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making
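One simple way to fold pairwise Manhattan distance into attention weights is sketched below, together with a distance-constrained mask. The abstract does not give the exact encoding, so the linear distance penalty, the `max_dist` cutoff, and all names are assumptions, not the RMHA definition.

```python
import math


def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_aware_weights(scores, positions, me, max_dist=5, penalty=0.5):
    """Softmax over neighbour scores with a distance bias and a distance mask.

    scores: raw attention scores from robot `me` to every robot.
    positions: grid cell of every robot.
    Robots farther than `max_dist` (and `me` itself) get zero weight.
    """
    logits = []
    for j, score in enumerate(scores):
        d = manhattan(positions[me], positions[j])
        if j == me or d > max_dist:
            logits.append(None)                 # masked out
        else:
            logits.append(score - penalty * d)  # closer robots score higher
    valid = [l for l in logits if l is not None]
    if not valid:
        return [0.0] * len(scores)
    top = max(valid)  # subtract the max for numerical stability
    exps = [0.0 if l is None else math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


positions = [(0, 0), (1, 0), (3, 1), (9, 9)]
weights = distance_aware_weights([0.0, 0.0, 0.0, 0.0], positions, me=0)
```

With equal raw scores, the nearest neighbour receives the largest weight and the robot outside the mask radius is ignored entirely.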
Generative adversarial imitation learning for robot swarms: Learning from human demonstrations and trained policies ICRA 2026
In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most of the work in imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform similarly well as the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserved their visually recognizable character and their performance is comparable to the one achieved in simulation.
comment: Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Agentic Self-Evolutionary Replanning for Embodied Navigation
Failure is inevitable for embodied navigation in complex environments. To enhance resilience, replanning (RP) is a viable option, where the robot is allowed to fail but can adjust its plan until it succeeds. However, existing RP approaches freeze the ego action model and miss opportunities to explore better plans by upgrading the robot itself. To address this limitation, we propose Self-Evolutionary RePlanning, or SERP for short, which leads to a paradigm shift from frozen models towards evolving models by run-time learning from recent experiences. In contrast to existing model evolution approaches that often get stuck at predefined static parameters, we introduce an agentic self-evolving action model that uses in-context learning with auto-differentiation (ILAD) for adaptive function adjustment and global parameter reset. To achieve token-efficient replanning for SERP, we also propose graph chain-of-thought (GCOT) replanning with large language model (LLM) inference over distilled graphs. Extensive simulation and real-world experiments demonstrate that SERP achieves a higher success rate with lower token expenditure across various benchmarks, validating its superior robustness and efficiency across diverse environments.
comment: 8 pages, 10 figures, 4 tables, submitted to IEEE for possible publication
Robust Tightly-Coupled Filter-Based Monocular Visual-Inertial State Estimation and Graph-Based Evaluation for Autonomous Drone Racing
Autonomous drone racing (ADR) demands state estimation that is simultaneously computationally efficient and resilient to the perceptual degradation experienced during extreme velocity and maneuvers. Traditional frameworks typically rely on conventional visual-inertial pipelines with loosely-coupled gate-based Perspective-n-Points (PnP) corrections that suffer from a rigid requirement for four visible features and information loss in intermediate steps. Furthermore, the absence of GNSS and Motion Capture systems in uninstrumented, competitive racing environments makes the objective evaluation of such systems remarkably difficult. To address these limitations, we propose ADR-VINS, a robust, monocular visual-inertial state estimation framework based on an Error-State Kalman Filter (ESKF) tailored for autonomous drone racing. Our approach integrates direct pixel reprojection errors from gate corner features as innovation terms within the filter. By bypassing intermediate PnP solvers, ADR-VINS maintains valid state updates with as few as two visible corners and utilizes robust reweighting instead of RANSAC-based schemes to handle outliers, enhancing computational efficiency. Furthermore, we introduce ADR-FGO, an offline Factor-Graph Optimization framework to generate high-fidelity reference trajectories that facilitate post-flight performance evaluation and analysis in uninstrumented, GNSS-denied environments. The proposed system is validated using the TII-RATM dataset, where ADR-VINS achieves an average RMS translation error of 0.134 m, while ADR-FGO yields 0.060 m as a smoothing-based reference. Finally, ADR-VINS was successfully deployed in the A2RL Drone Championship Season 2, maintaining stable and robust estimation despite noisy detections during high-agility flight at top speeds of 20.9 m/s. We further utilize ADR-FGO for post-flight evaluation in uninstrumented racing environments.
comment: 8 pages, 9 figures
Retrieval-Augmented Robots via Retrieve-Reason-Act
To achieve general-purpose utility, we argue that robots must evolve from passive executors into active Information Retrieval users. In strictly zero-shot settings where no prior demonstrations exist, robots face a critical information gap, such as the exact sequence required to assemble a complex furniture kit, that cannot be satisfied by internal parametric knowledge (common sense) or past internal memory. While recent robotic works attempt to use search before action, they primarily focus on retrieving past kinematic trajectories (analogous to searching internal memory) or text-based safety rules (searching for constraints). These approaches fail to address the core information need of active task construction: acquiring unseen procedural knowledge from external, unstructured documentation. In this paper, we define the paradigm as Retrieval-Augmented Robotics (RAR), empowering the robot with the information-seeking capability that bridges the gap between visual documentation and physical actuation. We formulate the task execution as an iterative Retrieve-Reason-Act loop: the robot or embodied agent actively retrieves relevant visual procedural manuals from an unstructured corpus, grounds the abstract 2D diagrams to 3D physical parts via cross-modal alignment, and synthesizes executable plans. We validate this paradigm on a challenging long-horizon assembly benchmark. Our experiments demonstrate that grounding robotic planning in retrieved visual documents significantly outperforms baselines relying on zero-shot reasoning or few-shot example retrieval. This work establishes the basis of RAR, extending the scope of Information Retrieval from answering user queries to driving embodied physical actions.
MMH-Planner: Multi-Mode Hybrid Trajectory Planning Method for UAV Efficient Flight Based on Real-Time Spatial Awareness
Motion planning is a critical component of intelligent unmanned systems, enabling their complex autonomous operations. However, current planning algorithms still face limitations in planning efficiency due to inflexible strategies and weak adaptability. To address this, this paper proposes a multi-mode hybrid trajectory planning method for UAVs based on real-time environmental awareness, which dynamically selects the optimal planning model for high-quality trajectory generation in response to environmental changes. First, we introduce a goal-oriented spatial awareness method that rapidly assesses flight safety in the upcoming environments. Second, a multi-mode hybrid trajectory planning mechanism is proposed, which can enhance the planning efficiency by selecting the optimal planning model for trajectory generation based on prior spatial awareness. Finally, we design a lazy replanning strategy that triggers replanning only when necessary to reduce computational resource consumption while maintaining flight quality. To validate the performance of the proposed method, we conducted comprehensive comparative experiments in simulation environments. Results demonstrate that our approach outperforms existing state-of-the-art (SOTA) algorithms across multiple metrics, achieving the best performance particularly in terms of the average number of planning iterations and computational cost per iteration. Furthermore, the effectiveness of our approach is further verified through real-world flight experiments integrated with a self-developed intelligent UAV platform.
IMR-LLM: Industrial Multi-Robot Task Planning and Program Generation using Large Language Models
In modern industrial production, multiple robots often collaborate to complete complex manufacturing tasks. Large language models (LLMs), with their strong reasoning capabilities, have shown potential in coordinating robots for simple household and manipulation tasks. However, in industrial scenarios, stricter sequential constraints and more complex dependencies within tasks present new challenges for LLMs. To address this, we propose IMR-LLM, a novel LLM-driven Industrial Multi-Robot task planning and program generation framework. Specifically, we utilize LLMs to assist in constructing disjunctive graphs and employ deterministic solving methods to obtain a feasible and efficient high-level task plan. Based on this, we use a process tree to guide LLMs to generate executable low-level programs. Additionally, we create IMR-Bench, a challenging benchmark that encompasses multi-robot industrial tasks across three levels of complexity. Experimental results indicate that our method significantly surpasses existing methods across all evaluation metrics.
Watch Your Step: Learning Semantically-Guided Locomotion in Cluttered Environment IROS 2026
Although legged robots demonstrate impressive mobility on rough terrain, using them safely in cluttered environments remains a challenge. A key issue is their inability to avoid stepping on low-lying objects, such as high-cost small devices or cables on flat ground. This limitation arises from a disconnection between high-level semantic understanding and low-level control, combined with errors in elevation maps during real-world operation. To address this, we introduce SemLoco, a Reinforcement Learning (RL) framework designed to avoid obstacles precisely in densely cluttered environments. SemLoco uses a two-stage RL approach that combines both soft and hard constraints and performs pixel-wise foothold safety inference, enabling more accurate foot placement. Additionally, SemLoco integrates a semantic map to assign traversability costs rather than relying solely on geometric data. SemLoco significantly reduces collisions and improves safety around sensitive objects, enabling reliable navigation in situations where traditional controllers would likely cause damage. Experimental results further demonstrate that SemLoco can be effectively applied to more complex, unstructured real-world environments.
comment: Submitted to IROS 2026
Improving Diffusion Planners by Self-Supervised Action Gating with Energies
Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.
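The re-ranking step described above (an energy from latent prediction error, combined with a value estimate) can be sketched as follows. This is an illustrative sketch only: the callables `encode` and `predict` stand in for the trained JEPA encoder and the action-conditioned latent predictor, and the linear combination `value - lam * energy` with weight `lam` is an assumption, since the abstract does not state how the two scores are combined.

```python
import numpy as np

def latent_energy(encode, predict, states, actions):
    """Mean squared latent prediction error along one candidate plan.

    encode:  hypothetical state encoder, state -> latent vector.
    predict: hypothetical latent predictor, (latent, action) -> next latent.
    states:  (T + 1, ds) planned states; actions: (T, da) planned actions.
    """
    z = np.stack([encode(s) for s in states])
    z_pred = np.stack([predict(z[t], actions[t]) for t in range(len(actions))])
    # Compare each predicted latent with the encoding of the planned next state.
    return float(np.mean(np.sum((z_pred - z[1:]) ** 2, axis=-1)))

def select_plan(values, energies, lam=1.0):
    """Pick the candidate maximizing value minus a weighted energy penalty."""
    scores = np.asarray(values, dtype=float) - lam * np.asarray(energies, dtype=float)
    return int(np.argmax(scores))
```

With `lam = 0` the selection reduces to plain value-guided selection, so the penalty weight controls how strongly dynamically inconsistent candidates are demoted.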
Compositional Visual Planning via Inference-Time Diffusion Scaling
Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data. Existing compositional approaches stitch together short segments by separately denoising each component and averaging overlapping regions. However, this suffers from instability as the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a novel combination of synchronous and asynchronous message passing that operates on Tweedie estimates, producing globally consistent guidance without requiring additional training. Our training-free framework demonstrates significant improvements over existing baselines, effectively generalizing to unseen start-goal combinations that were not present in the original training data. Project website: https://comp-visual-planning.github.io/
cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization
Robust trajectory optimization enables autonomous systems to operate safely under uncertainty by computing control policies that satisfy the constraints for all bounded disturbances. However, these problems often lead to large Second Order Conic Programming (SOCP) constraints, which are computationally expensive. In this work, we propose the CUDA Nonlinear Robust Trajectory Optimization (cuNRTO) framework by introducing two dynamic optimization architectures that have direct application to robust decision-making and are implemented on CUDA. The first architecture, NRTO-DR, leverages the Douglas-Rachford (DR) splitting method to solve the SOCP inner subproblems of NRTO, thereby significantly reducing the computational burden through parallel SOCP projections and sparse direct solves. The second architecture, NRTO-FullADMM, is a novel variant that further exploits the problem structure to improve scalability using the Alternating Direction Method of Multipliers (ADMM). Finally, we provide a GPU implementation of the proposed methodologies using custom CUDA kernels for SOC projection steps and cuBLAS GEMM chains for feedback gain updates. We validate the performance of cuNRTO through simulated experiments on unicycle, quadcopter, and Franka manipulator models, demonstrating speedups of up to 139.6$\times$.
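The SOC projection step mentioned above has a well-known closed form, which is what makes it attractive to run in parallel across many cones. The sketch below is a plain NumPy version of the standard Euclidean projection onto the cone {(x, t) : ||x||_2 <= t}; it is not the paper's CUDA kernel, and the function name `project_soc` is chosen here for illustration.

```python
import numpy as np

def project_soc(x, t):
    """Euclidean projection of (x, t) onto the second-order cone ||x||_2 <= t."""
    x = np.asarray(x, dtype=float)
    nx = np.linalg.norm(x)
    if nx <= t:
        # Already inside the cone: the point is its own projection.
        return x.copy(), float(t)
    if nx <= -t:
        # Inside the polar cone: the projection is the origin.
        return np.zeros_like(x), 0.0
    # Otherwise project onto the cone boundary (nx > 0 is guaranteed here).
    alpha = 0.5 * (nx + t)
    return alpha * x / nx, float(alpha)
```

For example, projecting x = (3, 4) with t = 1 gives x = (1.8, 2.4) with t = 3, a point on the cone boundary. Because each cone is projected independently, a batch of such projections maps naturally onto one GPU thread or warp per cone.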
Uni-Skill: Building Self-Evolving Skill Repository for Generalizable Robotic Manipulation ICRA2026
While skill-centric approaches leverage foundation models to enhance generalization in compositional tasks, they often rely on fixed skill libraries, limiting adaptability to new tasks without manual intervention. To address this, we propose Uni-Skill, a Unified Skill-centric framework that supports skill-aware planning and facilitates automatic skill evolution. Unlike prior methods that restrict planning to predefined skills, Uni-Skill requests new skill implementations when existing ones are insufficient, ensuring adaptable planning with a self-augmented skill library. To support automatic implementation of diverse skills requested by the planning module, we construct SkillFolder, a VerbNet-inspired repository derived from large-scale unstructured robotic videos. SkillFolder introduces a hierarchical skill taxonomy that captures diverse skill descriptions at multiple levels of abstraction. By populating this taxonomy with large-scale, automatically annotated demonstrations, Uni-Skill shifts the paradigm of skill acquisition from inefficient manual annotation to efficient offline structural retrieval. Retrieved examples provide semantic supervision over behavior patterns and fine-grained references for spatial trajectories, enabling few-shot skill inference without deployment-time demonstrations. Comprehensive experiments in both simulation and real-world settings verify the state-of-the-art performance of Uni-Skill over existing VLM-based skill-centric approaches, highlighting its advanced reasoning capabilities and strong zero-shot generalization across a wide range of novel tasks.
comment: Accepted to ICRA2026
Real-Time Generative Policy via Langevin-Guided Flow Matching for Autonomous Driving
Reinforcement learning (RL) is a fundamental methodology in autonomous driving systems, where generative policies exhibit considerable potential by leveraging their ability to model complex distributions to enhance exploration. However, their inherent high inference latency severely impedes their deployment in real-time decision-making and control. To address this issue, we propose diffusion actor-critic with entropy regulator via flow matching (DACER-F) by introducing flow matching into online RL, enabling the generation of competitive actions in a single inference step. By leveraging Langevin dynamics and gradients of the Q-function, DACER-F dynamically optimizes actions from experience replay toward a target distribution that balances high Q-value information with exploratory behavior. The flow policy is then trained to efficiently learn a mapping from a simple prior distribution to this dynamic target. In complex multi-lane and intersection simulations, DACER-F outperforms the baselines diffusion actor-critic with entropy regulator (DACER) and distributional soft actor-critic (DSAC) while maintaining an ultra-low inference latency. DACER-F further demonstrates its scalability on the standard RL benchmark DeepMind Control Suite (DMC), achieving a score of 775.8 in the humanoid-stand task and surpassing prior methods. Collectively, these results establish DACER-F as a high-performance and computationally efficient RL algorithm.
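To make the Langevin step concrete, the sketch below applies the standard unadjusted Langevin update to a batch of actions, using the gradient of a Q-function as the drift toward a target density proportional to exp(Q / temperature). It is a generic illustration, not the DACER-F update rule: the function name `langevin_refine`, the callable `grad_q`, and the parameters `step`, `temperature`, and `noise_scale` are assumptions introduced here.

```python
import numpy as np

def langevin_refine(actions, grad_q, step=0.1, temperature=1.0,
                    n_steps=10, noise_scale=1.0, rng=None):
    """Move actions toward high-Q regions with unadjusted Langevin dynamics.

    actions:     (batch, da) array of actions, e.g. drawn from a replay buffer.
    grad_q:      hypothetical callable returning dQ/da with the same shape.
    temperature: larger values flatten the target and favor exploration.
    noise_scale: set to 0.0 to obtain plain gradient ascent on Q.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(actions, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(a.shape)
        # Drift along the Q-gradient plus Gaussian exploration noise.
        a = a + step * grad_q(a) / temperature \
              + noise_scale * np.sqrt(2.0 * step) * noise
    return a
```

In a flow-matching setup, the refined actions would serve as regression targets for the one-step policy; the noise term is what keeps those targets from collapsing onto a single Q-maximizing action.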
VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction
This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
Wukong-Omni: Design, Modeling and Control of a Multi-mode Robot for Air, Land, and Underwater Exploration with All-in-One Propulsion Unit
In flood disaster rescue scenarios, partially submerged buildings prevent aerial robots from accessing lower levels, limiting mission effectiveness. To address this challenge, this paper presents Wukong-Omni, a novel multimode robot capable of operating across land, air, and underwater using a unified propulsion system. The system is enabled by an innovative mechanical design that allows motor reuse and improves thrust generation. Efficiency and peak thrust are enhanced through simulation and tank-based optimization. Experimental results show a 100 percent improvement in propulsion efficiency and a 150 percent increase in maximum thrust compared with direct installation methods. Dynamic models for the three operating domains are developed, and a unified cross-domain control framework is proposed. Comprehensive experiments validate stable locomotion and smooth transition across domains. Outdoor experiments further demonstrate robustness and adaptability in real-world environments.
comment: 19 pages, 27 figures
Tensegrity Robot Endcap-Ground Contact Estimation with Symmetry-aware Heterogeneous Graph Neural Network
Tensegrity robots possess lightweight and resilient structures but present significant challenges for state estimation due to compliant and distributed ground contacts. This paper introduces a symmetry-aware heterogeneous graph neural network (Sym-HGNN) that infers contact states directly from proprioceptive measurements, including IMU and cable-length histories, without dedicated contact sensors. The network incorporates the robot's dihedral symmetry $D_3$ into the message-passing process to enhance sample efficiency and generalization. The predicted contacts are integrated into a state-of-the-art contact-aided invariant extended Kalman filter (InEKF) for improved pose estimation. Simulation results demonstrate that the proposed method achieves up to 15% higher accuracy and 5% higher F1-score using only 20% of the training data compared to the CNN and MI-HGNN baselines, while maintaining low-drift and physically consistent state estimation results comparable to ground truth contacts. This work highlights the potential of fully proprioceptive sensing for accurate and robust state estimation in tensegrity robots. Code available at: https://github.com/Jonathan-Twz/Tensegrity-Sym-HGNN
comment: Preprint; 7 pages, 5 figures, 3 tables
Give me scissors: Collision-Free Dual-Arm Surgical Assistive Robot for Instrument Delivery ICRA
During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks and enhance efficiency. Existing research on robotic scrub nurses relies on predefined paths for instrument delivery, which limits their generalizability and poses safety risks in dynamic environments. To address these challenges, we present a collision-free dual-arm surgical assistive robot capable of performing instrument delivery. A vision-language model is utilized to automatically generate the robot's grasping and delivery trajectories in a zero-shot manner based on surgeons' instructions. A real-time obstacle minimum distance perception method is proposed and integrated into a unified quadratic programming framework. This framework ensures reactive obstacle avoidance and self-collision prevention during the dual-arm robot's autonomous movement in dynamic environments. Extensive experimental validations demonstrate that the proposed robotic system achieves an 83.33% success rate in surgical instrument delivery while maintaining smooth, collision-free movement throughout all trials. The project page and source code are available at https://give-me-scissors.github.io/.
comment: 8 pages, 10 figures. Accepted by IEEE International Conference on Robotics and Automation (ICRA), 2026
PathSpace: Rapid continuous map approximation for efficient SLAM using B-Splines in constrained environments
Simultaneous Localization and Mapping (SLAM) plays a crucial role in enabling autonomous vehicles to navigate previously unknown environments. Semantic SLAM mostly extends visual SLAM, leveraging the higher density information available to reason about the environment in a more human-like manner. This allows for better decision making by exploiting prior structural knowledge of the environment, usually in the form of labels. Current semantic SLAM techniques still mostly rely on a dense geometric representation of the environment, limiting their ability to apply constraints based on context. We propose PathSpace, a novel semantic SLAM framework that uses continuous B-splines to represent the environment in a compact manner, while also maintaining and reasoning through the continuous probability density functions required for probabilistic reasoning. This system applies the multiple strengths of B-splines in the context of SLAM to interpolate and fit otherwise discrete sparse environments. We test this framework in the context of autonomous racing, where we exploit pre-specified track characteristics to produce significantly reduced representations at comparable levels of accuracy to traditional landmark based methods and demonstrate its potential in limiting the resources used by a system with minimal accuracy loss.
LLM-MLFFN: Multi-Level Autonomous Driving Behavior Feature Fusion via Large Language Model
Accurate classification of autonomous vehicle (AV) driving behaviors is critical for safety validation, performance diagnosis, and traffic integration analysis. However, existing approaches primarily rely on numerical time-series modeling and often lack semantic abstraction, limiting interpretability and robustness in complex traffic environments. This paper presents LLM-MLFFN, a novel large language model (LLM)-enhanced multi-level feature fusion network designed to address the complexities of multi-dimensional driving data. The proposed LLM-MLFFN framework integrates priors from large-scale pre-trained models and employs a multi-level approach to enhance classification accuracy. LLM-MLFFN comprises three core components: (1) a multi-level feature extraction module that extracts statistical, behavioral, and dynamic features to capture the quantitative aspects of driving behaviors; (2) a semantic description module that leverages LLMs to transform raw data into high-level semantic features; and (3) a dual-channel multi-level feature fusion network that combines numerical and semantic features using weighted attention mechanisms to improve robustness and prediction accuracy. Evaluation on the Waymo open trajectory dataset demonstrates the superior performance of the proposed LLM-MLFFN, achieving a classification accuracy of over 94%, surpassing existing machine learning models. Ablation studies further validate the critical contributions of multi-level fusion, feature extraction strategies, and LLM-derived semantic reasoning. These results suggest that integrating structured feature modeling with language-driven semantic abstraction provides a principled and interpretable pathway for robust autonomous driving behavior classification.
Learning Object-Centric Spatial Reasoning for Sequential Manipulation in Cluttered Environments
Robotic manipulation in cluttered environments presents a critical challenge for automation. Recent large-scale, end-to-end models demonstrate impressive capabilities but often lack the data efficiency and modularity required for retrieving objects in dense clutter. In this work, we argue for a paradigm of specialized, decoupled systems and present Unveiler, a framework that explicitly separates high-level spatial reasoning from low-level action execution. Unveiler's core is a lightweight, transformer-based Spatial Relationship Encoder (SRE) that sequentially identifies the most critical obstacle for removal. This discrete decision is then passed to a rotation-invariant Action Decoder for execution. We demonstrate that this decoupled architecture is not only more computationally efficient in terms of parameter count and inference time, but also significantly outperforms both classic end-to-end policies and modern, large-model-based baselines in retrieving targets from dense clutter. The SRE is trained in two stages: imitation learning from heuristic demonstrations provides sample-efficient initialization, after which PPO fine-tuning enables the policy to discover removal strategies that surpass the heuristic in dense clutter. Our results, achieving up to 97.6% success in partially occluded and 90.0% in fully occluded scenarios in simulation, make a case for the power of specialized, object-centric reasoning in complex manipulation tasks. Additionally, we demonstrate that the SRE's spatial reasoning transfers zero-shot to real scenes, and validate the full system on a physical robot requiring only geometric workspace calibration; no learned components are retrained.
Instant and Reversible Adhesive-free Bonding Between Silicones and Glossy Papers for Soft Robotics
Integrating silicone with non-extensible materials is a common strategy used in the fabrication of fluidically-driven soft actuators, yet conventional approaches often rely on irreversible adhesives or embedding processes that are labor-intensive and difficult to modify. This work presents silicone-glossy paper bonding (SGB), a rapid, adhesive-free, and solvent-reversible bonding approach that forms robust silicone-paper interfaces simply through contact. The SGB interface withstands high mechanical loads (shear strength > 61 kPa) and can be fully detached and reassembled via ethanol immersion without loss of performance, enabling component reuse and rapid redesign. Characterization studies indicate that surface functional groups primarily govern adhesion on the glossy paper and the modulus of the silicone, while durability and environmental response clarify the conditions for reversible debonding. The results further suggest a synergistic interaction of hydrogen bonding and oligomer diffusion, yielding strong yet reconfigurable adhesion. Soft actuators fabricated using SGB design exhibit equal or greater performance compared to conventional embedded-layer design and enable programmable actuation modes, including contraction, bending, and twisting. By simplifying fabrication while supporting reuse and rapid iteration, SGB offers a scalable and sustainable platform for rapid prototyping in soft robotics.
What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty
As artificial agents become increasingly capable, what internal structure is *necessary* for an agent to act competently under uncertainty? Classical results show that optimal control can be *implemented* using belief states or world models, but not that such representations are required. We prove quantitative "selection theorems" showing that low *average-case regret* on structured families of action-conditioned prediction tasks forces an agent to implement a predictive, structured internal state. Our results cover stochastic policies, partial observability, and evaluation under task distributions, without assuming optimality, determinism, or access to an explicit model. Technically, we reduce predictive modeling to binary "betting" decisions and show that regret bounds limit probability mass on suboptimal bets, enforcing the predictive distinctions needed to separate high-margin outcomes. In fully observed settings, this yields approximate recovery of the interventional transition kernel; under partial observability, it implies necessity of belief-like memory and predictive state, addressing an open question in prior world-model recovery work.
comment: 18 pages
A Robust Simulation Framework for Verification and Validation of Autonomous Maritime Navigation in Adverse Weather and Constrained Environments
Maritime Autonomous Surface Ships (MASS) have emerged as a promising solution to enhance navigational safety, operational efficiency, and long-term cost effectiveness. However, their reliable deployment requires rigorous verification and validation (V&V) under various environmental conditions, including extreme and safety-critical scenarios. This paper presents an enhanced virtual simulation framework to support the V&V of MASS in realistic maritime environments, with particular emphasis on the influence of weather and bathymetry on autonomous navigation performance. The framework incorporates a high-fidelity environmental modeling suite capable of simulating adverse weather conditions such as rain, fog, and wave dynamics. The key factors that affect weather, such as rain and visibility, are parameterized to affect sea-state characteristics, perception, and sensing systems, resulting in position and velocity uncertainty, reduced visibility, and degraded situational awareness. Furthermore, high-resolution bathymetric data from major U.S. ports are integrated to enable depth-aware navigation, grounding prevention capabilities, and evaluation of vessel controllability in shallow or confined waterways. The proposed framework offers extensive configurability, enabling systematic testing in a wide spectrum of maritime conditions, including scenarios that are impractical or unsafe to replicate in real-world trials, thus supporting the V&V of MASS.
COLREGs Compliant Collision Avoidance and Grounding Prevention for Autonomous Marine Navigation
Maritime Autonomous Surface Ships (MASS) are increasingly regarded as a promising solution to address crew shortages, improve navigational safety, and increase operational efficiency in the maritime industry. Nevertheless, the reliable deployment of MASS in real-world environments remains a significant challenge, particularly in congested waters where the majority of maritime accidents occur. This emphasizes the need for safe and regulation-aware motion planning strategies for MASS that are capable of operating under dynamic maritime conditions. This paper presents a unified motion planning method for MASS that achieves real-time collision avoidance, compliance with the International Regulations for Preventing Collisions at Sea (COLREGs), and grounding prevention. The proposed work introduces a convex optimization method that integrates velocity obstacle (VO)-based collision constraints, COLREGs-based directional constraints, and bathymetry-based grounding constraints to select optimal velocities in a computationally efficient, rule-compliant manner. To enhance robustness, the classical VO method is extended to consider uncertainty in the position and velocity estimates of the target vessel. Unnavigable shallow water regions obtained from bathymetric data, which are inherently nonconvex, are approximated via convex geometries using integer linear programming (ILP), allowing grounding constraints to be incorporated into the motion planning. The resulting optimization generates optimal and dynamically feasible input velocities that meet collision avoidance, regulatory compliance, kinodynamic limits, and grounding prevention requirements. Simulation results involving multi-vessel encounters demonstrate the effectiveness of the proposed method in producing safe and regulation-compliant maneuvers, highlighting the suitability of the proposed approach for real-time autonomous maritime navigation.
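As a rough illustration of the classical velocity obstacle test that the abstract builds on, the sketch below checks whether a candidate own-ship velocity leads to a future violation of a safety radius around a target vessel. It is a minimal sketch, not the paper's formulation: the function name, the 2D constant-velocity assumption, the infinite time horizon, and the single circular safety radius are all assumptions made here for illustration.

```python
import math


def in_velocity_obstacle(p_own, p_target, v_candidate, v_target, safety_radius):
    """Return True if v_candidate lies inside the (infinite-horizon) velocity obstacle.

    Hypothetical helper: both vessels are assumed to keep constant 2D velocities,
    and a conflict is any future separation below safety_radius.
    """
    # Position of the target relative to the own ship.
    dx = p_target[0] - p_own[0]
    dy = p_target[1] - p_own[1]
    # Velocity of the own ship relative to the target.
    rvx = v_candidate[0] - v_target[0]
    rvy = v_candidate[1] - v_target[1]
    speed_sq = rvx * rvx + rvy * rvy
    if speed_sq == 0.0:
        # No relative motion: separation stays at its current value.
        return math.hypot(dx, dy) < safety_radius
    # Time of closest point of approach, clamped to the future (t >= 0).
    t_cpa = max(0.0, (dx * rvx + dy * rvy) / speed_sq)
    # Separation at the closest point of approach.
    miss_distance = math.hypot(dx - rvx * t_cpa, dy - rvy * t_cpa)
    return miss_distance < safety_radius
```

One simple way to account for estimation uncertainty in such a test, again only as an assumption and not necessarily the extension used in the paper, is to enlarge `safety_radius` by a bound on the target's position error.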
Scalar-Measurement Attitude Estimation on $\mathbf{SO}(3)$ with Bias Compensation ICRA 2026
Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on $\mathbf{SO}(3)$ that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.
comment: 9 pages, 4 figures. Submitted to ICRA 2026
From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes
Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
Real-time tightly coupled GNSS and IMU integration via Factor Graph Optimization
Reliable positioning in dense urban environments remains challenging due to frequent GNSS signal blockage, multipath, and rapidly varying satellite geometry. While factor graph optimization (FGO)-based GNSS-IMU fusion has demonstrated strong robustness and accuracy, most formulations remain offline. In this work, we present a real-time tightly coupled GNSS-IMU FGO method that enables causal state estimation via incremental optimization with fixed-lag marginalization, and we evaluate its performance in a highly urbanized GNSS-degraded environment using the UrbanNav dataset.
Real-time loosely coupled GNSS and IMU integration via Factor Graph Optimization
Accurate positioning, navigation, and timing (PNT) is fundamental to the operation of modern technologies and a key enabler of autonomous systems. A very important component of PNT is the Global Navigation Satellite System (GNSS), which ensures outdoor positioning. Modern research directions have pushed the performance of GNSS localization to new heights by fusing GNSS measurements with other sensory information, mainly measurements from Inertial Measurement Units (IMU). In this paper, we propose a loosely coupled architecture to integrate GNSS and IMU measurements using a Factor Graph Optimization (FGO) framework. Because the FGO method can be computationally challenging and is often used as a post-processing method, our focus is on assessing its localization accuracy and service availability while operating in real-time in challenging environments (urban canyons). Experimental results on the UrbanNav-HK-MediumUrban-1 dataset show that the proposed approach achieves real-time operation and increased service availability compared to batch FGO methods. While this improvement comes at the cost of reduced positioning accuracy, the paper provides a detailed analysis of the trade-offs between accuracy, availability, and computational efficiency that characterize real-time FGO-based GNSS/IMU fusion.
Passive Phase-Oriented Impedance Shaping for Rapid Acceleration in Soft Robotic Swimmers IROS
Rapid acceleration and burst maneuvers in underwater robots depend less on maintaining precise resonance and more on force--velocity phase alignment during thrust generation. In this work, we investigate constrained-layer damping (CLD) as a passive mechanism for frequency-selective impedance shaping in soft robotic swimmers. Unlike conventional stiffness-tuning approaches, CLD selectively amplifies the dissipative component of bending impedance while preserving storage stiffness, passively shifting the impedance composition toward dissipative dominance as actuation frequency increases. We characterize this behavior through dry impedance measurements, demonstrate that CLD enhances thrust and alters force--motion phase relationships across Strouhal numbers in constrained propulsion tests, and validate that passive impedance shaping yields a nearly five-fold increase in peak acceleration and a three-fold increase in terminal velocity in unconstrained swimming trials. These results establish phase-oriented passive impedance modulation as a simple, control-free pathway for improving transient propulsion in soft robotic systems.
comment: Submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Sampling-Based Motion Planning with Scene Graphs Under Perception Constraints
It will be increasingly common for robots to operate in cluttered human-centered environments such as homes, workplaces, and hospitals, where the robot is often tasked to maintain perception constraints, such as monitoring people or multiple objects, for safety and reliability while executing its task. However, existing perception-aware approaches typically focus on low-degree-of-freedom (DoF) systems or only consider a single object in the context of high-DoF robots. This motivates us to consider the problem of perception-aware motion planning for high-DoF robots that accounts for multi-object monitoring constraints. We employ a scene graph representation of the environment, which offers great potential for incorporating long-horizon task and motion planning thanks to its rich semantic and spatial information. However, it does not capture perception-constrained information, such as the viewpoints the user prefers. To address these challenges, we propose MOPS-PRM, a roadmap-based motion planner that integrates the perception cost of observing multiple objects or humans directly into motion planning for high-DoF robots. The perception cost is embedded in each object as part of a scene graph and used to selectively sample configurations for roadmap construction, implicitly enforcing the perception constraints. Our method is extensively validated in both simulated and real-world experiments, achieving an improvement of ~36% in the average number of detected objects and a ~17% better track rate against other perception-constrained baselines, with comparable planning times and path lengths.
comment: 8 pages, 5 figures, Accepted to RA-L
Overlapping Domain Decomposition for Distributed Pose Graph Optimization ICRA
We present ROBO (Riemannian Overlapping Block Optimization), a distributed and parallel approach to multi-robot pose graph optimization (PGO) based on the idea of overlapping domain decomposition. ROBO offers a middle ground between centralized and fully distributed solvers, where the amount of pose information shared between robots at each optimization iteration can be set according to the available communication resources. Sharing additional pose information between neighboring robots effectively creates overlapping optimization blocks in the underlying pose graph, which substantially reduces the number of iterations required to converge. Through extensive experiments on benchmark PGO datasets, we demonstrate the applicability and feasibility of ROBO in different initialization scenarios, using various cost functions, and under different communication regimes. We also analyze the tradeoff between the increased communication and local computation required by ROBO's overlapping blocks and the resulting faster convergence. We show that overlaps with an average inter-robot data cost of only 36 Kb per iteration can converge 3.1$\times$ faster in terms of iterations than state-of-the-art distributed PGO approaches. Furthermore, we develop an asynchronous variant of ROBO that is robust to network delays and suitable for real-world robotic applications.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Navigating in Uncertain Environments with Heterogeneous Visibility
Navigating an environment with uncertain connectivity requires a strategic balance between minimizing the cost of traversal and seeking information to resolve map ambiguities. Unlike previous approaches that rely on local sensing, we utilize a framework where nodes possess varying visibility levels, allowing for observation of distant edges from certain vantage points. We propose a novel heuristic algorithm that balances the cost of detouring to high-visibility locations against the gain in information by optimizing the sum of a custom observation reward and the cost of traversal. We introduce a technique to sample the shortest path on numerous realizations of the environment, which we use to define an edge's utility for observation and to quickly estimate the path with the highest reward. Our approach can be easily adapted to a variety of scenarios by tuning a single hyperparameter that determines the importance of observation. We test our method on a variety of uncertain navigation tasks, including a map based on real-world topographical data. The method demonstrates lower mean cost of traversal compared to a shortest path baseline that does not consider observation and has exponentially lower computational overhead compared to an existing method for balancing observation with path cost minimization.
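The idea of sampling shortest paths over many realizations of an uncertain environment can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: each edge is assumed to exist independently with a given probability, and an edge's "utility" is taken here to be the fraction of sampled realizations in which it lies on a shortest path. The function names and the input format are hypothetical.

```python
import heapq
import random


def shortest_path_edges(adj, source, target):
    """Dijkstra's algorithm; returns the undirected edges on one shortest path, or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break  # the target's distance is final once it is popped
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None  # target unreachable in this realization
    edges = set()
    node = target
    while node != source:
        edges.add(frozenset((prev[node], node)))
        node = prev[node]
    return edges


def edge_utilities(edges, source, target, num_samples=200, seed=0):
    """edges: list of (u, v, cost, existence_probability) tuples (assumed format).

    Returns, per edge, the fraction of sampled realizations whose shortest
    path from source to target uses that edge.
    """
    rng = random.Random(seed)
    counts = {frozenset((u, v)): 0 for u, v, _, _ in edges}
    for _ in range(num_samples):
        # Sample one realization: keep each edge with its existence probability.
        adj = {}
        for u, v, cost, prob in edges:
            if rng.random() < prob:
                adj.setdefault(u, []).append((v, cost))
                adj.setdefault(v, []).append((u, cost))
        path = shortest_path_edges(adj, source, target)
        if path is None:
            continue
        for edge in path:
            counts[edge] += 1
    return {edge: count / num_samples for edge, count in counts.items()}
```

Edges with high utility under this proxy are the ones whose status matters most to the route, so they would be natural candidates to observe from a high-visibility node before committing to a path.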
Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion
Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/
Radar-based Pose Optimization for HD Map Generation from Noisy Multi-Drive Vehicle Fleet Data
High-definition (HD) maps are important for autonomous driving, but their manual generation and maintenance is very expensive. This motivates the use of an automated map generation pipeline. Fleet vehicles provide sufficient sensors for map generation, but their measurements are less precise, introducing noise into the mapping pipeline. This work focuses on mitigating the localization noise component by aligning radar measurements, in the form of raw radar point clouds, across vehicle poses from different drives and performing pose graph optimization to produce a globally optimized solution across all drives present in the dataset. Improved poses are first used to generate a global radar occupancy map, aimed at facilitating precise on-vehicle localization. Through qualitative analysis we show contrast-rich feature clarity, focusing on omnipresent guardrail posts as the main feature type observable in the map. Second, the improved poses can be used as a basis for an existing lane boundary map generation pipeline, substantially improving map output compared to its original pure line-detection-based optimization approach.
comment: Accepted for the 37th IEEE Intelligent Vehicles Symposium (IV 2026), 7 pages
Impact of Localization Errors on Label Quality for Online HD Map Construction
High-definition (HD) maps are crucial for autonomous vehicles, but their creation and maintenance is very costly. This motivates the idea of online HD map construction. To provide a continuous large-scale stream of training data, existing HD maps can be used as labels for onboard sensor data from consumer vehicle fleets. However, compared to current, well curated HD map perception datasets, this fleet data suffers from localization errors, resulting in distorted map labels. We introduce three kinds of localization errors, Ramp, Gaussian, and Perlin noise, to examine their influence on generated map labels. We train a variant of MapTRv2, a state-of-the-art online HD map construction model, on the Argoverse 2 dataset with various levels of localization errors and assess the degradation of model performance. Since localization errors affect distant labels more severely, but are also less significant to driving performance, we introduce a distance-based map construction metric. Our experiments reveal that localization noise affects the model performance significantly. We demonstrate that errors in heading angle exert a more substantial influence than position errors, as angle errors result in a greater distortion of labels as distance to the vehicle increases. Furthermore, we can demonstrate that the model benefits from non-distorted ground truth (GT) data and that the performance decreases more than linearly with the increase in noisy data. Our study additionally provides a qualitative evaluation of the extent to which localization errors influence the construction of HD maps.
comment: Accepted for the 36th IEEE Intelligent Vehicles Symposium (IV 2025), 8 pages
Multi-Agent-Based Simulation of Archaeological Mobility in Uneven Landscapes
Understanding mobility, movement, and interaction in archaeological landscapes is essential for interpreting past human behavior, transport strategies, and spatial organization, yet such processes are difficult to reconstruct from static archaeological evidence alone. This paper presents a multi-agent-based modeling framework for simulating archaeological mobility in uneven landscapes, integrating realistic terrain reconstruction, heterogeneous agent modeling, and adaptive navigation strategies. The proposed approach combines global path planning with local dynamic adaptation through reinforcement learning, enabling agents to respond efficiently to dynamic obstacles and interactions without costly global replanning. Real-world digital elevation data are processed into high-fidelity three-dimensional environments, preserving slope and terrain constraints that directly influence agent movement. The framework explicitly models diverse agent types, including human groups and animal-based transport systems, each parameterized by empirically grounded mobility characteristics such as load, slope tolerance, and physical dimensions. Two archaeologically inspired use cases demonstrate the applicability of the approach: a terrain-aware pursuit and evasion scenario and a comparative transport analysis involving pack animals and wheeled carts. The results highlight the impact of terrain morphology, visibility, and agent heterogeneity on movement outcomes, while the proposed hybrid navigation strategy provides a computationally efficient and interpretable solution for large-scale, dynamic archaeological simulations.
Safe Payload Transfer with Ship-Mounted Cranes: A Robust Model Predictive Control Approach
Ensuring safe real-time control of ship-mounted cranes in unstructured transportation environments requires handling multiple safety constraints while maintaining effective payload transfer performance. Unlike traditional crane systems, ship-mounted cranes are consistently subjected to significant external disturbances affecting underactuated crane dynamics due to the ship's dynamic motion response to harsh sea conditions, which can lead to robustness issues. To tackle these challenges, we propose a robust and safe model predictive control (MPC) framework and demonstrate it on a 5-DOF crane system, where a Stewart platform simulates the external disturbances that ocean surface motions would have on the supporting ship. The crane payload transfer operation must avoid obstacles and accurately place the payload within a designated target area. We use a robust zero-order control barrier function (R-ZOCBF)-based safety constraint in the nonlinear MPC to ensure safe payload positioning, while time-varying bounding boxes are utilized for collision avoidance. We introduce a new optimization-based online robustness parameter adaptation scheme to reduce the conservativeness of R-ZOCBFs. Experimental trials on a crane prototype demonstrate the overall performance of our safe control approach under significant perturbing motions of the crane base. While our focus is on crane-facilitated transfer, the methods more generally apply to safe robotically-assisted parts mating and parts insertion.
PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization ICRA 2026
Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail with poor initialization during large motions, while learning-based approaches provide robustness but lack sufficient accuracy for dense reconstruction. We address this challenge through a combination of learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks, while maintaining comparable accuracy on stable motion sequences. The system operates in real-time, showcasing that combining simple and principled techniques can achieve both robustness for unstable motions and accuracy for dense reconstruction. Code released: https://github.com/siyandong/PROFusion.
comment: ICRA 2026
SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction
Realistic and interactive traffic simulation is essential for training and evaluating autonomous driving systems. However, most existing data-driven simulation methods rely on static initialization or log-replay data, limiting their ability to model dynamic, long-horizon scenarios with evolving agent populations. We propose SceneStreamer, a unified autoregressive framework for continuous scenario generation that represents the entire scene as a sequence of tokens, including traffic light signals, agent states, and motion vectors, and generates them step by step with a transformer model. This design enables SceneStreamer to continuously introduce and retire agents over an unbounded horizon, supporting realistic long-duration simulation. Experiments demonstrate that SceneStreamer produces realistic, diverse, and adaptive traffic behaviors. Furthermore, reinforcement learning policies trained in SceneStreamer-generated scenarios achieve superior robustness and generalization, validating its utility as a high-fidelity simulation environment for autonomous driving. More information is available at https://vail-ucla.github.io/scenestreamer/ .
Learning Acrobatic Flight from Preferences
Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. However, manually designed reward functions for such tasks often fail to capture the qualities that matter: we find that hand-crafted rewards agree with human judgment only 60.7% of the time, underscoring the need for preference-driven approaches. In this work, we propose Reward Ensemble under Confidence (REC), a probabilistic reward learning framework for PbRL that explicitly models per-timestep reward uncertainty through an ensemble of distributional reward models. By propagating uncertainty into the preference loss and leveraging disagreement for exploration, REC achieves 88.4% of shaped reward performance on acrobatic quadrotor control, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them zero-shot to the real world, demonstrating complex acrobatic maneuvers learned purely from preference feedback. We further validate REC on a continuous control benchmark, confirming its applicability beyond the domain of aerial robotics.
comment: 8 pages, 6 figures
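For readers unfamiliar with preference-based reward learning, the sketch below shows the standard Bradley-Terry preference loss over two trajectory segments, plus a simple per-timestep disagreement measure across an ensemble of reward models. It is a generic illustration, not the REC formulation from the abstract: how REC propagates uncertainty into the loss is not specified there, and all function names here are hypothetical.

```python
import math
import statistics


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B,
    based on the difference of summed predicted rewards."""
    return _sigmoid(sum(rewards_a) - sum(rewards_b))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss; label is 1.0 if A was preferred, 0.0 if B was."""
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def ensemble_disagreement(per_model_rewards):
    """per_model_rewards: one list of per-timestep rewards per ensemble member.

    Returns the population standard deviation across members at each timestep,
    one common proxy for reward uncertainty.
    """
    return [statistics.pstdev(step) for step in zip(*per_model_rewards)]
```

A disagreement signal of this kind could, for example, be added as an exploration bonus, which is one plausible reading of "leveraging disagreement for exploration".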
GrandTour: A Legged Robotics Dataset in the Wild for Multi-Modal Perception and State Estimation
Accurate state estimation and multi-modal perception are prerequisites for autonomous legged robots in complex, large-scale environments. To date, no large-scale public legged-robot dataset captures the real-world conditions needed to develop and benchmark algorithms for legged-robot state estimation, perception, and navigation. To address this, we introduce the GrandTour dataset, a multi-modal legged-robotics dataset collected across challenging outdoor and indoor environments, featuring an ANYbotics ANYmal-D quadruped equipped with the Boxi multi-modal sensor payload. GrandTour spans a broad range of environments and operational scenarios across distinct test sites, ranging from alpine scenery and forests to demolished buildings and urban areas, and covers a wide variation in scale, complexity, illumination, and weather conditions. The dataset provides time-synchronized sensor data from spinning LiDARs, multiple RGB cameras with complementary characteristics, proprioceptive sensors, and stereo depth cameras. Moreover, it includes high-precision ground-truth trajectories from satellite-based RTK-GNSS and a Leica Geosystems total station. This dataset supports research in SLAM, high-precision state estimation, and multi-modal learning, enabling rigorous evaluation and development of new approaches to sensor fusion in legged robotic systems. With its extensive scope, GrandTour represents the largest open-access legged-robotics dataset to date. The dataset is available at https://grand-tour.leggedrobotics.com, on HuggingFace (ROS-independent), and in ROS formats, along with tools and demo resources.
comment: Turcan Tuna and Jonas Frey contributed equally. Submitted to Sage The International Journal of Robotics Research
Floating-Base Deep Lagrangian Networks
Grey-box methods for system identification combine deep learning with physics-informed constraints, capturing complex dependencies while improving out-of-distribution generalization. Despite the growing importance of floating-base systems such as humanoids and quadrupeds, current grey-box models ignore their specific physical constraints. For instance, the inertia matrix is not only positive definite but also exhibits branch-induced sparsity and input independence. Moreover, the 6x6 composite spatial inertia of the floating base inherits properties of single-rigid-body inertia matrices. As we show, this includes the triangle inequality on the eigenvalues of the composite rotational inertia. To address the lack of physical consistency in deep learning models of floating-base systems, we introduce a parameterization of inertia matrices that satisfies all these constraints. Inspired by Deep Lagrangian Networks (DeLaN), we train neural networks to predict physically plausible inertia matrices that minimize inverse dynamics error under Lagrangian mechanics. For evaluation, we collected and released a dataset on multiple quadrupeds and humanoids. In these experiments, our Floating-Base Deep Lagrangian Networks (FeLaN) achieve better overall performance on both simulated and real robots, while providing greater physical interpretability.
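The triangle inequality mentioned above is a standard property of rigid-body rotational inertia: each principal moment is at most the sum of the other two. The sketch below checks that condition and shows one well-known way to satisfy it by construction, building the principal moments from non-negative second moments of mass. It is an illustrative sketch only, not the parameterization used in FeLaN, and the function names are hypothetical.

```python
def satisfies_triangle_inequality(i1, i2, i3, tol=0.0):
    """Check physical plausibility of principal moments of inertia.

    All moments must be positive, and each must not exceed the sum of the
    other two (within an optional tolerance).
    """
    if min(i1, i2, i3) <= 0.0:
        return False
    return (
        i1 <= i2 + i3 + tol
        and i2 <= i1 + i3 + tol
        and i3 <= i1 + i2 + tol
    )


def principal_moments_from_second_moments(d1, d2, d3):
    """Build principal moments from non-negative second moments of mass.

    With d_k >= 0 (e.g. the integral of x_k^2 over the mass distribution),
    the moments I1 = d2 + d3, I2 = d1 + d3, I3 = d1 + d2 satisfy the
    triangle inequality automatically, since I1 + I2 - I3 = 2 * d3 >= 0.
    """
    if min(d1, d2, d3) < 0.0:
        raise ValueError("second moments of mass must be non-negative")
    return (d2 + d3, d1 + d3, d1 + d2)
```

A learned model could output the `d` values through a non-negative activation so that every predicted inertia is physically plausible; whether FeLaN does this is not stated in the abstract.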
Agility Meets Stability: Versatile Humanoid Control with Heterogeneous Data
Humanoid robots are envisioned to perform a wide range of tasks in human-centered environments, requiring controllers that combine agility with robust balance. Recent advances in locomotion and whole-body tracking have enabled impressive progress in either agile dynamic skills or stability-critical behaviors, but existing methods remain specialized, focusing on one capability while compromising the other. In this work, we introduce AMS (Agility Meets Stability), the first framework that unifies both dynamic motion tracking and extreme balance maintenance in a single policy. Our key insight is to leverage heterogeneous data sources: human motion capture datasets that provide rich, agile behaviors, and physically constrained synthetic balance motions that capture stability configurations. To reconcile the divergent optimization goals of agility and stability, we design a hybrid reward scheme that applies general tracking objectives across all data while injecting balance-specific priors only into synthetic motions. Further, an adaptive learning strategy with performance-driven sampling and motion-specific reward shaping enables efficient training across diverse motion distributions. We validate AMS extensively in simulation and on a real Unitree G1 humanoid. Experiments demonstrate that a single policy can execute agile skills such as dancing and running, while also performing zero-shot extreme balance motions like Ip Man's Squat, highlighting AMS as a versatile control paradigm for future humanoid applications.
ConEQsA: Concurrent and Asynchronous Embodied Questions Scheduling and Answering
This paper formulates the Embodied Questions Answering (EQsA) problem, introduces a corresponding benchmark, and proposes an agentic system to tackle the problem. Classical Embodied Question Answering (EQA) is typically formulated as answering one single question by actively exploring a 3D environment. Real deployments, however, often demand handling multiple questions that may arrive asynchronously and carry different urgencies. We formalize this setting as Embodied Questions Answering (EQsA) and present ConEQsA, an agentic framework for concurrent, urgency-aware scheduling and answering. ConEQsA leverages shared group memory to reduce redundant exploration, and a priority-planning method to dynamically schedule questions. To evaluate the EQsA setting fairly, we contribute the Concurrent Asynchronous Embodied Questions (CAEQs) benchmark containing 40 indoor scenes and five questions per scene (200 in total), featuring asynchronous follow-up questions and human-annotated urgency labels. We further propose metrics for EQsA performance: Direct Answer Rate (DAR), and Normalized Urgency-Weighted Latency (NUWL), which serve as a fair evaluation protocol for EQsA. Empirical evaluations demonstrate that ConEQsA consistently outperforms strong sequential baselines, and show that urgency-aware, concurrent scheduling is key to making embodied agents responsive and efficient under realistic, multi-question workloads. Code is available on https://anonymous.4open.science/r/ConEQsA.
comment: 8 pages, 6 figures
Integration of UWB Radar on Mobile Robots for Continuous Obstacle and Environment Mapping
This paper presents an infrastructure-free approach for obstacle detection and environmental mapping using ultra-wideband (UWB) radar mounted on a mobile robotic platform. Traditional sensing modalities such as visual cameras and Light Detection and Ranging (LiDAR) fail in environments with poor visibility due to darkness, smoke, or reflective surfaces. In these vision-impaired conditions, UWB radar offers a promising alternative. To this end, this work explores the suitability of robot-mounted UWB radar for environmental mapping in anchor-free, unknown scenarios. The study investigates how different materials (metal, concrete and plywood) and UWB radio channels (5 and 9) influence the Channel Impulse Response (CIR). Furthermore, a processing pipeline is proposed to achieve reliable mapping of detected obstacles, consisting of 3 steps: 1) target identification (based on CIR peak detection); 2) filtering (based on peak properties, signal-to-noise score, and phase-difference of arrival); and 3) clustering (based on distance estimation and angle-of-arrival estimation). The proposed approach successfully reduces noise and multipath effects, achieving high obstacle detection performance across a range of materials. Even in challenging low-reflectivity scenarios such as concrete, the method achieves a precision of 73.42% and a recall of 83.38% on channel 9. This work offers a foundation for further development of UWB-based localisation and mapping (SLAM) systems that do not rely on visual features and, unlike conventional UWB localisation systems, do not require fixed anchor nodes for triangulation.
CoRL-MPPI: Enhancing MPPI With Learnable Behaviours For Efficient And Provably-Safe Multi-Robot Collision Avoidance
Decentralized collision avoidance is a core challenge for scalable multi-robot systems. One promising approach to this problem is Model Predictive Path Integral (MPPI) control -- a framework that naturally handles arbitrary motion models and provides strong theoretical guarantees. Still, in practice, MPPI-based controllers may produce suboptimal trajectories because their performance relies heavily on uninformed random sampling. In this work, we introduce CoRL-MPPI, a novel fusion of Cooperative Reinforcement Learning and MPPI to address this limitation. We train an action policy (approximated by a deep neural network) in simulation that learns local cooperative collision avoidance behaviors. This learned policy is then embedded into the MPPI framework to guide its sampling distribution, biasing it towards more intelligent and cooperative actions. Notably, CoRL-MPPI preserves all the theoretical guarantees of regular MPPI. We evaluate our approach in dense, dynamic simulation environments against state-of-the-art baselines, such as ORCA, BVC, RL-RVO-NAV and classical MPPI. Our results demonstrate that CoRL-MPPI significantly improves navigation efficiency (measured by success rate and makespan) and safety, enabling agile and robust multi-robot navigation.
comment: The manuscript includes 9 pages, 5 figures, and 1 table. This replacement revises and extends the original submission. The updated version adds a validation in Gazebo. It also expands the experimental evaluation by adding baselines and an evaluation scenario. In addition, the cost functions in MPPI-based methods were refined, leading to improved experimental performance
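The policy-guided sampling idea in the CoRL-MPPI abstract above can be illustrated with a toy, one-dimensional sketch. Everything here is an assumption made for illustration: the single-integrator dynamics, the quadratic cost, the hand-written `policy` function standing in for a learned network, and all parameter values. It also uses a simplified softmin weighting and omits the control-cost correction term of full MPPI, so it is not the authors' implementation.

```python
import math
import random

def policy(x, goal):
    # Hypothetical stand-in for a learned policy: step toward the goal, clipped to [-1, 1].
    return max(-1.0, min(1.0, goal - x))

def rollout_cost(x, controls, goal, dt=0.1):
    # Single-integrator dynamics x <- x + u*dt with a quadratic distance-to-goal cost.
    cost = 0.0
    for u in controls:
        x += u * dt
        cost += (x - goal) ** 2
    return cost

def mppi_step(x, goal, horizon=10, samples=256, sigma=0.5, lam=1.0, seed=0):
    rng = random.Random(seed)
    mean = policy(x, goal)  # sampling is centred on the policy's suggestion, not on zero
    seqs = [[rng.gauss(mean, sigma) for _ in range(horizon)] for _ in range(samples)]
    costs = [rollout_cost(x, s, goal) for s in seqs]
    c_min = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(weights)
    # Softmin-weighted average of the first control of every sampled sequence.
    return sum(w * s[0] for w, s in zip(weights, seqs)) / total

u_right = mppi_step(0.0, goal=5.0)   # goal lies to the right of the robot
u_left = mppi_step(0.0, goal=-5.0)   # goal lies to the left of the robot
```

In this sketch the policy only shifts the mean of the sampling distribution; the cost-based reweighting is otherwise unchanged.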
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI ICLR 2026
Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments -- particularly gaming -- offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for robotics embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), our 1B-parameter model achieves 96.6% success on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models up to 7x larger, such as π_{0} (3.3B) and OpenVLA (7B). These results demonstrate that sensorimotor primitives learned from digital interactions transfer effectively to real-world physical tasks, establishing desktop pretraining as a practical paradigm for embodied AI. All resources are publicly available at https://worv-ai.github.io/d2e.
comment: Accepted to ICLR 2026
On Adversarial Attacks In Acoustic Drone Localization
Multi-rotor aerial autonomous vehicles (MAVs, more widely known as "drones") have been generating increased interest in recent years due to their growing applicability in a vast and diverse range of fields (e.g., agriculture, commercial delivery, search and rescue). The sensitivity of visual-based methods to lighting conditions and occlusions has prompted growing study of navigation reliant on other modalities, such as acoustic sensing. A major concern in using drones at scale for tasks in non-controlled environments is the potential threat of adversarial attacks on their navigational systems, exposing users to mission-critical failures, security breaches, and compromised safety outcomes that can endanger operators and bystanders. While previous work shows impressive progress in acoustic-based drone localization, prior research on adversarial attacks against drone navigation only addresses visual sensing-based systems. In this work, we aim to address this gap by supplying a comprehensive analysis of the effect of PGD adversarial attacks on acoustic drone localization. We furthermore develop an algorithm for adversarial perturbation recovery, capable of markedly diminishing the effect of such attacks in our setting.
Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning ICLR 2026
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper-cpo/ .
comment: In ICLR 2026. Website at https://naoki04.github.io/paper-cpo/
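The abstract above says diversity is regulated through KL constraints between policies. As a minimal sketch of what such a term could look like, the code below computes the closed-form KL divergence between two diagonal Gaussian policies and applies a hinge penalty once it exceeds a limit. The Gaussian policy form, the `kl_penalty` function, and its `limit` and `coef` parameters are assumptions for illustration; the abstract does not specify how the constraint is enforced.

```python
import math

def kl_diag_gaussians(mean_p, std_p, mean_q, std_q):
    # KL(p || q) for diagonal Gaussians, summed over action dimensions.
    total = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total

def kl_penalty(kl, limit, coef=1.0):
    # Hypothetical hinge penalty: zero inside the limit, linear outside it.
    return coef * max(0.0, kl - limit)

# Identical policies have zero divergence.
kl_same = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
# A follower whose mean is shifted by one standard deviation from the leader.
kl_shift = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])
penalty = kl_penalty(kl_shift, limit=0.2)
```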
Design Framework and Manufacturing of an Active Magnetic Bearing Spindle for Micro-Milling Applications
Micro-milling spindles require high rotational speeds where conventional rolling element bearings face limitations such as friction and thermal expansion. Active magnetic bearings (AMBs) address these challenges by providing non-contact and lubrication-free operation at ultra-high speeds with the ability to actively regulate spindle dynamics. The existing literature on AMB spindles has mainly reported specific prototype realizations or control system implementations for specific spindle dynamics. Consequently, design knowledge remains fragmented across isolated successful studies. This paper addresses this gap by presenting a systematic and iterative framework to design and manufacture a micro-milling AMB spindle. The process involves a multidisciplinary design flow with a focus on critical practical aspects of manufacturing. The realized spindle is reported as a case study.
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
To operate effectively in the real world, robots should integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance with the help of embodied reasoning. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize embodied reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 33% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 96% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
comment: 48 pages
D-GVIO: A Buffer-Driven and Efficient Decentralized GNSS-Visual-Inertial State Estimator for Multi-Agent Systems ICRA 2026
Cooperative localization is essential for swarm applications like collaborative exploration and search-and-rescue missions. However, maintaining real-time capability, robustness, and computational efficiency on resource-constrained platforms presents significant challenges. To address these challenges, we propose D-GVIO, a buffer-driven and fully decentralized GNSS-Visual-Inertial Odometry (GVIO) framework that leverages a novel buffering strategy to support efficient and robust distributed state estimation. The proposed framework is characterized by four core mechanisms. Firstly, through covariance segmentation, covariance intersection and buffering strategy, we modularize propagation and update steps in distributed state estimation, significantly reducing computational and communication burdens. Secondly, the left-invariant extended Kalman filter (L-IEKF) is adopted for information fusion, which exhibits superior state estimation performance over the traditional extended Kalman filter (EKF) since its state transition matrix is independent of the system state. Thirdly, a buffer-based re-propagation strategy is employed to handle delayed measurements efficiently and accurately by leveraging the L-IEKF, eliminating the need for costly re-computation. Finally, an adaptive buffer-driven outlier detection method is proposed to dynamically cull GNSS outliers, enhancing robustness in GNSS-challenged environments.
comment: Accepted by ICRA 2026
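Covariance intersection, one of the building blocks named in the D-GVIO abstract above, fuses two estimates whose cross-correlation is unknown. The sketch below is the textbook form for a two-dimensional state, choosing the mixing weight by a grid search that minimises the trace of the fused covariance. The 2x2 restriction, the grid search, and the example numbers are assumptions for illustration and say nothing about how D-GVIO itself applies the technique.

```python
def inv2(m):
    # Inverse of a 2x2 matrix given as nested lists.
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def covariance_intersection(xa, Pa, xb, Pb, steps=100):
    # Fuse (xa, Pa) and (xb, Pb): P^-1 = w * Pa^-1 + (1 - w) * Pb^-1.
    Ia, Ib = inv2(Pa), inv2(Pb)
    best = None
    for k in range(steps + 1):
        w = k / steps
        info = [[w * Ia[i][j] + (1 - w) * Ib[i][j] for j in range(2)] for i in range(2)]
        P = inv2(info)
        trace = P[0][0] + P[1][1]
        if best is None or trace < best[0]:
            # Information-weighted combination of the two means.
            v = [sum(w * Ia[i][j] * xa[j] + (1 - w) * Ib[i][j] * xb[j] for j in range(2))
                 for i in range(2)]
            x = [P[i][0] * v[0] + P[i][1] * v[1] for i in range(2)]
            best = (trace, w, x, P)
    return best[1], best[2], best[3]

# Two estimates that are each accurate along a different axis.
w, x, P = covariance_intersection(
    [0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]],
    [2.0, 2.0], [[4.0, 0.0], [0.0, 1.0]],
)
```

For this symmetric example the best weight is 0.5, and the fused mean leans toward whichever estimate is more certain along each axis.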
Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated objects, which are essential for tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine Monte Carlo Tree Search (MCTS) for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.
comment: Project Page: https://sites.google.com/deemos.com/kinematify
Learning Agile Gate Traversal via Analytical Optimal Policy Gradient
Traversing narrow gates presents a significant challenge and has become a standard benchmark for evaluating agile and precise quadrotor flight. Traditional modularized autonomous flight stacks require extensive design and parameter tuning, while end-to-end reinforcement learning (RL) methods often suffer from low sample efficiency, limited interpretability, and degraded disturbance rejection under unseen perturbations. In this work, we present a novel hybrid framework that adaptively fine-tunes model predictive control (MPC) parameters online using outputs from a neural network (NN) trained offline. The NN jointly predicts a reference pose and cost function weights, conditioned on the coordinates of the gate corners and the current drone state. To achieve efficient training, we derive analytical policy gradients not only for the MPC module but also for an optimization-based gate traversal detection module. Hardware experiments demonstrate that our method enables fast and accurate quadrotor traversal through narrow gates in confined environments and demonstrates effective disturbance rejection against collision-induced perturbations.
comment: 8 pages, 8 figures
Self-Improving Loops for Visual Robotic Planning ICLR 2026
Video generative models trained on expert demonstrations have been utilized as performant text-conditioned visual planners for solving robotic tasks. However, generalization to unseen tasks remains a challenge. Whereas improved generalization may be facilitated by leveraging learned prior knowledge from additional pre-collected offline data sources, such as web-scale video datasets, in the era of experience we aim to design agents that can continuously improve in an online manner from self-collected behaviors. In this work we thus propose the Self-Improving Loops for Visual Robotic Planning (SILVR), where an in-domain video model iteratively updates itself on self-produced trajectories, and steadily improves its performance for a specified task of interest. We apply SILVR to a diverse suite of MetaWorld tasks, as well as two manipulation tasks on a real robot arm, and find that performance improvements continuously emerge over multiple iterations for novel tasks unseen during initial in-domain video model training. We demonstrate that SILVR is robust in the absence of human-provided ground-truth reward functions or expert-quality demonstrations, and is preferable to alternate approaches that utilize online experience in terms of performance and sample efficiency.
comment: ICLR 2026. Project Page: https://diffusion-supervision.github.io/silvr/
osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features, achieving a high level of detail and guiding robots to find objects specified by open-vocabulary language queries. While the issue of scalability for such approaches has received some attention, another fundamental problem is that high-detail object mapping quickly becomes outdated, as objects get moved around a lot. In this work, we develop a mapping and navigation system for object-goal navigation that, from the ground up, considers the possibilities that a queried object can have moved, or may not be mapped at all. Instead of striving for high-fidelity mapping detail, we consider that the main purpose of a map is to provide environment grounding and context, which we combine with the semantic priors of LLMs to reason about object locations and deploy an active, online approach to navigate to the objects. Through simulated and real-world experiments we find that our approach tends to have higher retrieval success at shorter path lengths for static objects and by far outperforms prior approaches in cases of dynamic or unmapped object queries. We provide our code and dataset at: https://github.com/xiexiexiaoxiexie/osmAG-LLM.
comment: accepted at RA-L 2026
KILO-EKF: Koopman-Inspired Learned Observations Extended Kalman Filter IROS
We present the Koopman-Inspired Learned Observations Extended Kalman Filter (KILO-EKF), which combines a standard EKF prediction step with a correction step based on a Koopman-inspired measurement model learned from data. By lifting measurements into a feature space where they are linear in the state, KILO-EKF enables flexible modeling of complex or poorly calibrated sensors while retaining the structure and efficiency of recursive filtering. The resulting linear-Gaussian measurement model is learned in closed form from groundtruth training data, without iterative optimization or reliance on an explicit parametric sensor model. At inference, KILO-EKF performs a standard EKF update using Jacobians obtained via the learned lifting. We validate the approach on a real-world quadrotor localization task using an IMU, ultra-wideband (UWB) sensors, and a downward-facing laser. We compare against multiple EKF baselines with varying levels of sensor calibration. KILO-EKF achieves better accuracy and consistency compared to data-calibrated baselines, and significantly outperforms EKFs that rely on imperfect geometric models, while maintaining real-time inference and fast training. These results demonstrate the effectiveness of Koopman-inspired measurement learning as a scalable alternative to traditional model-based calibration.
comment: Submitted to IEEE/RSJ IROS. 8 pages, 9 figures, 1 table
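The abstract above describes lifting measurements into a space where they are linear in the state, then fitting the measurement model in closed form. A scalar toy version is sketched below. The sensor model `y = exp(0.5 * x)`, the logarithmic `lift`, the noise level, and all function names are assumptions chosen so that the lifted measurement is exactly linear. The actual KILO-EKF lifting and state dimensions are not given in the abstract.

```python
import math
import random

def lift(y):
    # Hypothetical lifting: for the toy sensor y = exp(0.5 * x), log(y) is linear in x.
    return math.log(y)

def fit_linear_measurement(xs, zs):
    # Closed-form least squares for z ~ h * x + b, plus the residual variance r.
    n = len(xs)
    mx, mz = sum(xs) / n, sum(zs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    h = sxz / sxx
    b = mz - h * mx
    r = sum((z - (h * x + b)) ** 2 for x, z in zip(xs, zs)) / (n - 2)
    return h, b, r

def kalman_update(x_pred, p_pred, z, h, b, r):
    # Scalar Kalman correction using the learned linear-Gaussian measurement model.
    s = h * p_pred * h + r
    k = p_pred * h / s
    return x_pred + k * (z - (h * x_pred + b)), (1.0 - k * h) * p_pred

# Ground-truth states paired with noisy raw measurements (synthetic training data).
rng = random.Random(0)
train_x = [5.0 * i / 199 for i in range(200)]
train_y = [math.exp(0.5 * x + rng.gauss(0.0, 0.01)) for x in train_x]
h, b, r = fit_linear_measurement(train_x, [lift(y) for y in train_y])

# Correct a poor prior (x = 1) with one noise-free measurement taken at the true state x = 2.
x_post, p_post = kalman_update(1.0, 1.0, lift(math.exp(0.5 * 2.0)), h, b, r)
```

Training here is a single pass of sums with no iterative optimisation, which is the property the abstract highlights.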
Safety Guardrails for LLM-Enabled Robots
Although the integration of large language models (LLMs) into robotics has unlocked transformative capabilities, it has also introduced significant safety concerns, ranging from average-case LLM errors (e.g., hallucinations) to adversarial jailbreaking attacks, which can produce harmful robot behavior in real-world settings. Traditional robot safety approaches do not address the contextual vulnerabilities of LLMs, and current LLM safety approaches overlook the physical risks posed by robots operating in real-world environments. To ensure the safety of LLM-enabled robots, we propose RoboGuard, a two-stage guardrail architecture. RoboGuard first contextualizes pre-defined safety rules by grounding them in the robot's environment using a root-of-trust LLM. This LLM is shielded from malicious prompts and employs chain-of-thought (CoT) reasoning to generate context-dependent safety specifications, such as temporal logic constraints. RoboGuard then resolves conflicts between these contextual safety specifications and potentially unsafe plans using temporal logic control synthesis, ensuring compliance while minimally violating user preferences. In simulation and real-world experiments that consider worst-case jailbreaking attacks, RoboGuard reduces the execution of unsafe plans from over 92% to below 3% without compromising performance on safe plans. We also demonstrate that RoboGuard is resource-efficient, robust against adaptive attacks, and enhanced by its root-of-trust LLM's CoT reasoning. These results demonstrate the potential of RoboGuard to mitigate the safety risks and enhance the reliability of LLM-enabled robots. We provide additional resources at https://robo-guard.github.io/.
Q-Guided Stein Variational Model Predictive Control via RL-informed Policy Prior
Model Predictive Control (MPC) enables reliable trajectory optimization under dynamics constraints, but often depends on accurate dynamics models and carefully hand-designed cost functions. Recent learning-based MPC methods aim to reduce these modeling and cost-design burdens by learning dynamics, priors, or value-related guidance signals. Yet many existing approaches still rely on deterministic gradient-based solvers (e.g., differentiable MPC) or parametric sampling-based updates (e.g., CEM/MPPI), which can lead to mode collapse and convergence to a single dominant solution. We propose Q-SVMPC, a Q-guided Stein variational MPC method with an RL-informed policy prior, which casts learning-based MPC as trajectory-level posterior inference and refines trajectory particles via SVGD under learned soft Q-value guidance to explicitly preserve diverse solutions. Experiments on navigation, robotic manipulation, and a real-world fruit-picking task show improved sample efficiency, stability, and robustness over MPC, model-free RL, and learning-based MPC baselines.
comment: 8 pages, 6 figures
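Q-SVMPC refines trajectory particles with Stein variational gradient descent (SVGD). The sketch below shows the standard SVGD update on scalar particles with an RBF kernel and a standard normal target. The target distribution, fixed bandwidth, step size, and particle count are assumptions for illustration. The paper's particles are trajectories guided by a learned soft Q-value, which is not modelled here.

```python
import math

def grad_log_p(x):
    # Score of a standard normal target: d/dx log N(x; 0, 1).
    return -x

def svgd_step(particles, step=0.1, bandwidth=1.0):
    n = len(particles)
    h2 = bandwidth ** 2
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-(xj - xi) ** 2 / (2.0 * h2))
            # Kernel-weighted attraction toward high density, plus the kernel
            # gradient with respect to xj, which pushes xi away from xj.
            phi += k * grad_log_p(xj) + (-(xj - xi) / h2) * k
        updated.append(xi + step * phi / n)
    return updated

# Start with all particles bunched together far from the mode.
particles = [3.0 + 0.01 * i for i in range(20)]
for _ in range(1000):
    particles = svgd_step(particles)

mean = sum(particles) / len(particles)
spread = max(particles) - min(particles)
```

The repulsive kernel-gradient term is what keeps the particles from collapsing onto a single solution, which is the behaviour the abstract contrasts with CEM/MPPI-style updates.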
Ask, Reason, Assist: Decentralized Robot Collaboration via Language and Logic
Increased robot deployment, such as in warehousing, has revealed a need for seamless collaboration among heterogeneous robot teams to resolve unforeseen conflicts. To address this challenge, we propose a novel decentralized framework that enables robots to request and provide help. The process begins when a robot detects a conflict and uses a Large Language Model (LLM) to decide whether external assistance is required. If so, it crafts and broadcasts a natural language (NL) help request. Potential helper robots reason over the request and respond with offers of assistance, including information about the effect on their ongoing tasks. Helper reasoning is implemented via an LLM grounded in Signal Temporal Logic (STL) using a Backus-Naur Form (BNF) grammar, ensuring syntactically valid NL-to-STL translations, which are then solved as a Mixed Integer Linear Program (MILP). Finally, the requester robot selects a helper by reasoning over the expected increase in system-level total task completion time. We evaluated our framework through experiments comparing different helper-selection strategies and found that considering multiple offers allows the requester to minimize added makespan. Our approach significantly outperforms heuristics such as selecting the nearest available candidate helper robot, and achieves performance comparable to a centralized "Oracle" baseline but without heavy information demands.
comment: arXiv admin note: substantial text overlap with arXiv:2505.13376
VIGOR: Visual Goal-In-Context Inference for Unified Humanoid Fall Safety
Reliable fall recovery is critical for humanoids operating in cluttered environments. Unlike quadrupeds or wheeled robots, humanoids experience high-energy impacts, complex whole-body contact, and large viewpoint changes during a fall, making recovery essential for continued operation. Existing methods fragment fall safety into separate problems such as fall avoidance, impact mitigation, and stand-up recovery, or rely on end-to-end policies trained without vision through reinforcement learning or imitation learning, often on flat terrain. At a deeper level, fall safety is treated as monolithic data complexity, coupling pose, dynamics, and terrain and requiring exhaustive coverage, limiting scalability and generalization. We present a unified fall safety approach that spans all phases of fall recovery. It builds on two insights: 1) Natural human fall and recovery poses are highly constrained and transferable from flat to complex terrain through alignment, and 2) Fast whole-body reactions require integrated perceptual-motor representations. We train a privileged teacher using sparse human demonstrations on flat terrain and simulated complex terrains, and distill it into a deployable student that relies only on egocentric depth and proprioception. The student learns how to react by matching the teacher's goal-in-context latent representation, which combines the next target pose with the local terrain, rather than separately encoding what it must perceive and how it must act. Results in simulation and on a real Unitree G1 humanoid demonstrate robust, zero-shot fall safety across diverse non-flat environments without real-world fine-tuning. The project page is available at https://vigor2026.github.io/
A Self-Supervised Learning Approach with Differentiable Optimization for UAV Trajectory Planning ICRA 2026
While Unmanned Aerial Vehicles (UAVs) have gained significant traction across various fields, path planning in 3D environments remains a critical challenge, particularly under size, weight, and power (SWAP) constraints. Traditional modular planning systems often introduce latency and suboptimal performance due to limited information sharing and local minima issues. End-to-end learning approaches streamline the pipeline by mapping sensory observations directly to actions but require large-scale datasets, face significant sim-to-real gaps, or lack dynamical feasibility. In this paper, we propose a self-supervised UAV trajectory planning pipeline that integrates a learning-based depth perception with differentiable trajectory optimization. A 3D cost map guides UAV behavior without expert demonstrations or human labels. Additionally, we incorporate a neural network-based time allocation strategy to improve the efficiency and optimality. The system thus combines robust learning-based perception with reliable physics-based optimization for improved generalizability and interpretability. Both simulation and real-world experiments validate our approach across various environments, demonstrating its effectiveness and robustness. Our method achieves a 31.33% improvement in position tracking error and 49.37% reduction in control effort compared to the state-of-the-art.
comment: Accepted by ICRA 2026
Agile Flight Emerges from Multi-Agent Competitive Racing
Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: https://github.com/Jirl-upenn/AgileFlight_MultiAgent
VITA: Vision-to-Action Flow Matching Policy
Conventional flow matching and diffusion-based policies sample via iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA, VIsion-To-Action policy, a noise-free and conditioning-free flow matching policy learning framework that directly flows from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent action space collapse during end-to-end training, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equation) solving steps. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5x-2x faster inference compared to conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies. Project page: https://ucd-dare.github.io/VITA/.
comment: Project page: https://ucd-dare.github.io/VITA/ Code: https://github.com/ucd-dare/VITA
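The abstract above has the flow start from a visual latent rather than from noise, which requires source and target to share a dimensionality. A minimal sketch of the usual linear-path flow matching quantities under that setup follows. The three-dimensional latents, the `oracle` velocity function standing in for a trained network, and the Euler solver are assumptions for illustration, not VITA's architecture.

```python
def interpolate(x0, x1, t):
    # Point on the straight path from source x0 (t = 0) to target x1 (t = 1).
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    # For a linear path, the regression target is the constant displacement.
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predicted_v, x0, x1):
    # Mean squared error between a predicted velocity and the target velocity.
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target)) / len(target)

def euler_integrate(x0, velocity_fn, steps=10):
    # Solve dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler.
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

visual_latent = [0.2, -1.0, 0.5]   # hypothetical source: encoded image
action_latent = [1.0, 0.0, -0.5]   # hypothetical target: encoded action

def oracle(x, t):
    # Stand-in for a trained network that has learned the exact velocity field.
    return target_velocity(visual_latent, action_latent)

generated = euler_integrate(visual_latent, oracle)
midpoint = interpolate(visual_latent, action_latent, 0.5)
```

Because the source is the visual latent itself, the velocity function in this sketch takes no separate conditioning input.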
RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model CVPR2026
Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose RehearseVLA, an RL-based post-training framework that replaces physical interaction with a low-cost world model-based virtual simulator. RehearseVLA consists of two key components: (1) a physically-consistent world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that RehearseVLA effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings. Our code is available at https://github.com/amap-cvlab/world-env.
comment: Accepted to CVPR2026
Multiagent Systems
Generative adversarial imitation learning for robot swarms: Learning from human demonstrations and trained policies ICRA 2026
In imitation learning, robots are supposed to learn from demonstrations of the desired behavior. Most of the work in imitation learning for swarm robotics provides the demonstrations as rollouts of an existing policy. In this work, we provide a framework based on generative adversarial imitation learning that aims to learn collective behaviors from human demonstrations. Our framework is evaluated across six different missions, learning both from manual demonstrations and demonstrations derived from a PPO-trained policy. Results show that the imitation learning process is able to learn qualitatively meaningful behaviors that perform comparably to the provided demonstrations. Additionally, we deploy the learned policies on a swarm of TurtleBot 4 robots in real-robot experiments. The exhibited behaviors preserved their visually recognizable character and their performance is comparable to that achieved in simulation.
comment: Accepted for publication at the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
EvoSkill: Automated Skill Discovery for Multi-Agent Systems
Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g., prompts and code) that are tightly coupled to specific models and tasks. We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, where it improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a 12.1% gain (26.6% → 38.7%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to another: skills evolved on SealQA transfer zero-shot to BrowseComp, improving accuracy by 5.3% without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.
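Selection by Pareto frontier, as mentioned in the EvoSkill abstract above, keeps every candidate that no other candidate beats on all objectives at once. The sketch below shows a generic non-dominated filter. The two objectives, the candidate names, and the scores are invented for illustration. The abstract does not state which objectives EvoSkill's frontier uses.

```python
def dominates(a, b):
    # a dominates b if it is at least as good everywhere and strictly better somewhere
    # (higher is better for every objective).
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    # Keep candidates whose score tuple is not dominated by any other candidate.
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items() if other_name != name)
    }

# Hypothetical validation scores on two objectives, e.g. two task categories.
candidates = {
    "base": (0.50, 0.20),
    "skill_a": (0.70, 0.15),
    "skill_b": (0.55, 0.40),
    "skill_c": (0.52, 0.30),
}
front = pareto_front(candidates)
```

Here `skill_a` and `skill_b` survive because each is best on one objective, while `base` and `skill_c` are both dominated by `skill_b`.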
Generalized Per-Agent Advantage Estimation for Multi-Agent Policy Optimization AAMAS 2026
In this paper, we propose a novel framework for multi-agent reinforcement learning that enhances sample efficiency and coordination through accurate per-agent advantage estimation. The core of our approach is Generalized Per-Agent Advantage Estimator (GPAE), which employs a per-agent value iteration operator to compute precise per-agent advantages. This operator enables stable off-policy learning by indirectly estimating values via action probabilities, eliminating the need for direct Q-function estimation. To further refine estimation, we introduce a double-truncated importance sampling ratio scheme. This scheme improves credit assignment for off-policy trajectories by balancing sensitivity to the agent's own policy changes with robustness to non-stationarity from other agents. Experiments on benchmarks demonstrate that our approach outperforms existing approaches, excelling in coordination and sample efficiency for complex scenarios.
comment: Accepted at the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Credibility Governance: A Social Mechanism for Collective Self-Correction under Weak Truth Signals
Online platforms increasingly rely on opinion aggregation to allocate real-world attention and resources, yet common signals such as engagement votes or capital-weighted commitments are easy to amplify and often track visibility rather than reliability. This makes collective judgments brittle under weak truth signals, noisy or delayed feedback, early popularity surges, and strategic manipulation. We propose Credibility Governance (CG), a mechanism that reallocates influence by learning which agents and viewpoints consistently track evolving public evidence. CG maintains dynamic credibility scores for both agents and opinions, updates opinion influence via credibility-weighted endorsements, and updates agent credibility based on the long-run performance of the opinions they support, rewarding early and persistent alignment with emerging evidence while filtering short-lived noise. We evaluate CG in POLIS, a socio-physical simulation environment that models coupled belief dynamics and downstream feedback under uncertainty. Across settings with initial majority misalignment, observation noise and contamination, and misinformation shocks, CG outperforms vote-based, stake-weighted, and no-governance baselines, yielding faster recovery to the true state, reduced lock-in and path dependence, and improved robustness under adversarial pressure. Our implementation and experimental scripts are publicly available at https://github.com/Wanying-He/Credibility_Governance.
StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
Modern machine learning (ML) workloads increasingly rely on GPUs, yet achieving high end-to-end performance remains challenging due to dependencies on both GPU kernel efficiency and host-side settings. Although LLM-based methods show promise on automated GPU kernel generation, prior works mainly focus on single-kernel optimization and do not extend to end-to-end programs, hindering practical deployment. To address the challenge, in this work, we propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation, with three specialized agents: a Planner to orchestrate whole system design, a Coder dedicated to implementing it step-by-step, and a Verifier for correctness checking and performance profiling using Nsys/NCU. To fundamentally improve the Coder's ability in end-to-end GPU programming, StitchCUDA integrates rubric-based agentic reinforcement learning over two atomic skills, task-to-code generation and feedback-driven code optimization, with combined rubric reward and rule-based reward from real executions. Therefore, the Coder learns how to implement advanced CUDA programming techniques (e.g., custom kernel fusion, cublas epilogue), and we also effectively prevent the Coder's reward hacking (e.g., simply copying PyTorch code or hardcoding outputs) during benchmarking. Experiments on KernelBench show that StitchCUDA achieves nearly 100% success rate on end-to-end GPU programming tasks, with 1.72x better speedup than the multi-agent baseline and 2.73x better than the RL model baselines.
Social Norm Reasoning in Multimodal Language Models: An Evaluation
In Multi-Agent Systems (MAS), agents are designed with social capabilities, allowing them to understand and reason about social concepts such as norms when interacting with others (e.g., inter-robot interactions). In Normative MAS (NorMAS), researchers study how norms develop, and how violations are detected and sanctioned. However, existing research in NorMAS uses symbolic approaches (e.g., formal logic) for norm representation and reasoning whose application is limited to simplified environments. In contrast, Multimodal Large Language Models (MLLMs) present promising possibilities to develop software used by robots to identify and reason about norms in a wide variety of complex social situations embodied in text and images. However, prior work on norm reasoning has been limited to text-based scenarios. This paper investigates the norm reasoning competence of five MLLMs by evaluating their ability to answer norm-related questions based on thirty text-based and thirty image-based stories, and comparing their responses against humans. Our results show that MLLMs reason about norms better in text than in images. GPT-4o performs the best in both modalities, offering the most promise for integration with MAS, followed by the free model Qwen-2.5VL. Additionally, all models find reasoning about complex norms challenging.
comment: to be published in ICAART 2026 post proceedings
Molt Dynamics: Emergent Social Phenomena in Autonomous AI Agent Populations
MoltBook is a large-scale multi-agent coordination environment where over 770,000 autonomous LLM agents interact without human participation, offering the first opportunity we are aware of to observe emergent multi-agent coordination dynamics at this population scale. We introduce \textit{Molt Dynamics}: the emergent agent coordination behaviors, inter-agent communication dynamics, and role specialization patterns arising when autonomous agents operate as decentralized decision-makers in an unconstrained multi-agent environment. Through longitudinal observation of 90,704 active agents over three weeks, we characterize three aspects. First, spontaneous role specialization: network-based clustering reveals six structural roles (silhouette 0.91), though the result primarily reflects core-periphery organization -- 93.5\% of agents occupy a homogeneous peripheral cluster, with meaningful differentiation confined to the active minority. Second, decentralized information dissemination: cascade analysis of 10,323 inter-agent propagation events reveals power-law distributed cascade sizes ($\alpha = 2.57 \pm 0.02$) and saturating adoption dynamics where adoption probability shows diminishing returns with repeated exposures (Cox hazard ratio 0.53, concordance 0.78). Third, distributed cooperative task resolution: 164 multi-agent collaborative events show detectable coordination patterns, but success rates are low (6.7\%, $p = 0.057$) and cooperative outcomes are significantly worse than a matched single-agent baseline (Cohen's $d = -0.88$), indicating emergent cooperative behavior is nascent. These findings establish an empirical baseline for coordination dynamics in decentralized autonomous agent systems, with implications for multi-agent system design, agent communication protocol engineering, and AI safety.
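A power-law exponent such as the one reported for cascade sizes is commonly estimated by maximum likelihood. The sketch below uses the standard continuous estimator; the abstract does not say which estimator or cutoff the authors used, so the function name, the `x_min` cutoff, and the synthetic data are assumptions for illustration only.

```python
import math


def powerlaw_alpha_mle(sizes, x_min):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Uses alpha_hat = 1 + n / sum(ln(x_i / x_min)) over samples x_i >= x_min,
    and the approximate standard error (alpha_hat - 1) / sqrt(n).
    """
    tail = [x for x in sizes if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need at least one sample strictly above x_min")
    alpha = 1.0 + n / log_sum
    return alpha, (alpha - 1.0) / math.sqrt(n)


# Synthetic check: inverse-transform samples from a power law with alpha = 2.57.
true_alpha, x_min, n = 2.57, 1.0, 10000
samples = [
    x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (true_alpha - 1.0))
    for i in range(n)
]
alpha_hat, std_err = powerlaw_alpha_mle(samples, x_min)
print(round(alpha_hat, 2), round(std_err, 3))
```

Cascade sizes are integers, so a discrete estimator may be more appropriate for the real data; the continuous form is shown because it has a closed-form solution.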
Multi-Agent Influence Diagrams to Hybrid Threat Modeling
Western governments have adopted an assortment of counter-hybrid threat measures to defend against hostile actions below the conventional military threshold. The impact of these measures is unclear because of the ambiguity of hybrid threats, their cross-domain nature, and uncertainty about how countermeasures shape adversarial behavior. This paper offers a novel approach to clarifying this impact by unifying previously bifurcating hybrid threat modeling methods through a (multi-agent) influence diagram framework. The model balances the costs of countermeasures, their ability to dissuade the adversary from executing hybrid threats, and their potential to mitigate the impact of hybrid threats. We run 1000 semi-synthetic variants of a real-world-inspired scenario simulating the strategic interaction between attacking agent A and defending agent B over a cyber attack on critical infrastructure to explore the effectiveness of a set of five different counter-hybrid threat measures. Counter-hybrid measures range from strengthening resilience and denial of the adversary's ability to execute a hybrid threat to dissuasion through the threat of punishment. Our analysis primarily evaluates the overarching characteristics of counter-hybrid threat measures. This approach allows us to generalize the effectiveness of these measures and examine parameter impact sensitivity. In addition, we discuss policy relevance and outline future research avenues.
comment: The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology. 2025;0(0)
Safety Verification of Wait-Only Non-Blocking Broadcast Protocols
Broadcast protocols are programs designed to be executed by networks of processes. Each process runs the same protocol, and communication between them occurs synchronously in two ways: broadcast, where one process sends a message to all others, and rendez-vous, where one process sends a message to at most one other process. In both cases, communication is non-blocking, meaning the message is sent even if no process is able to receive it. We consider two coverability problems: the state coverability problem asks whether there exists a number of processes that allows reaching a given state of the protocol, and the configuration coverability problem asks whether there exists a number of processes that allows covering a given configuration. These two problems are known to be decidable and Ackermann-hard. We show that when the protocol is Wait-Only (i.e., it has no state from which a process can both send and receive messages), these problems become P-complete and PSPACE-complete, respectively.
comment: submitted to Fundamenta Informaticae
CoRL-MPPI: Enhancing MPPI With Learnable Behaviours For Efficient And Provably-Safe Multi-Robot Collision Avoidance
Decentralized collision avoidance is a core challenge for scalable multi-robot systems. One of the promising approaches to tackle this problem is Model Predictive Path Integral (MPPI) -- a framework that naturally handles arbitrary motion models and provides strong theoretical guarantees. Still, in practice an MPPI-based controller may produce suboptimal trajectories, as its performance relies heavily on uninformed random sampling. In this work, we introduce CoRL-MPPI, a novel fusion of Cooperative Reinforcement Learning and MPPI to address this limitation. We train an action policy (approximated as a deep neural network) in simulation that learns local cooperative collision avoidance behaviors. This learned policy is then embedded into the MPPI framework to guide its sampling distribution, biasing it towards more intelligent and cooperative actions. Notably, CoRL-MPPI preserves all the theoretical guarantees of regular MPPI. We evaluate our approach in dense, dynamic simulation environments against state-of-the-art baselines, such as ORCA, BVC, RL-RVO-NAV and classical MPPI. Our results demonstrate that CoRL-MPPI significantly improves navigation efficiency (measured by success rate and makespan) and safety, enabling agile and robust multi-robot navigation.
comment: The manuscript includes 9 pages, 5 figures, and 1 table. This replacement revises and extends the original submission. The updated version adds a validation in Gazebo. It also expands the experimental evaluation by adding baselines and an evaluation scenario. In addition, the cost functions in MPPI-based methods were refined, leading to improved experimental performance
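The idea of guiding MPPI sampling with a learned policy can be sketched as a single importance-weighted update whose sampling mean comes from the policy rather than from an uninformed guess. This is a simplified illustration, not the CoRL-MPPI algorithm: the function name, the scalar controls, and the toy cost are assumptions, and the control-cost correction term used in full MPPI formulations is omitted.

```python
import math
import random


def mppi_step(mean_controls, rollout_cost, num_samples=256, sigma=0.5,
              temperature=1.0, rng=None):
    """One MPPI-style update of a scalar control sequence.

    `mean_controls` is the centre of the sampling distribution; in a
    policy-guided variant it would be proposed by the learned policy.
    `rollout_cost` maps a control sequence to a scalar trajectory cost.
    """
    rng = rng or random.Random(0)
    samples, costs = [], []
    for _ in range(num_samples):
        # Perturb the mean with Gaussian noise and score the rollout.
        controls = [m + rng.gauss(0.0, sigma) for m in mean_controls]
        samples.append(controls)
        costs.append(rollout_cost(controls))
    # Softmin weights; subtracting the best cost keeps exp() well scaled.
    best = min(costs)
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    return [
        sum(w * s[t] for w, s in zip(weights, samples)) / total
        for t in range(len(mean_controls))
    ]


def toy_cost(controls):
    # Hypothetical cost: the best control is 1.0 at every step.
    return sum((u - 1.0) ** 2 for u in controls)


uninformed = mppi_step([0.0, 0.0, 0.0], toy_cost)
policy_guided = mppi_step([0.8, 0.8, 0.8], toy_cost)
print(toy_cost(uninformed), toy_cost(policy_guided))
```

With the same sample budget, a mean that already sits near low-cost controls wastes fewer samples on poor rollouts, which is the intuition behind biasing the sampling distribution.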
Computing Evolutionarily Stable Strategies in Multiplayer Games
We present an algorithm for computing all evolutionarily stable strategies in nondegenerate normal-form games with three or more players.
comment: Reverting to original title after fixing Google scholar merge
Strategic Concealment of Environment Representations in Competitive Games
This paper investigates the strategic concealment of environment representations used by players in competitive games. We consider a defense scenario in which one player (the Defender) seeks to infer and exploit the representation used by the other player (the Attacker). The interaction between the two players is modeled as a Bayesian game: the Defender infers the Attacker's representation from its trajectory and places barriers to obstruct the Attacker's path towards its goal, while the Attacker obfuscates its representation type to mislead the Defender. We solve for the Perfect Bayesian Nash Equilibrium via a bilinear program that integrates Bayesian inference, strategic planning, and belief manipulation. Simulations show that purposeful concealment naturally emerges: the Attacker randomizes its trajectory to manipulate the Defender's belief, inducing suboptimal barrier selections and thereby gaining a strategic advantage.
HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatrics ICLR 2026
Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language models (LLMs) provides a new path to achieve this goal. However, existing LLM-based drama generation methods often produce models that lack initiative and cannot interact with the physical scene, while typically requiring detailed user input that diminishes the immersion of live performance. To address these challenges, we propose HAMLET, a hierarchical adaptive multi-agent framework focused on drama creation and real-time online performance. Given a simple topic, the framework first generates a narrative blueprint to guide the subsequent improvisational performance. In the online performance phase, each actor is equipped with an adaptive reasoning module that enables decision-making based on their personas, memories, goals, and emotional states during complex group chat scenarios. Beyond dialogue, actor agents engage in embodied interactions by changing the state of scene props through actions such as opening a letter or picking up a weapon, which are broadcast to update the global environmental context. To objectively assess the quality of live embodied theatrics, we establish a comprehensive evaluation method and introduce HAMLETJudge, a specialized critic model for automated evaluation. Experimental results demonstrate that HAMLET excels in creating expressive, coherent, and physically interactive theatrical experiences in an autonomous manner.
comment: Accepted to the Fourteenth International Conference on Learning Representations (ICLR 2026)
Agile Flight Emerges from Multi-Agent Competitive Racing
Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: https://github.com/Jirl-upenn/AgileFlight_MultiAgent
Systems and Control (EESS)
How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.
comment: Project page can be found at https://toruowo.github.io/peel
Can a Learner Regret Using a No-Regret Algorithm? A Control-Theoretic Study of Performance Dominance
No-regret learning dynamics ensure that a learner asymptotically achieves an average reward no worse than that of any fixed strategy. This no-regret guarantee does not determine the value of the asymptotic average reward. Indeed, it is possible for different no-regret learning dynamics to exhibit different asymptotic average rewards when facing the same environment, even though both satisfy the no-regret guarantee. This paper asks whether a "free-lunch" phenomenon can arise among no-regret algorithms. Namely, is it possible for one no-regret learning rule to uniformly outperform another no-regret learning rule across all payoff environments? Stated differently, can a learner regret not using a particular no-regret algorithm? We consider generalized replicator dynamics (RD) as a cascade interconnection between a linear time-invariant (LTI) system and the softmax nonlinearity. Varying this LTI system leads to different realizations of replicator dynamics, including so-called anticipatory RD, exponential RD, and other forms of higher-order RD. Setting the LTI system to be an integrator realizes standard RD, which is known to satisfy the no-regret property. Within this framework, we analyze and compare various realizations of generalized RD obtained by varying the LTI system. We first formulate performance comparison as a passivity property of an associated comparison system and establish "local" dominance results, i.e., comparing the asymptotic performance near an equilibrium payoff vector. We then cast performance comparison between a form of anticipatory RD and standard RD as an optimal-control problem. We show that the minimal achievable cumulative reward gap is zero, thereby establishing global dominance of anticipatory RD across all payoff environments and establishing a "free lunch" among no-regret learning dynamics.
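The statement that an integrator followed by softmax realizes standard RD can be checked with a short calculation. The symbols below are introduced here for illustration and are not taken from the paper: $p(t)$ is the payoff vector, $z$ the integrator state, $x$ the mixed strategy, and $\sigma$ the softmax map.

```latex
\begin{aligned}
\dot z &= p(t), \qquad x = \sigma(z), \qquad
\sigma_i(z) = \frac{e^{z_i}}{\sum_j e^{z_j}}, \\
\frac{\partial \sigma_i}{\partial z_k} &= x_i\,(\delta_{ik} - x_k), \\
\dot x_i &= \sum_k \frac{\partial \sigma_i}{\partial z_k}\,\dot z_k
          = x_i \Big( p_i - \sum_k x_k\, p_k \Big).
\end{aligned}
```

The last line is the familiar replicator equation: a strategy's share grows when its payoff exceeds the population average. Replacing the integrator with another LTI system changes how $z$ depends on the payoff history while leaving the softmax output stage unchanged.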
Deep Q-Learning-Based Gain Scheduling for Nonlinear Quadcopter Dynamics
This paper presents a deep Q-network (DQN)-based gain-scheduling framework for safety-critical quadcopter trajectory tracking. Instead of directly learning control inputs, the proposed approach selects from a finite set of pre-certified stabilizing gain vectors, enabling reinforcement learning to operate within a structured and stability-preserving control architecture. By exploiting the isotropic structure of the translational dynamics, feedback gains are shared across spatial axes to reduce dimensionality while preserving performance. The learned policy adapts feedback aggressiveness in real time, applying high authority during large transients and reducing gains near convergence to limit control effort. Simulation results using a high-fidelity nonlinear quadcopter model demonstrate accurate trajectory tracking, bounded attitude excursions, smooth transition to hover after the final time, and consistent reward improvement, validating the effectiveness and robustness of the proposed learning-based gain scheduling strategy.
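Selecting from a finite set of pre-certified gains, instead of outputting control inputs directly, amounts to a discrete action choice over gain vectors. The sketch below shows only that selection step under assumed names: `q_values` stands in for the output of a trained Q-network, and the (kp, kd) pairs are hypothetical placeholders rather than gains from the paper.

```python
import random


def select_gain(q_values, gain_set, epsilon=0.0, rng=random):
    """Pick a gain vector from a finite, pre-certified set.

    With probability `epsilon` a random gain is explored; otherwise the
    gain with the highest Q-value is chosen. Because every candidate is
    assumed to be stabilizing, exploration never leaves the certified set.
    """
    if rng.random() < epsilon:
        return rng.choice(gain_set)
    best_index = max(range(len(gain_set)), key=lambda i: q_values[i])
    return gain_set[best_index]


# Hypothetical (kp, kd) pairs shared across the three translational axes.
gains = [(2.0, 1.0), (6.0, 3.0), (12.0, 5.0)]
print(select_gain([0.1, 0.9, 0.4], gains))
```

Restricting the action space this way confines what the learner can do to choices that were certified offline.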
Safe and Robust Domains of Attraction for Discrete-Time Systems: A Set-Based Characterization and Certifiable Neural Network Estimation
Analyzing nonlinear systems with attracting robust invariant sets (RISs) requires estimating their domains of attraction (DOAs). Despite extensive research, accurately characterizing DOAs for general nonlinear systems remains challenging due to both theoretical and computational limitations, particularly in the presence of uncertainties and state constraints. In this paper, we propose a novel framework for the accurate estimation of safe (state-constrained) and robust DOAs for discrete-time nonlinear uncertain systems with continuous dynamics, open safe sets, compact disturbance sets, and uniformly locally $\ell_p$-stable compact RISs. The notion of uniform $\ell_p$ stability is quite general and encompasses, as special cases, uniform exponential and polynomial stability. The DOAs are characterized via newly introduced value functions defined on metric spaces of compact sets. We establish their fundamental mathematical properties and derive the associated Bellman-type (Zubov-type) functional equations. Building on this characterization, we develop a physics-informed neural network (NN) framework to learn the corresponding value functions by embedding the derived Bellman-type equations directly into the training process. To obtain certifiable estimates of the safe robust DOAs from the learned neural approximations, we further introduce a verification procedure that leverages existing formal verification tools. The effectiveness and applicability of the proposed methodology are demonstrated through four numerical examples involving nonlinear uncertain systems subject to state constraints, and its performance is compared with existing methods from the literature.
Stability properties of Minimal Gated Unit neural networks
In this work, we address the need for efficient and formally stable Recurrent Neural Networks (RNNs) in environments with limited computational resources by analyzing the stability of the Minimal Gated Unit (MGU) network, a lightweight alternative to common gated RNNs used in system identification. We derive sufficient parametric conditions for the MGU network's input-to-state stability and incremental input-to-state stability properties. These conditions enable a-posteriori validation of model stability and form the basis for novel stability-promoting training methodologies, including a warm-start of the network's parameters and a projected gradient-based optimization scheme, both of which are presented in this work. Comparative evaluation, including robustness analysis and validation on synthetic and real-world data (i.e., the Silverbox benchmark), demonstrates that the minimal gated unit network successfully combines formal stability guarantees with superior parameter efficiency and faster inference times compared to other state-of-the-art recurrent neural networks, while maintaining comparable and satisfactory accuracy. Notably, the results attained on the Silverbox benchmark illustrate that the stable MGU network effectively captures the system dynamics, whereas other stable RNNs fail to converge to a reliable model.
comment: Preprint submitted to Automatica. 16 pages, 6 figures and 1 table. MATLAB code for the proposed methodologies is available at: https://github.com/StefanoDeCarli/MGU_dISS.git
Grid-Forming Control with Assignable Voltage Regulation Guarantees and Safety-Critical Current Limiting
This paper develops a nonlinear grid-forming (GFM) controller with provable voltage-formation guarantees, with over-current limiting enforced via a control-barrier-function (CBF)-based safety filter. The nominal controller follows a droop-based inner-outer architecture, in which voltage references and frequency are generated by droop laws, an outer-loop voltage controller produces current references using backstepping (BS), and an inner-loop current controller synthesizes the terminal voltage. The grid voltage is treated as an unknown bounded disturbance, without requiring knowledge of its bound, and the controller design does not rely on any network parameters beyond the point of common coupling (PCC). To robustify voltage formation against the grid voltage, a deadzone-adapted disturbance suppression (DADS) framework is incorporated, yielding practical voltage regulation characterized by asymptotic convergence of the PCC voltage errors to an assignably small and known residual set. Furthermore, the closed-loop system is proven to be globally well posed, with all physical and adaptive states bounded and voltage error transients (due to initial conditions) decaying exponentially at an assignable rate. On top of the nominal controller, hard over-current protection is achieved through a minimally invasive CBF-based safety filter that enforces strict current limits via a single-constraint quadratic program. The safety filter is compatible with any locally Lipschitz nominal controller. Rigorous analysis establishes forward invariance of the safe-current set and boundedness of all states under current limiting. Numerical results demonstrate improved transient performance and faster recovery during current-limiting events when the proposed DADS-BS controller is used as the nominal control law, compared with conventional PI-based GFM control.
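A quadratic program with a single affine constraint has a closed-form solution, which is why a single-constraint safety filter is cheap to evaluate. The sketch below solves min ||u - u_nom||^2 subject to a . u <= b. It is a generic illustration: in a CBF filter, `a` and `b` would be derived from the barrier function and the system dynamics, which the abstract does not give, so the values used here are hypothetical.

```python
def safety_filter(u_nom, a, b):
    """Minimally modify u_nom so that the constraint a . u <= b holds.

    Returns u_nom unchanged when it already satisfies the constraint;
    otherwise projects it onto the constraint boundary.
    """
    violation = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if violation <= 0.0:
        return list(u_nom)  # nominal input is already safe
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible: a is zero and b < 0")
    scale = violation / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]


# Hypothetical constraint: the first input component must not exceed 1.0.
print(safety_filter([2.0, 0.0], [1.0, 0.0], 1.0))
print(safety_filter([0.5, 0.3], [1.0, 0.0], 1.0))
```

The filter is minimally invasive in the sense that it leaves the nominal input untouched whenever the constraint is inactive.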
Exact Moment Estimation of Stochastic Differential Dynamics
Moment estimation for stochastic differential equations (SDEs) is fundamental to the formal reasoning and verification of stochastic dynamical systems, yet remains challenging and is rarely available in closed form. In this paper, we study time-homogeneous SDEs with polynomial drift and diffusion, and investigate when their moments can be computed exactly. We formalize the notion of moment-solvable SDEs and propose a generic symbolic procedure that, for a given monomial, attempts to construct a finite linear ordinary differential equation (ODE) system governing its moment, thereby enabling exact computation. We introduce a syntactic class of pro-solvable SDEs, characterized by a block-triangular structure, and prove that all polynomial moments of any pro-solvable SDE admit such finite ODE representations. This class strictly generalizes linear SDEs and includes many nonlinear models. Experimental results demonstrate the effectiveness of our approach.
comment: 21 pages, 1 table. Accepted by FM 2026
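A standard linear example shows what a finite linear ODE for a moment looks like. The computation below uses geometric Brownian motion with constants $a$ and $b$ chosen here for illustration; it is a textbook instance of the linear case that the abstract says is generalized, not an example taken from the paper.

```latex
\begin{aligned}
dX_t &= a X_t\,dt + b X_t\,dW_t, \\
d\big(X_t^n\big) &= \Big(n a + \tfrac{1}{2}\,n(n-1)\,b^2\Big) X_t^n\,dt + n b\,X_t^n\,dW_t
  \quad \text{(It\^o's formula)}, \\
\frac{d}{dt}\,\mathbb{E}\big[X_t^n\big] &= \Big(n a + \tfrac{1}{2}\,n(n-1)\,b^2\Big)\,\mathbb{E}\big[X_t^n\big], \\
\mathbb{E}\big[X_t^n\big] &= \mathbb{E}\big[X_0^n\big]\,
  \exp\!\Big(\big(n a + \tfrac{1}{2}\,n(n-1)\,b^2\big)\,t\Big).
\end{aligned}
```

Here the ODE for the $n$-th moment involves only that moment, so the system closes at dimension one. For general polynomial drift and diffusion, the equation for one moment typically involves higher-order moments, so the system need not close at any finite size.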
Optimum Battery Depth of Discharge of Stand-alone Hybrid System Using the MOPSO Method
This paper presents an optimized design of a Standalone Solar PV/Battery (SSPVB) system to address energy reliability and cost efficiency challenges in off-grid environments. The proposed system integrates a Multi-Objective Particle Swarm Optimization (MOPSO) approach and validates the results using the Non-Dominated Sorting Genetic Algorithm II (NSGA-II). The optimization process aims to minimize both the Cost of Energy (COE) and Loss of Load Probability (LLP), while examining the effects of Battery Depth of Discharge (DOD) on system reliability and lifecycle cost. Results indicate that an optimal DOD of approximately 70% yields a COE of 0.2059 USD/kWh with zero LLP, demonstrating strong reliability and cost-effectiveness. Comparative analysis shows that both MOPSO and NSGA-II methods achieve consistent outcomes, with MOPSO exhibiting faster convergence. The study provides valuable insights into optimal battery sizing for stand-alone systems, contributing to modern optimization practices in renewable energy applications.
cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization
Robust trajectory optimization enables autonomous systems to operate safely under uncertainty by computing control policies that satisfy the constraints for all bounded disturbances. However, these problems often lead to large Second Order Conic Programming (SOCP) constraints, which are computationally expensive. In this work, we propose the CUDA Nonlinear Robust Trajectory Optimization (cuNRTO) framework by introducing two dynamic optimization architectures that have direct application to robust decision-making and are implemented on CUDA. The first architecture, NRTO-DR, leverages the Douglas-Rachford (DR) splitting method to solve the SOCP inner subproblems of NRTO, thereby significantly reducing the computational burden through parallel SOCP projections and sparse direct solves. The second architecture, NRTO-FullADMM, is a novel variant that further exploits the problem structure to improve scalability using the Alternating Direction Method of Multipliers (ADMM). Finally, we provide a GPU implementation of the proposed methodologies using custom CUDA kernels for SOC projection steps and cuBLAS GEMM chains for feedback gain updates. We validate the performance of cuNRTO through simulated experiments on unicycle, quadcopter, and Franka manipulator models, demonstrating speedups of up to 139.6$\times$.
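The SOC projection step has a well-known closed form, which makes it a natural candidate for parallel execution, one cone per thread. The sketch below is a plain CPU version in Python for readability, not the paper's CUDA kernel; the function name and the (x, t) argument layout are assumptions.

```python
import math


def project_soc(x, t):
    """Euclidean projection of (x, t) onto the cone {(x, t): ||x|| <= t}."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= t:
        return list(x), t            # already inside the cone
    if norm <= -t:
        return [0.0] * len(x), 0.0   # inside the polar cone: project to the apex
    # Otherwise project onto the boundary of the cone.
    scale = (norm + t) / (2.0 * norm)
    return [scale * xi for xi in x], scale * norm


print(project_soc([3.0, 4.0], 0.0))
```

Each projection touches only its own block of variables, so many cone constraints can be projected independently.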
Joint Optimization of Model Partitioning and Resource Allocation for Anti-Jamming Collaborative Inference Systems
With the increasing computational demands of deep neural network (DNN) inference on resource-constrained devices, DNN partitioning-based device-edge collaborative inference has emerged as a promising paradigm. However, the transmission of intermediate feature data is vulnerable to malicious jamming, which significantly degrades the overall inference performance. To counter this threat, this letter focuses on an anti-jamming collaborative inference system in the presence of a malicious jammer. In this system, a DNN model is partitioned into two distinct segments, which are executed by wireless devices and edge servers, respectively. We first analyze the effects of jamming and DNN partitioning on inference accuracy via data regression. Based on this, our objective is to maximize the system's revenue of delay and accuracy (RDA) under inference accuracy and computing resource constraints by jointly optimizing computation resource allocation, devices' transmit power, and DNN partitioning. To address the mixed-integer nonlinear programming problem, we propose an efficient alternating optimization-based algorithm, which decomposes the problem into three subproblems that are solved via Karush-Kuhn-Tucker conditions, convex optimization methods, and a quantum genetic algorithm, respectively. Extensive simulations demonstrate that our proposed scheme outperforms baselines in terms of RDA.
Robust Hybrid Finite-Time Parameter Estimation Without Persistence of Excitation
In this paper, we consider the problem of estimating parameters of a linear regression model. Using a hybrid systems framework, a hybrid algorithm is proposed allowing the estimate to converge to the exact value of the unknown parameters in predetermined finite time. Interestingly, we show that for the case of constant parameters, the convergence property of the hybrid algorithm holds while only requiring the regressor to be exciting on a given interval. For the case of piecewise constant parameters, the classical persistency of excitation condition is required to guarantee the convergence. Robustness of the proposed algorithm with respect to measurement noise is analysed. Finally, illustrative examples are provided showing the merits of the proposed approach in terms of scalability and applicability to the general class of time-varying unknown parameters.
Contractor-Expander and Universal Inverse Optimal Positive Nonlinear Control
For general control-affine nonlinear systems in the positive orthant, and with positive controls, we show how strict CLFs can be utilized for inverse optimal stabilization. Conventional ``LgV'' inverse optimal feedback laws, for systems with unconstrained states and controls, assume sign-unconstrained inputs and input penalties that are class-K in the input magnitude, hence symmetric about zero. Such techniques do not extend to positive-state-and-control systems. Major customizations are needed, and introduced in this paper, for positive systems where highly asymmetric (or unconventionally symmetric) costs not only on the state but also on control are necessary. For the predator-prey positive-state positive-input benchmark system, with a strict CLF built in our previous paper, we prototype two inverse optimal methodological frameworks that employ particular ``contractor and expander functions.'' One framework (A) employs a triple consisting of a CLF, a stabilizing feedback, and an expander, whereas the other framework (B) employs a pair of a CLF and a contractor function. Both frameworks yield inverse optimal stabilizer constructions, on positive orthants of arbitrary dimensions. Framework B demands more design effort than framework A but is free of conditions that may fail to hold in general. Biological interpretation for the predator-prey model illuminates that such inverse optimal control constructions are bio-ecologically meaningful. In addition to general frameworks, we present one fully explicit design: a Sontag-like universal formula for inverse optimal stabilization of positive-orthant systems by positive feedback.
Event-Driven Safe and Resilient Control of Automated and Human-Driven Vehicles under EU-FDI Attacks
This paper studies the safe and resilient control of Connected and Automated Vehicles (CAVs) operating in mixed traffic environments where they must interact with Human-Driven Vehicles (HDVs) under uncertain dynamics and exponentially unbounded false data injection (EU-FDI) attacks. These attacks pose serious threats to safety-critical applications. While resilient control strategies can mitigate adversarial effects, they often overlook collision avoidance requirements. Conversely, safety-focused approaches tend to assume nominal operating conditions and lack resilience to adversarial inputs. To address these challenges, we propose a control framework that integrates event-driven Control Barrier Functions (CBFs) and Control Lyapunov Functions (CLFs) with adaptive attack-resilient control. The framework further incorporates data-driven estimation of HDV behaviors to ensure safety and resilience against EU-FDI attacks. Specifically, we focus on the lane-changing maneuver of CAVs in the presence of unpredictable HDVs and EU-FDI attacks on acceleration inputs. The event-driven approach reduces computational load while maintaining real-time safety guarantees. Simulation results, including comparisons with pure event-driven methods lacking resilience, validate the effectiveness and robustness of the proposed event-driven safe and resilient (EDSR) framework in achieving collision-free maneuvers, stable velocity regulation, and resilient operation under adversarial conditions.
Joint Estimation of Dynamic O-D Demand and Choice Models for Dynamic Multi-modal Networks: Computational Graph-Based Learning and Hypothesis Tests
Understanding travel demand and behavior, particularly route and mode choices, is critical for effective transportation planning and policy design in multi-modal systems with emerging mobility options. Multi-modal system-level data, such as traffic counts, probe speeds, and transit ridership, offer scalable, cost-effective, and privacy-preserving advantages for inferring and analyzing travel behavior. This research uses such system-level data to infer travel demand and choices that vary by time of day, origin/destination location, and mode. Existing studies focus on a single transportation mode, consider limited behavioral factors in disutility functions, rely on static travel time functions, and face computational challenges when applied to large-scale networks. This research addresses these gaps by proposing a joint estimation framework for dynamic origin-destination demand and disutility functions within a multi-modal transportation system that includes both private driving and public transit, using multi-source system-level data. A multi-modal dynamic traffic assignment model that accounts for both route and mode choices is integrated into the framework, with detailed travel time modeling for multiple modes. Alternative-specific and zone-specific factors are incorporated into generic disutility functions to capture heterogeneous traveler perceptions. The estimation problem is formulated and solved using a computational graph-based approach, enabling dynamic network modeling and scalable inference across large-scale networks and diverse data sets. Furthermore, we propose a hypothesis testing framework tailored to this complex estimation setting to assess the statistical significance of behavioral parameters, thereby enabling model selection and statistically rigorous insights for real-world applications.
Scalar-Measurement Attitude Estimation on $\mathbf{SO}(3)$ with Bias Compensation ICRA 2026
Attitude estimation methods typically rely on full vector measurements from inertial sensors such as accelerometers and magnetometers. This paper shows that reliable estimation can also be achieved using only scalar measurements, which naturally arise either as components of vector readings or as independent constraints from other sensing modalities. We propose nonlinear deterministic observers on $\mathbf{SO}(3)$ that incorporate gyroscope bias compensation and guarantee uniform local exponential stability under suitable observability conditions. A key feature of the framework is its robustness to partial sensing: accurate estimation is maintained even when only a subset of vector components is available. Experimental validation on the BROAD dataset confirms consistent performance across progressively reduced measurement configurations, with estimation errors remaining small even under severe information loss. To the best of our knowledge, this is the first work to establish fundamental observability results showing that two scalar measurements under suitable excitation suffice for attitude estimation, and that three are enough in the static case. These results position scalar-measurement-based observers as a practical and reliable alternative to conventional vector-based approaches.
comment: 9 pages, 4 figures. Submitted to ICRA 2026
Safety-Centered Scenario Generation for Autonomous Vehicles
This paper presents a scenario generation framework that creates diverse, parametrized, and safety-critical driving situations to validate the safety features of autonomous vehicles in simulation [15]. By modeling factors such as road geometry, traffic participants, environmental conditions, and perception uncertainties, the framework enables repeatable and scalable testing of safety mechanisms, including emergency braking, evasive maneuvers, and vulnerable road user protection. The framework supports both regulatory and edge case scenarios, mapped to hazards and safety goals derived from Hazard Analysis and Risk Assessment (HARA), ensuring traceability to ISO 26262 functional safety requirements and performance limitations. The output from these simulations provides quantitative safety metrics such as time-to-collision, minimum distance, braking and steering performance, and residual collision severity. These metrics enable the systematic evaluation of evasive maneuvering as a safety feature, while highlighting system limitations and edge-case vulnerabilities. Integration of scenario-based simulation with safety engineering principles offers accelerated validation cycles, improved test coverage at reduced cost, and stronger evidence for regulatory and stakeholder confidence.
comment: To be presented at SAE 2026
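Two of the safety metrics named in the abstract, time-to-collision and minimum distance, can be sketched as follows. This is only an illustration under a constant-velocity, same-lane assumption; the function names and argument conventions are hypothetical and are not taken from the paper's framework.

```python
def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC: gap divided by closing speed.

    Returns infinity when the ego vehicle is not closing on the lead vehicle.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return float("inf")
    return gap_m / closing_speed

def minimum_distance(ego_positions, other_positions):
    """Smallest Euclidean distance between two time-aligned 2-D trajectories."""
    return min(
        ((ex - ox) ** 2 + (ey - oy) ** 2) ** 0.5
        for (ex, ey), (ox, oy) in zip(ego_positions, other_positions)
    )

# Hypothetical scenario: 30 m gap, ego at 20 m/s, lead vehicle at 10 m/s.
ttc = time_to_collision(30.0, 20.0, 10.0)
```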
Real-time tightly coupled GNSS and IMU integration via Factor Graph Optimization
Reliable positioning in dense urban environments remains challenging due to frequent GNSS signal blockage, multipath, and rapidly varying satellite geometry. While factor graph optimization (FGO)-based GNSS-IMU fusion has demonstrated strong robustness and accuracy, most formulations remain offline. In this work, we present a real-time tightly coupled GNSS-IMU FGO method that enables causal state estimation via incremental optimization with fixed-lag marginalization, and we evaluate its performance in a highly urbanized GNSS-degraded environment using the UrbanNav dataset.
Real-time loosely coupled GNSS and IMU integration via Factor Graph Optimization
Accurate positioning, navigation, and timing (PNT) is fundamental to the operation of modern technologies and a key enabler of autonomous systems. A central component of PNT is the Global Navigation Satellite System (GNSS), which provides outdoor positioning. Recent research has improved GNSS localization performance by fusing GNSS measurements with other sensory information, mainly measurements from Inertial Measurement Units (IMUs). In this paper, we propose a loosely coupled architecture to integrate GNSS and IMU measurements using a Factor Graph Optimization (FGO) framework. Because the FGO method can be computationally demanding and is often used as a post-processing method, our focus is on assessing its localization accuracy and service availability while operating in real time in challenging environments (urban canyons). Experimental results on the UrbanNav-HK-MediumUrban-1 dataset show that the proposed approach achieves real-time operation and increased service availability compared to batch FGO methods. While this improvement comes at the cost of reduced positioning accuracy, the paper provides a detailed analysis of the trade-offs between accuracy, availability, and computational efficiency that characterize real-time FGO-based GNSS/IMU fusion.
Designing Barrier Functions for Graceful Safety Control
This paper examines the problem of achieving "grace" when controlling dynamical systems for safety, which is defined in terms of providing multi-layered safety assurances. Namely, two safety layers are created: a primary layer that represents a desirable degree of safety, and a secondary failsafe layer. Graceful control then involves ensuring that even if the primary layer is breached, the failsafe layer remains forward invariant. The paper pursues this goal by constructing a safety constraint that combines the concepts of zeroing and reciprocal control barrier functions with regard to the primary and secondary safe sets, respectively. This constraint is analogous to a stiffening spring, making it possible to construct energy-based analytical proofs of the resulting graceful safety guarantees. The proposed approach is developed for systems with a relative degree of either 1 or 2, the latter case being particularly useful for mechanical systems. We demonstrate the applicability of the method using a wall collision avoidance example. This demonstration highlights the benefits of the proposed approach compared to traditional benchmarks from the literature.
comment: 15 pages, 14 figures
DKD-KAN: A Lightweight knowledge-distilled KAN intrusion detection framework, based on MLP and KAN
Cyber-security systems often operate in resource-constrained environments, such as edge environments and real-time monitoring systems, where model size and inference time are crucial. We propose a lightweight intrusion detection framework that combines the ability of the Kolmogorov-Arnold Network (KAN) to capture complex features in the data with the efficiency of the decoupled knowledge distillation (DKD) training approach. A high-capacity KAN network is first trained to detect attacks performed on the test bed. This model then serves as a teacher to guide a much smaller multilayer perceptron (MLP) student model via DKD. The resulting DKD-MLP model contains only 2,522 and 1,622 parameters for the WADI and SWaT datasets, respectively, significantly fewer than the KAN teacher model. This makes it well suited for deployment on devices with limited computational resources. Despite its small size, the student model maintains high performance. Our approach demonstrates the practicality of using KAN as a knowledge-rich teacher to train much smaller student models without a considerable drop in intrusion detection accuracy. We have validated our approach on two publicly available datasets. We report F1-score improvements of 4.18% on WADI and 3.07% on SWaT when using the DKD-MLP model, compared to the bare student model. The implementation of this paper is available on our GitHub repository.
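For readers unfamiliar with DKD, the loss is commonly described as a weighted sum of a target-class term (TCKD) and a non-target-class term (NCKD). The following is a minimal pure-Python sketch of that general formulation, not the authors' implementation; the weights `alpha` and `beta`, the temperature, and all function names are placeholder assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dkd_loss(student_logits, teacher_logits, target,
             alpha=1.0, beta=8.0, temperature=4.0):
    """Decoupled KD loss: alpha * TCKD + beta * NCKD (placeholder weights)."""
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    # TCKD: binary distribution (target class vs. all other classes).
    tckd = kl_divergence(
        [p_teacher[target], 1.0 - p_teacher[target]],
        [p_student[target], 1.0 - p_student[target]],
    )
    # NCKD: distribution over non-target classes only, renormalized.
    rest_t = [p for i, p in enumerate(p_teacher) if i != target]
    rest_s = [p for i, p in enumerate(p_student) if i != target]
    rest_t = [p / sum(rest_t) for p in rest_t]
    rest_s = [p / sum(rest_s) for p in rest_s]
    nckd = kl_divergence(rest_t, rest_s)
    return (alpha * tckd + beta * nckd) * temperature ** 2

# Hypothetical three-class example with an uninformed student.
loss = dkd_loss([1.0, 1.0, 1.0], [4.0, 1.0, 0.0], target=0)
```

The loss vanishes when student and teacher logits coincide, and decoupling the two terms lets `beta` weight the non-target knowledge independently of the teacher's confidence in the target class.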
Multidisciplinary Design Optimization of a Low-Thrust Asteroid Orbit Insertion Using Electric Propulsion
Low-thrust electric propulsion missions are often designed under simplifying assumptions such as constant thrust or fixed specific impulse, neglecting the strong coupling between trajectory dynamics, spacecraft power availability, and propulsion performance. In deep-space environments with reduced solar irradiance, these assumptions can lead to suboptimal or infeasible designs, underscoring the need to simultaneously optimize the trajectory and power subsystem. This paper presents a multidisciplinary design optimization (MDO) framework for the simultaneous design of low-thrust trajectories and spacecraft power systems, with explicit coupling to electric propulsion performance. The framework incorporates a high-fidelity variable-specific impulse model of the SPT-140 Hall thruster, in which thrust and efficiency are directly constrained by time-varying solar power availability and solar array degradation, rather than treated as fixed parameters. The coupled problem is posed as a time-optimal control problem and addressed using a framework built on top of OpenMDAO and Dymos toolchains, where Dymos employs a collocation-based direct-transcription approach for trajectory optimization. OpenMDAO provides accurate analytic partial derivatives, enabling efficient gradient-based optimization. A Fast Fourier Series shape-based method is used to generate dynamically feasible initial guess trajectories, and the resulting nonlinear programming problem is solved using IPOPT. The proposed framework is demonstrated through a low-thrust orbit insertion scenario around asteroid 16-Psyche, a regime in which reduced solar irradiance makes power-aware trajectory design particularly critical. Simulation results demonstrate the framework's ability to capture key power-propulsion-trajectory trade-offs, highlighting the importance of integrated power optimization for realistic electric propulsion mission design.
comment: 11 pages, 2 figures
Safe Payload Transfer with Ship-Mounted Cranes: A Robust Model Predictive Control Approach
Ensuring safe real-time control of ship-mounted cranes in unstructured transportation environments requires handling multiple safety constraints while maintaining effective payload transfer performance. Unlike traditional crane systems, ship-mounted cranes are consistently subjected to significant external disturbances affecting underactuated crane dynamics due to the ship's dynamic motion response to harsh sea conditions, which can lead to robustness issues. To tackle these challenges, we propose a robust and safe model predictive control (MPC) framework and demonstrate it on a 5-DOF crane system, where a Stewart platform simulates the external disturbances that ocean surface motions would have on the supporting ship. The crane payload transfer operation must avoid obstacles and accurately place the payload within a designated target area. We use a robust zero-order control barrier function (R-ZOCBF)-based safety constraint in the nonlinear MPC to ensure safe payload positioning, while time-varying bounding boxes are utilized for collision avoidance. We introduce a new optimization-based online robustness parameter adaptation scheme to reduce the conservativeness of R-ZOCBFs. Experimental trials on a crane prototype demonstrate the overall performance of our safe control approach under significant perturbing motions of the crane base. While our focus is on crane-facilitated transfer, the methods more generally apply to safe robotically-assisted parts mating and parts insertion.
Quantitative Monitoring of Signal First-Order Logic
Runtime monitoring checks, during execution, whether a partial signal produced by a hybrid system satisfies its specification. Signal First-Order Logic (SFO) offers expressive real-time specifications over such signals, but currently comes only with Boolean semantics and has no tool support. We provide the first robustness-based quantitative semantics for SFO, enabling the expression and evaluation of rich real-time properties beyond the scope of existing formalisms such as Signal Temporal Logic. To enable online monitoring, we identify a past-time fragment of SFO and give a pastification procedure that transforms bounded-response SFO formulas into equisatisfiable formulas in this fragment. We then develop an efficient runtime monitoring algorithm for this past-time fragment and evaluate its performance on a set of benchmarks, demonstrating the practicality and effectiveness of our approach. To the best of our knowledge, this is the first publicly available prototype for online quantitative monitoring of full SFO.
comment: Full version of the FM 2026 paper
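To illustrate what a robustness-based quantitative semantics looks like in general, the sketch below uses the standard Signal Temporal Logic style: a predicate x > c has robustness x - c, and a past-time "historically" operator takes the minimum over a sliding window. This is background only and does not reproduce the SFO semantics or the monitoring algorithm of the paper; the function names and the sample signal are hypothetical.

```python
def robustness_gt(value, threshold):
    """Robustness of the predicate value > threshold (positive means satisfied)."""
    return value - threshold

def historically(robustness_trace, window):
    """Past-time 'always': at each step, the minimum over the last `window` samples."""
    return [
        min(robustness_trace[max(0, i - window + 1): i + 1])
        for i in range(len(robustness_trace))
    ]

# Hypothetical sampled signal monitored against "x > 1 held over the last 2 samples".
signal = [3.0, 2.0, 0.5, 1.5]
predicate_rob = [robustness_gt(x, 1.0) for x in signal]
monitor_output = historically(predicate_rob, window=2)
```

Because each output depends only on samples already seen, such a past-time formula can be evaluated online as the signal arrives.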
Floating-Base Deep Lagrangian Networks
Grey-box methods for system identification combine deep learning with physics-informed constraints, capturing complex dependencies while improving out-of-distribution generalization. Despite the growing importance of floating-base systems such as humanoids and quadrupeds, current grey-box models ignore their specific physical constraints. For instance, the inertia matrix is not only positive definite but also exhibits branch-induced sparsity and input independence. Moreover, the 6x6 composite spatial inertia of the floating base inherits properties of single-rigid-body inertia matrices. As we show, this includes the triangle inequality on the eigenvalues of the composite rotational inertia. To address the lack of physical consistency in deep learning models of floating-base systems, we introduce a parameterization of inertia matrices that satisfies all these constraints. Inspired by Deep Lagrangian Networks (DeLaN), we train neural networks to predict physically plausible inertia matrices that minimize inverse dynamics error under Lagrangian mechanics. For evaluation, we collected and released a dataset on multiple quadrupeds and humanoids. In these experiments, our Floating-Base Deep Lagrangian Networks (FeLaN) achieve better overall performance on both simulated and real robots, while providing greater physical interpretability.
Depth-adapted adaptive optics for three-photon microscopy
Three-photon (3-P) fluorescence microscopy enables deep in vivo imaging with subcellular resolution, but its performance is fundamentally constrained by the maximum permissible laser power required to avoid tissue heating and photodamage. Under these power-limited conditions, fluorescence signal generation, image contrast, and achievable imaging depth are strongly affected by the illumination beam profile and aberration correction strategy. In this paper, we showed that using a fixed illumination beam size was suboptimal across different imaging depths. We further showed that conventional Zernike-based adaptive optics (AO) correction degrades under reduced Gaussian illumination beam sizes due to loss of modal orthogonality. This degradation results in slow convergence, unintended focal and field-of-view shifts, and excessive wavefront deformations. To overcome these limitations, we introduced a depth-adapted AO framework in which both the illumination beam profile and the aberration correction basis were dynamically matched to the imaging conditions. By combining depth-optimised beam underfilling with a bespoke set of illumination-matched aberration modes, we achieved faster and more stable AO convergence, enhanced fluorescence signal and image quality during deep in vivo multi-channel neuroimaging. Together, these results established a practical and robust AO-enabled three-photon microscopy strategy that maximised imaging performance under realistic power constraints.
Occlusion-Aware Multi-Object Tracking via Expected Probability of Detection
This paper addresses multi-object systems, where objects may occlude one another relative to the sensor. The standard point-object model for detection-based sensors is enhanced so that the probability of detection considers the presence of all objects. A principled tracking method is derived, assigning each object an expected probability of detection, where the expectation is taken over the reduced Palm density, which means conditionally on the object's existence. The assigned probability thus considers the object's visibility relative to the sensor, under the presence of other objects. Unlike existing methods, the proposed method systematically accounts for uncertainties related to all objects in a clear and manageable way. The method is demonstrated through a visual tracking application using the multi-Bernoulli mixture (MBM) filter with marks.
comment: Submitted to IEEE Transactions on Aerospace and Electronic Systems (TAES)
A Digital Pheromone-Based Approach for In-Control/Out-of-Control Classification
In complex production lines, it is essential to have strict, fast-acting rules to determine whether the system is In Control (InC) or Out of Control (OutC). This study explores a bio-inspired method that digitally mimics ant colony behavior to classify InC/OutC states and forecast imminent transitions requiring maintenance. A case study on industrial potato chip frying provides the application context. During each two-minute frying cycle, sequences of eight temperature readings are collected. Each sequence is treated as a digital ant depositing virtual pheromones, generating a Base Score. New sequences, representing new ants, can either reinforce or weaken this score, leading to a Modified Base Score that reflects the system's evolving condition. Signals such as extreme temperatures, large variations within a sequence, or the detection of change-points contribute to a Threat Score, which is added to the Modified Base Score. Since pheromones naturally decay over time unless reinforced, an Environmental Score is incorporated to reflect recent system dynamics, imitating real ant behavior. This score is calculated from the Modified Base Scores collected over the past hour. The resulting Total Score, obtained as the sum of the Modified Base Score, Threat Score, and Environmental Score, is used as the main indicator for real-time system classification and forecasting of transitions from InC to OutC. This ant colony optimization-inspired approach provides an adaptive and interpretable framework for process monitoring and predictive maintenance in industrial environments.
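The score aggregation described above can be sketched as follows. Only the structure comes from the abstract (Total Score = Modified Base Score + Threat Score + Environmental Score, with the Environmental Score derived from the past hour of Modified Base Scores). The temperature thresholds, the one-point-per-signal threat rule, the use of a plain mean for the Environmental Score, and all names are hypothetical, and change-point detection is omitted.

```python
from collections import deque

CYCLES_PER_HOUR = 30  # two-minute frying cycles, as stated in the abstract

def threat_score(readings, low=160.0, high=190.0, max_range=10.0):
    """Hypothetical threat rule: one point per triggered warning signal."""
    score = 0.0
    if min(readings) < low or max(readings) > high:
        score += 1.0  # extreme temperature
    if max(readings) - min(readings) > max_range:
        score += 1.0  # large variation within the sequence
    return score

class PheromoneMonitor:
    """Keeps the last hour of Modified Base Scores and aggregates a Total Score."""

    def __init__(self):
        self.history = deque(maxlen=CYCLES_PER_HOUR)

    def total_score(self, modified_base_score, readings):
        # Environmental Score: assumed here to be the mean over the past hour.
        if self.history:
            environmental = sum(self.history) / len(self.history)
        else:
            environmental = 0.0
        total = modified_base_score + threat_score(readings) + environmental
        self.history.append(modified_base_score)
        return total

monitor = PheromoneMonitor()
first = monitor.total_score(1.0, [175.0] * 8)
second = monitor.total_score(2.0, [175.0] * 7 + [195.0])
```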
Design Framework and Manufacturing of an Active Magnetic Bearing Spindle for Micro-Milling Applications
Micro-milling spindles require high rotational speeds where conventional rolling element bearings face limitations such as friction and thermal expansion. Active magnetic bearings (AMBs) address these challenges by providing non-contact and lubrication-free operation at ultra-high speeds with the ability to actively regulate spindle dynamics. The existing literature on AMB spindles has mainly reported specific prototype realizations or control system implementations for specific spindle dynamics. Consequently, design knowledge remains fragmented across isolated successful studies. This paper addresses this gap by presenting a systematic and iterative framework to design and manufacture a micro-milling AMB spindle. The process involves a multidisciplinary design flow with a focus on critical practical aspects of manufacturing. The realized spindle is reported as a case study.
Data-Driven Control of Large-Scale Networks with Formal Guarantees: A Small-Gain Free Approach
This paper offers a data-driven divide-and-conquer strategy to analyze large-scale interconnected networks, characterized by both unknown mathematical models and interconnection topologies. Our data-driven scheme treats an unknown network as an interconnection of individual agents (a.k.a. subsystems) and aims at constructing their symbolic models, referred to as discrete-domain representations of unknown agents, by collecting data from their trajectories. The primary objective is to synthesize a control strategy that guarantees desired behaviors over an unknown network by employing local controllers, derived from symbolic models of individual agents. To achieve this, we leverage the concept of alternating sub-bisimulation function (ASBF) to capture the closeness between state trajectories of each unknown agent and its data-driven symbolic model. Under a newly developed data-driven compositional condition, we then establish an alternating bisimulation function (ABF) between an unknown network and its symbolic model, based on ASBFs of individual agents, while providing correctness guarantees. Whereas the sample complexity in existing work is exponential in the network size, we demonstrate that our divide-and-conquer strategy significantly reduces it to a linear scale with respect to the number of agents. We also show that our data-driven compositional condition does not necessitate the traditional small-gain condition, which demands precise knowledge of the interconnection topology for its fulfillment. We apply our data-driven findings to three benchmarks comprising unknown networks with an arbitrary, a-priori undefined number of agents and unknown interconnection topologies.
AC-Informed DC Optimal Transmission Switching via Admittance Sensitivity-Augmented Constraints and Repair Costs
AC optimal transmission switching (AC-OTS) is a computationally challenging problem due to the nonconvexity and nonlinearity of AC power-flow (PF) equations coupled with a large number of binary variables. A computationally efficient alternative is the DC-OTS model, which uses the DC PF equations, but it can yield infeasible or suboptimal switching decisions when evaluated under the full AC optimal power flow (AC-OPF). To tackle this issue, we propose an AC-Informed DC Optimal Transmission Switching (AIDC-OTS) scheme that enhances the DC-OTS model with constraints based on first- and second-order admittance sensitivities and with repair/penalty costs that guide the DC-OTS towards AC-feasible topologies. The resulting model is initially a Mixed-Integer Quadratically Constrained Quadratic Program (MIQCQP), which we further reformulate into solver-friendly representations, such as a Mixed-Integer Second-Order Cone Program (MISOCP) and a Mixed-Integer Linear Program (MILP). The proposed scheme yields switching topologies that are AC-feasible while maintaining computational tractability. We validate the proposed scheme using extensive simulations across a large set of PGlib test cases, demonstrating its effectiveness, with performance benchmarks against the original DC-OTS and other OTS formulations such as LPAC-OTS and QC-OTS.
comment: 10 pages
An iterative tangential interpolation algorithm for model reduction of MIMO systems
We consider model reduction of large-scale multi-input, multi-output (MIMO) systems using tangential interpolation in the frequency domain. Our scheme is related to the recently-developed Adaptive Antoulas--Anderson (AAA) algorithm, which is an iterative algorithm that uses concepts from the Loewner framework. Our algorithm has two main features. The first is the use of freedom in interpolation weight matrices to optimize a proxy for an \(H_2\) system error. The second is the use of low-rank interpolation, where we iteratively add low-order interpolation data based on several criteria including minimizing maximum errors. We show there is freedom in the interpolation point selection method, leading to multiple algorithms that have trade-offs between computational complexity and approximation performance. We prove that a weighted \(H_2\) norm of a representative error system is monotonically decreasing as interpolation points are added. Finally, we provide computational results and some comparisons with prior work, demonstrating performance on par with standard model reduction methods.
comment: 13 pages, 4 figures. Submitted to IEEE TAC. Revision 2
Learning Interior Point Method Central Path Projection for Optimal Power Flow
This paper proposes a learning-based approach to accelerate the interior-point method (IPM) for solving optimal power flow (OPF) problems by learning the structure of the IPM central path from its early stable iterations. Unlike traditional learning models that attempt to predict the OPF solution directly, our approach learns the structure of the IPM trajectory itself, since even accurate predictions may not reliably reduce IPM iterations. The IPM follows a central path that iteratively progresses toward the optimal solution. While this trajectory encodes critical information about the optimization landscape, the later iterations become increasingly expensive due to ill-conditioned linear systems. Our analysis of the IPM central path reveals that its initial segments contain the most informative features for guiding the trajectory toward optimality. Leveraging this insight, we model the central path as a time series and use a Long Short-Term Memory (LSTM) network to project the path using only the first few stable iterations. To ensure that the learned trajectory remains within the feasible region--especially near the optimal point--we introduce a grid-informed mechanism into the LSTM that enforces key operational constraints on generation, voltage magnitudes, and line flows. This framework, referred to as Learning-IPM (L-IPM), significantly reduces both the number of IPM iterations and overall solution time. To improve generalization, we use a sampling-based strategy to generate a diverse set of load conditions that effectively span the operational space. Simulation results across a range of test systems--including a 2869-bus European transmission network--demonstrate that L-IPM achieves up to a 94% reduction in solution time and an 85.5% reduction in iterations, without compromising feasibility or accuracy.
GENAI WORKBENCH: AI-Assisted Analysis and Synthesis of Engineering Systems from Multimodal Engineering Data
Modern engineering design platforms excel at discipline-specific tasks such as CAD, CAM, and CAE, but often lack native systems engineering frameworks. This creates a disconnect where system-level requirements and architectures are managed separately from detailed component design, hindering holistic development and increasing integration risks. To address this, we present the conceptual framework for the GenAI Workbench, a Model-Based Systems Engineering (MBSE) environment that integrates systems engineering principles into the designer's workflow. Built on an open-source PLM platform, it establishes a unified digital thread by linking semantic data from documents, physical B-rep geometry, and relational system graphs. The workbench facilitates an AI-assisted workflow where a designer can ingest source documents, from which the system automatically extracts requirements and uses vision-language models to generate an initial system architecture, such as a Design Structure Matrix (DSM). This paper presents the conceptual architecture, proposed methodology, and anticipated impact of this work-in-progress framework, which aims to foster a more integrated, data-driven, and informed engineering design methodology.
comment: 7 pages, 3 figures, accepted to be presented at IISE Annual Conference 2026
Zono-Conformal Prediction: Zonotope-Based Uncertainty Quantification for Regression and Classification Tasks
Conformal prediction is a popular uncertainty quantification method that augments a base predictor to return sets of predictions with statistically valid coverage guarantees. However, current methods are often computationally expensive and data-intensive, as they require constructing an uncertainty model before calibration. Moreover, existing approaches typically represent the prediction sets with intervals, which limits their ability to capture dependencies in multi-dimensional outputs. We address these limitations by introducing zono-conformal prediction, a novel approach inspired by interval predictor models and reachset-conformant identification that constructs prediction zonotopes with assured coverage. By placing zonotopic uncertainty sets directly into the model of the base predictor, zono-conformal predictors can be identified via a single, data-efficient linear program. While we can apply zono-conformal prediction to arbitrary nonlinear base predictors, we focus on feed-forward neural networks in this work. Aside from regression tasks, we also construct optimal zono-conformal predictors in classification settings where the output of an uncertain predictor is a set of possible classes. We provide probabilistic coverage guarantees and present methods for detecting outliers in the identification data. In extensive numerical experiments, we show that zono-conformal predictors are less conservative than interval predictor models and standard conformal prediction methods, while achieving a similar coverage over the test data.
comment: https://jmlr.org/papers/v26/25-1161.html
A Self-Supervised Learning Approach with Differentiable Optimization for UAV Trajectory Planning ICRA 2026
While Unmanned Aerial Vehicles (UAVs) have gained significant traction across various fields, path planning in 3D environments remains a critical challenge, particularly under size, weight, and power (SWAP) constraints. Traditional modular planning systems often introduce latency and suboptimal performance due to limited information sharing and local minima issues. End-to-end learning approaches streamline the pipeline by mapping sensory observations directly to actions but require large-scale datasets, face significant sim-to-real gaps, or lack dynamical feasibility. In this paper, we propose a self-supervised UAV trajectory planning pipeline that integrates learning-based depth perception with differentiable trajectory optimization. A 3D cost map guides UAV behavior without expert demonstrations or human labels. Additionally, we incorporate a neural network-based time allocation strategy to improve efficiency and optimality. The system thus combines robust learning-based perception with reliable physics-based optimization for improved generalizability and interpretability. Both simulation and real-world experiments validate our approach across various environments, demonstrating its effectiveness and robustness. Our method achieves a 31.33% improvement in position tracking error and a 49.37% reduction in control effort compared to the state-of-the-art.
comment: Accepted by ICRA 2026
Digital Twin-Based Cooling System Optimization for Data Center
Data center cooling systems consume significant auxiliary energy, yet optimization studies rarely quantify the gap between theoretically optimal and operationally deployable control strategies. This paper develops a digital twin of the liquid cooling infrastructure at the Frontier exascale supercomputer, in which a hot-temperature water system comprises three parallel subloops, each serving dedicated coolant distribution unit clusters through plate heat exchangers and variable-speed pumps. The surrogate model is built in Modelica and validated against one full calendar year of 10-minute operational data following ASHRAE Guideline 14. The model achieves a subloop coefficient of variation of the root mean square error below 2.7% and a normalized mean bias error within 2.5%. Using this validated surrogate model, a layered optimization framework evaluates three progressively constrained strategies: an analytical flow-only optimization achieves a 20.4% total energy saving, unconstrained joint optimization of flow rate and supply temperature achieves a 30.1% total energy saving, and ramp-constrained optimization of flow rate and supply temperature, which enforces actuator rate limits, achieves a 27.8% total energy saving. The analysis reveals that the baseline system operates at 2.9 times the minimum thermally safe flow rate, and that co-optimizing supply temperature with flow rate nearly doubles the savings achievable by flow reduction alone.
comment: 43 pages, 13 figures
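The two validation metrics named in the abstract above can be sketched as follows. This is a minimal illustration of the commonly used definitions of CV(RMSE) and NMBE, not the authors' code: the function names, the sample data, and the degrees-of-freedom parameter `p` (assumed to default to 1) are all assumptions, and the sign convention for NMBE (measured minus simulated) varies between references.

```python
import math


def cvrmse_percent(measured, simulated, p=1):
    """Coefficient of variation of the RMSE, in percent of the measured mean.

    p is the degrees-of-freedom adjustment (assumed to be 1 here).
    """
    n = len(measured)
    mean_measured = sum(measured) / n
    squared_error = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    return 100.0 * math.sqrt(squared_error / (n - p)) / mean_measured


def nmbe_percent(measured, simulated, p=1):
    """Normalized mean bias error, in percent of the measured mean.

    Positive values mean the model under-predicts on average
    (measured minus simulated convention, assumed here).
    """
    n = len(measured)
    mean_measured = sum(measured) / n
    bias = sum(m - s for m, s in zip(measured, simulated))
    return 100.0 * bias / ((n - p) * mean_measured)


if __name__ == "__main__":
    # Hypothetical 10-minute samples of some measured quantity and its model output.
    measured = [10.0, 12.0, 14.0, 16.0]
    simulated = [11.0, 12.0, 13.0, 16.0]
    print(f"CV(RMSE) = {cvrmse_percent(measured, simulated):.2f}%")
    print(f"NMBE     = {nmbe_percent(measured, simulated):.2f}%")
```

CV(RMSE) penalizes scatter and NMBE penalizes systematic offset, so a model can score well on one and poorly on the other; in the toy data the errors cancel and NMBE is zero while CV(RMSE) is not.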
Robotics
From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories
Autonomous vehicle (AV) perception models are typically evaluated solely on benchmark performance metrics, with limited attention to code quality, production readiness and long-term maintainability. This creates a significant gap between research excellence and real-world deployment in safety-critical systems subject to international safety standards. To address this gap, we present the first large-scale empirical study of software quality in AV perception repositories, systematically analyzing 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Using static analysis tools (Pylint, Bandit, and Radon), we evaluated code errors, security vulnerabilities, maintainability, and development practices. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria, defined as having zero critical errors and no high-severity security vulnerabilities. Security issues are highly concentrated, with the top five issues responsible for almost 80% of occurrences, which prompted us to develop a set of actionable guidelines to prevent them. Additionally, the adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability. Our findings highlight that leaderboard performance does not reflect production readiness and that targeted interventions could substantially improve the quality and safety of AV perception code.
Rethinking Camera Choice: An Empirical Study on Fisheye Camera Properties in Robotic Manipulation CVPR 2026
The adoption of fisheye cameras in robotic manipulation, driven by their exceptionally wide Field of View (FoV), is rapidly outpacing a systematic understanding of their downstream effects on policy learning. This paper presents the first comprehensive empirical study to bridge this gap, rigorously analyzing the properties of wrist-mounted fisheye cameras for imitation learning. Through extensive experiments in both simulation and the real world, we investigate three critical research questions: spatial localization, scene generalization, and hardware generalization. Our investigation reveals that: (1) The wide FoV significantly enhances spatial localization, but this benefit is critically contingent on the visual complexity of the environment. (2) Fisheye-trained policies, while prone to overfitting in simple scenes, unlock superior scene generalization when trained with sufficient environmental diversity. (3) While naive cross-camera transfer leads to failures, we identify the root cause as scale overfitting and demonstrate that hardware generalization performance can be improved with a simple Random Scale Augmentation (RSA) strategy. Collectively, our findings provide concrete, actionable guidance for the large-scale collection and effective use of fisheye datasets in robotic learning. More results and videos are available on https://robo-fisheye.github.io/
comment: 22 pages, 15 figures, Accepted by CVPR 2026
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
comment: 33 pages, 17 figures
Real-Time Thermal-Inertial Odometry on Embedded Hardware for High-Speed GPS-Denied Flight
We present a real-time monocular thermal-inertial odometry system designed for high-velocity, GPS-denied flight on embedded hardware. The system fuses measurements from a FLIR Boson+ 640 longwave infrared camera, a high-rate IMU, a laser range finder, a barometer, and a magnetometer within a fixed-lag factor graph. To sustain reliable feature tracks under motion blur, low contrast, and rapid viewpoint changes, we employ a lightweight thermal-optimized front-end with multi-stage feature filtering. Laser range finder measurements provide per-feature depth priors that stabilize scale during weakly observable motion. High-rate inertial data is first pre-filtered using a Chebyshev Type II infinite impulse response (IIR) filter and then preintegrated, improving robustness to airframe vibrations during aggressive maneuvers. To address barometric altitude errors induced at high airspeeds, we train an uncertainty-aware gated recurrent unit (GRU) network that models the temporal dynamics of static pressure distortion, outperforming polynomial and multi-layer perceptron (MLP) baselines. Integrated on an NVIDIA Jetson Xavier NX, the complete system supports closed-loop quadrotor flight at 30 m/s with drift under 2% over kilometer-scale trajectories. These contributions expand the operational envelope of thermal-inertial navigation, enabling reliable high-speed flight in visually degraded and GPS-denied environments.
ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation ICAPS 2026
Goal-conditioned reinforcement learning has shown considerable potential in robotic manipulation; however, existing approaches remain limited by their reliance on prioritizing collected experience, resulting in suboptimal performance across diverse tasks. Inspired by human learning behaviors, we propose a more comprehensive learning paradigm, ACDC, which integrates multidimensional Adaptive Curriculum (AC) Planning with Dynamic Contrastive (DC) Control to guide the agent along a well-designed learning trajectory. More specifically, at the planning level, the AC component schedules the learning curriculum by dynamically balancing diversity-driven exploration and quality-driven exploitation based on the agent's success rate and training progress. At the control level, the DC component implements the curriculum plan through norm-constrained contrastive learning, enabling magnitude-guided experience selection aligned with the current curriculum focus. Extensive experiments on challenging robotic manipulation tasks demonstrate that ACDC consistently outperforms the state-of-the-art baselines in both sample efficiency and final task success rate.
comment: 13 pages (including references and appendix), 12 figures. Accepted to ICAPS 2026. Code available at https://github.com/Xuerui-Wang-oss/Adaptive-Curriculum-Learning-and-Dynamic-Contrastive-Control
$π$-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs
Flow-based vision-language-action (VLA) models excel in embodied control but suffer from intractable likelihoods during multi-step sampling, hindering online reinforcement learning. We propose $π$-StepNFT (Step-wise Negative-aware Fine-Tuning), a critic-and-likelihood-free framework that requires only a single forward pass per optimization step and eliminates auxiliary value networks. We identify that wider exploration spaces necessitate finer-grained, step-wise guidance for alignment. Empirically, $π$-StepNFT unlocks latent potential on LIBERO with competitive few-shot robustness. Moreover, it achieves superior generalization on ManiSkill, outperforming value-based baselines in OOD scenarios by preventing overfitting to multimodal features. This property offers a promising, scalable solution for complex real-world applications.
LAD-Drive: Bridging Language and Trajectory with Action-Aware Diffusion Transformers
While multimodal large language models (MLLMs) provide advanced reasoning for autonomous driving, translating their discrete semantic knowledge into continuous trajectories remains a fundamental challenge. Existing methods often rely on unimodal planning heads that inherently limit their ability to represent multimodal driving behavior. Furthermore, most generative approaches frequently condition on one-hot encoded actions, discarding the nuanced navigational uncertainty critical for complex scenarios. To resolve these limitations, we introduce LAD-Drive, a generative framework that structurally disentangles high-level intention from low-level spatial planning. LAD-Drive employs an action decoder to infer a probabilistic meta-action distribution, establishing an explicit belief state that preserves the nuanced intent typically lost by one-hot encodings. This distribution, fused with the vehicle's kinematic state, conditions an action-aware diffusion decoder that utilizes a truncated denoising process to refine learned motion anchors into safe, kinematically feasible trajectories. Extensive evaluations on the LangAuto benchmark demonstrate that LAD-Drive achieves state-of-the-art results, outperforming competitive baselines by up to 59% in Driving Score while significantly reducing route deviations and collisions. We will publicly release the code and models on https://github.com/iis-esslingen/lad-drive.
CHOP: Counterfactual Human Preference Labels Improve Obstacle Avoidance in Visuomotor Navigation Policies
Visuomotor navigation policies have shown strong perception-action coupling for embodied agents, yet they often struggle with safe navigation and dynamic obstacle avoidance in complex real-world environments. We introduce CHOP, a novel approach that leverages Counterfactual Human Preference Labels to align visuomotor navigation policies towards human intuition of safety and obstacle avoidance in navigation. In CHOP, for each visual observation, the robot's executed trajectory is included among a set of counterfactual navigation trajectories: alternative trajectories the robot could have followed under identical conditions. Human annotators provide pairwise preference labels over these trajectories based on anticipated outcomes such as collision risk and path efficiency. These aggregated preferences are then used to fine-tune visuomotor navigation policies, aligning their behavior with human preferences in navigation. Experiments on the SCAND dataset show that visuomotor navigation policies fine-tuned with CHOP reduce near-collision events by 49.7%, decrease deviation from human-preferred trajectories by 45.0%, and increase average obstacle clearance by 19.8% on average across multiple state-of-the-art models, compared to their pretrained baselines. These improvements transfer to real-world deployments on a Ghost Robotics Vision60 quadruped, where CHOP-aligned policies improve average goal success rates by 24.4%, increase minimum obstacle clearance by 6.8%, reduce collision and intervention events by 45.7%, and improve normalized path completion by 38.6% on average across navigation scenarios, compared to their pretrained baselines. Our results highlight the value of counterfactual preference supervision in bridging the gap between large-scale visuomotor policies and human-aligned, safety-aware embodied navigation.
Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation
Reliable obstacle avoidance in industrial settings demands 3D scene understanding, but widely used 2D LiDAR sensors perceive only a single horizontal slice of the environment, missing critical obstacles above or below the scan plane. We present a teacher-student framework for vision-based mobile robot navigation that eliminates the need for LiDAR sensors. A teacher policy trained via Proximal Policy Optimization (PPO) in NVIDIA Isaac Lab leverages privileged 2D LiDAR observations that account for the full robot footprint to learn robust navigation. The learned behavior is distilled into a student policy that relies solely on monocular depth maps predicted by a fine-tuned Depth Anything V2 model from four RGB cameras. The complete inference pipeline, comprising monocular depth estimation (MDE), policy execution, and motor control, runs entirely onboard an NVIDIA Jetson Orin AGX mounted on a DJI RoboMaster platform, requiring no external computation for inference. In simulation, the student achieves success rates of 82-96.5%, consistently outperforming the standard 2D LiDAR teacher (50-89%). In real-world experiments, the MDE-based student outperforms the 2D LiDAR teacher when navigating around obstacles with complex 3D geometries, such as overhanging structures and low-profile objects, that fall outside the single scan plane of a 2D LiDAR.
Event-Only Drone Trajectory Forecasting with RPM-Modulated Kalman Filtering
Event cameras provide high-temporal-resolution visual sensing that is well suited for observing fast-moving aerial objects; however, their use for drone trajectory prediction remains limited. This work introduces an event-only drone forecasting method that exploits propeller-induced motion cues. Propeller rotational speeds are extracted directly from raw event data and fused within an RPM-aware Kalman filtering framework. Evaluations on the FRED dataset show that the proposed method outperforms learning-based approaches and a vanilla Kalman filter in terms of average distance error and final distance error at 0.4s and 0.8s forecasting horizons. The results demonstrate robust and accurate short- and medium-horizon trajectory forecasting without reliance on RGB imagery or training data.
comment: Submitted to ICUAS 2026 conference
From Transportation to Manipulation: Transforming Magnetic Levitation to Magnetic Robotics
Magnetic Levitation (MagLev) systems fundamentally increase the flexibility of in-machine material flow in industrial automation. Therefore, these systems enable dynamic throughput optimization, which is especially beneficial for high-mix low-volume manufacturing. Until now, MagLev installations have been used primarily for in-machine transport, while their potential for manipulation is largely unexplored. This paper introduces the 6D-Platform MagBot, a low-cost six degrees of freedom parallel kinematic that couples two movers into a composite robotic platform. Experiments show that the 6D-Platform MagBot achieves sub-millimeter positioning accuracy and supports fully autonomous pick up and drop off via a docking station, allowing rapid and repeatable reconfiguration of the machine. Relative to a single mover, the proposed platform substantially expands the reachable workspace, payload, and functional dexterity. By unifying transportation and manipulation, this work advances Magnetic Levitation towards Magnetic Robotics, enabling manufacturing solutions that are more agile, efficient, and adaptable.
Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy ICRA2026
Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp
comment: Accepted by ICRA2026
SaferPath: Hierarchical Visual Navigation with Learned Guidance and Safety-Constrained Control ICRA 2026
Visual navigation is a core capability for mobile robots, yet end-to-end learning-based methods often struggle with generalization and safety in unseen, cluttered, or narrow environments. These limitations are especially pronounced in dense indoor settings, where collisions are likely and end-to-end models frequently fail. To address this, we propose SaferPath, a hierarchical visual navigation framework that leverages learned guidance from existing end-to-end models and refines it through a safety-constrained optimization-control module. SaferPath transforms visual observations into a traversable-area map and refines guidance trajectories using Model Predictive Stein Variational Evolution Strategy (MP-SVES), efficiently generating safe trajectories in only a few iterations. The refined trajectories are tracked by an MPC controller, ensuring robust navigation in complex environments. Extensive experiments in scenarios with unseen obstacles, dense unstructured spaces, and narrow corridors demonstrate that SaferPath consistently improves success rates and reduces collisions, outperforming representative baselines such as ViNT and NoMaD, and enabling safe navigation in challenging real-world settings.
comment: ICRA 2026
Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling WACV 2026
Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse 2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.
comment: WACV 2026 Oral. Project Page at https://a-pru.github.io/seam/
Tiny-DroNeRF: Tiny Neural Radiance Fields aboard Federated Learning-enabled Nano-drones ICRA 2026
Sub-30g nano-sized aerial robots can leverage their agility and form factor to autonomously explore cluttered and narrow environments, like in industrial inspection and search and rescue missions. However, the price for their tiny size is a severe limit on their resources, i.e., sub-100 mW microcontroller units (MCUs) delivering ~100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance fields (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming 100s of Watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model, based on Instant-NGP, and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 96% reduction in Tiny-DroNeRF's memory footprint compared to Instant-NGP, with only a 5.7 dB drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone's memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on a ULP MCU with federated learning on nano-drones.
comment: This paper has been accepted for publication in the IEEE ICRA 2026 conference. ©2026 IEEE
LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization
Event cameras offer high-temporal-resolution sensing that remains reliable under high-speed motion and challenging lighting, making them promising for localization from LiDAR point clouds in GPS-denied and visually degraded environments. However, aligning sparse, asynchronous events with dense LiDAR maps is fundamentally ill-posed, as direct correspondence estimation suffers from modality gaps. We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. Instead of treating edges as a post-hoc aid, LEAR couples them with flow estimation through a cross-modal fusion mechanism that injects modality-invariant geometric cues into the motion representation, and an iterative refinement strategy that enforces mutual consistency between the two tasks over multiple update steps. This synergy produces edge-aware, depth-aligned flow fields that enable more robust and accurate pose recovery via Perspective-n-Point (PnP) solvers. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method. The source code, trained models, and demo videos are made publicly available online.
SSMG-Nav: Enhancing Lifelong Object Navigation with Semantic Skeleton Memory Graph ICRA
Navigating to out-of-sight targets from human instructions in unfamiliar environments is a core capability for service robots. Despite substantial progress, most approaches underutilize reusable, persistent memory, constraining performance in lifelong settings. Many are additionally limited to single-modality inputs and employ myopic greedy policies, which often induce inefficient back-and-forth maneuvers (BFMs). To address such limitations, we introduce SSMG-Nav, a framework for object navigation built on a Semantic Skeleton Memory Graph (SSMG) that consolidates past observations into a spatially aligned, persistent memory anchored by topological keypoints (e.g., junctions, room centers). SSMG clusters nearby entities into subgraphs, unifying entity- and space-level semantics to yield a compact set of candidate destinations. To support multimodal targets (images, objects, and text), we integrate a vision-language model (VLM). For each subgraph, a multimodal prompt synthesized from memory guides the VLM to infer a target belief over destinations. A long-horizon planner then trades off this belief against traversability costs to produce a visit sequence that minimizes expected path length, thereby reducing backtracking. Extensive experiments on challenging lifelong benchmarks and standard ObjectNav benchmarks demonstrate that, compared to strong baselines, our method achieves higher success rates and greater path efficiency, validating the effectiveness of SSMG-Nav.
comment: Accepted by 2026 ICRA
Neural Implicit Action Fields: From Discrete Waypoints to Continuous Functions for Vision-Language-Action Models
Despite the rapid progress of Vision-Language-Action (VLA) models, the prevailing paradigm of predicting discrete waypoints remains fundamentally misaligned with the intrinsic continuity of physical motion. This discretization imposes rigid sampling rates, lacks high-order differentiability, and introduces quantization artifacts that hinder precise, compliant interaction. We propose Neural Implicit Action Fields (NIAF), a paradigm shift that reformulates action prediction from discrete waypoints to continuous action function regression. By utilizing an MLLM as a hierarchical spectral modulator over a learnable motion prior, NIAF synthesizes infinite-resolution trajectories as continuous-time manifolds. This formulation enables analytical differentiability, allowing for explicit supervision of velocity, acceleration, and jerk to ensure mathematical consistency and physical plausibility. Our approach achieves state-of-the-art results on CALVIN and LIBERO benchmarks across diverse backbones. Furthermore, real-world experiments demonstrate that NIAF enables stable impedance control, bridging the gap between high-level semantic understanding and low-level dynamic execution.
Shape-Interpretable Visual Self-Modeling Enables Geometry-Aware Continuum Robot Control
Continuum robots possess high flexibility and redundancy, making them well suited for safe interaction in complex environments, yet their continuous deformation and nonlinear dynamics pose fundamental challenges to perception, modeling, and control. Existing vision-based control approaches often rely on end-to-end learning, achieving shape regulation without explicit awareness of robot geometry or its interaction with the environment. Here, we introduce a shape-interpretable visual self-modeling framework for continuum robots that enables geometry-aware control. Robot shapes are encoded from multi-view planar images using a Bezier-curve representation, transforming visual observations into a compact and physically meaningful shape space that uniquely characterizes the robot's three-dimensional configuration. Based on this representation, neural ordinary differential equations are employed to self-model both shape and end-effector dynamics directly from data, enabling hybrid shape-position control without analytical models or dense body markers. The explicit geometric structure of the learned shape space allows the robot to reason about its body and surroundings, supporting environment-aware behaviors such as obstacle avoidance and self-motion while maintaining end-effector objectives. Experiments on a cable-driven continuum robot demonstrate accurate shape-position regulation and tracking, with shape errors within 1.56% of image resolution and end-effector errors within 2% of robot length, as well as robust performance in constrained environments. By elevating visual shape representations from two-dimensional observations to an interpretable three-dimensional self-model, this work establishes a principled alternative to vision-based end-to-end control and advances autonomous, geometry-aware manipulation for continuum robots.
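As a rough illustration of why a Bezier-curve representation gives a compact shape space, the sketch below evaluates a planar Bezier curve from a handful of control points with de Casteljau's algorithm. It is a generic textbook routine, not the paper's encoder: the function name, the cubic degree, and the control-point values are assumptions made for the example.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] via de Casteljau's algorithm.

    control_points is a list of (x, y) tuples; the curve degree is len - 1.
    """
    points = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(points) > 1:
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p0, p1))
            for p0, p1 in zip(points[:-1], points[1:])
        ]
    return points[0]


if __name__ == "__main__":
    # Hypothetical cubic backbone of a bent robot: four control points
    # stand in for the whole centerline.
    backbone = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
    samples = [bezier_point(backbone, i / 10) for i in range(11)]
    for x, y in samples:
        print(f"{x:.3f}, {y:.3f}")
```

Here eight numbers describe the full curve, and any point along the body can be recovered by choosing `t`, which is the kind of low-dimensional, geometrically meaningful parameterization the abstract refers to.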
Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning ICLR 2026
Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization, which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper-cpo/.
comment: In ICLR 2026. Website at https://naoki04.github.io/paper-cpo/
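As a rough illustration of regulating diversity with a KL constraint between policies, the sketch below computes the closed-form KL divergence between two diagonal Gaussian policies and turns it into a hinge penalty. The names `gaussian_kl` and `kl_penalty`, the threshold `kl_max`, and the penalty form are assumptions made for illustration; the abstract does not state how the constraint is actually enforced.

```python
import math


def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


def kl_penalty(kl, kl_max, coeff=1.0):
    """Hinge penalty (an assumed form): nonzero only when KL exceeds kl_max."""
    return coeff * max(0.0, kl - kl_max)


# Hypothetical follower and leader action distributions for one state.
kl = gaussian_kl([1.0], [1.0], [0.0], [1.0])
penalty = kl_penalty(kl, kl_max=0.2)
```

A penalty of this kind leaves followers free to explore inside the KL budget and pulls them back toward the leader only when they drift beyond it.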
A Safety-Aware Shared Autonomy Framework with BarrierIK Using Control Barrier Functions ICRA 2026
Shared autonomy blends operator intent with autonomous assistance. In cluttered environments, linear blending can produce unsafe commands even when each source is individually collision-free. Many existing approaches model obstacle avoidance through potentials or cost terms, which only enforce safety as a soft constraint. In contrast, safety-critical control requires hard guarantees. We investigate the use of control barrier functions (CBFs) at the inverse kinematics (IK) layer of shared autonomy, targeting post-blend safety while preserving task performance. Our approach is evaluated in simulation on representative cluttered environments and in a VR teleoperation study comparing pure teleoperation with shared autonomy. Across conditions, employing CBFs at the IK layer reduces violation time and increases minimum clearance while maintaining task performance. In the user study, participants reported higher perceived safety and trust, lower interference, and an overall preference for shared autonomy with our safety filter. Additional materials available at https://berkguler.github.io/barrierik.
comment: Accepted to ICRA 2026, 9 pages, 5 figures
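The idea of a control barrier function acting as a hard safety filter can be sketched for a single constraint. Assuming a clearance function h(q) >= 0 with gradient `grad_h`, the filter below minimally modifies a desired joint velocity so that grad_h . qdot >= -alpha * h holds. This is the generic closed-form solution of a one-constraint quadratic program, not the BarrierIK formulation itself; the function name and parameters are hypothetical.

```python
def cbf_filter(qdot_des, grad_h, h, alpha=1.0):
    """Project qdot_des onto the half-space grad_h . qdot >= -alpha * h."""
    # Positive slack means the desired command already satisfies the constraint.
    slack = sum(g * v for g, v in zip(grad_h, qdot_des)) + alpha * h
    norm_sq = sum(g * g for g in grad_h)
    if slack >= 0.0 or norm_sq == 0.0:
        return list(qdot_des)
    # Otherwise shift the command along grad_h by the smallest amount needed.
    scale = -slack / norm_sq
    return [v + scale * g for v, g in zip(qdot_des, grad_h)]


# Hypothetical case: the blended command moves toward an obstacle at 1.0 rad/s
# while only 0.5 of clearance remains.
safe = cbf_filter([-1.0, 0.0], grad_h=[1.0, 0.0], h=0.5, alpha=1.0)
```

Unlike a potential or cost term, the constraint is enforced exactly: the filtered command is the closest velocity to the desired one that still satisfies the barrier condition.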
TacMamba: A Tactile History Compression Adapter Bridging Fast Reflexes and Slow VLA Reasoning
In visually ambiguous manipulation, such as detecting a button click, tactile feedback is often the sole source of ground truth. However, fusing tactile data poses a significant challenge due to a spatiotemporal mismatch: tactile perception requires high-frequency processing with long-horizon memory (System 1), whereas visual policies operate at low control frequencies (System 2). Existing architectures struggle to bridge this gap: Transformers are computationally prohibitive for high-frequency loops (>100 Hz), while LSTMs suffer from forgetting over extended interaction histories. In this paper, we introduce TacMamba, a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. Our approach comprises three core contributions: (1) a custom high-frequency tactile interface designed for flexible integration; (2) a Mamba-based Tactile History Compressor that encodes continuous force history into a compact state with O(1) inference latency (0.45 ms), enabling plug-and-play fusion with VLA models without joint pre-training; and (3) a Tactile-Guided Dual-Stage Training strategy that leverages temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity. Experiments on discrete counting and implicit state switching demonstrate that TacMamba achieves 100% success rates, significantly outperforming the visual-only pi_0.5 baseline, while strictly satisfying hard real-time constraints.
B$^2$F-Map: Crowd-sourced Mapping with Bayesian B-spline Fusion ICRA 2026
Crowd-sourced mapping offers a scalable alternative to creating maps using traditional survey vehicles. Yet, existing methods either rely on prior high-definition (HD) maps or neglect uncertainties in the map fusion. In this work, we present a complete pipeline for HD map generation using production vehicles equipped only with a monocular camera, consumer-grade GNSS, and IMU. Our approach includes on-cloud localization using lightweight standard-definition maps, on-vehicle mapping via an extended object trajectory (EOT) Poisson multi-Bernoulli (PMB) filter with Gibbs sampling, and on-cloud multi-drive optimization and Bayesian map fusion. We represent the lane lines using B-splines, where each B-spline is parameterized by a sequence of Gaussian distributed control points, and propose a novel Bayesian fusion framework for B-spline trajectories with differing density representation, enabling principled handling of uncertainties. We evaluate our proposed approach, B$^2$F-Map, on large-scale real-world datasets collected across diverse driving conditions and demonstrate that our method is able to produce geometrically consistent lane-level maps.
comment: Accepted to ICRA 2026
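The elementary step behind Bayesian fusion of Gaussian-distributed control points can be sketched as precision-weighted averaging of two independent estimates of the same coordinate. This is only the textbook product-of-Gaussians rule for already-matched control points; the paper's framework for B-splines with differing density representations is not reproduced here, and the function name is hypothetical.

```python
def fuse_gaussian(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same scalar quantity."""
    precision = 1.0 / var_a + 1.0 / var_b
    var = 1.0 / precision
    # Each estimate is weighted by its precision (inverse variance).
    mean = var * (mean_a / var_a + mean_b / var_b)
    return mean, var


# Hypothetical lateral position of one control point seen in two drives.
fused_mean, fused_var = fuse_gaussian(0.0, 1.0, 2.0, 1.0)
```

The fused variance is never larger than either input variance, which is how repeated drives progressively tighten the map.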
Learning Thermal-Aware Locomotion Policies for an Electrically-Actuated Quadruped Robot
Electrically-actuated quadrupedal robots possess high mobility on complex terrains, but their motors tend to accumulate heat under high-torque cyclic loads, potentially triggering overheat protection and limiting long-duration tasks. This work proposes a thermal-aware control method that incorporates motor temperatures into reinforcement learning locomotion policies and introduces thermal-constraint rewards to prevent temperature exceedance. Real-world experiments on the Unitree A1 demonstrate that, under a fixed 3 kg payload, the baseline policy triggers overheat protection and stops within approximately 7 minutes, whereas the proposed method can operate continuously for over 27 minutes without thermal interruptions while maintaining comparable command-tracking performance, thereby enhancing sustainable operational capability.
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
Vision-Language-Action (VLA) models build a token-domain robot control paradigm, yet suffer from low speed. Speculative Decoding (SD) is an optimization strategy that can boost inference speed. Two key issues emerge when integrating VLA and SD: first, SD relies on re-inference to address token errors, which is computationally expensive; second, to mitigate token errors, the acceptance threshold in SD requires careful adjustment. Existing works fail to address these two issues effectively. Meanwhile, although embodied intelligence is the bridge between AI and the physical world, existing work has overlooked the application of robotic kinematics. To address these issues, we combine token-domain VLA models with kinematic-domain prediction for SD, proposing a kinematic-rectified SD framework named KERV. We employ a kinematics-based Kalman Filter to predict actions and compensate for SD errors, avoiding costly re-inference. Moreover, we design a kinematics-based adjustment strategy to dynamically rectify the acceptance threshold, addressing the difficulty of threshold determination. Experimental results across diverse tasks and environments demonstrate that KERV achieves 27% to 37% acceleration with almost no loss in success rate.
comment: This paper has been accepted by DAC 2026
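A kinematics-based Kalman filter that predicts the next action can be sketched with a one-dimensional constant-velocity model. The state model, the noise parameters `q` and `r`, and the function names below are assumptions chosen for illustration; the abstract does not specify the filter KERV actually uses.

```python
def kf_predict(x, P, dt, q):
    """Predict step for state x = (pos, vel) with F = [[1, dt], [0, 1]]."""
    pos, vel = x
    x_new = (pos + dt * vel, vel)
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    # P_new = F P F^T + q * I, written out for the 2x2 case.
    n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    n01 = p01 + dt * p11
    n10 = p10 + dt * p11
    n11 = p11 + q
    return x_new, [[n00, n01], [n10, n11]]


def kf_update(x, P, z, r):
    """Update step for a position-only measurement z with variance r."""
    pos, vel = x
    s = P[0][0] + r                      # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain
    innov = z - pos
    x_new = (pos + k0 * innov, vel + k1 * innov)
    # P_new = (I - K H) P with H = [1, 0].
    P_new = [
        [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
        [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
    ]
    return x_new, P_new


# Hypothetical step: predict the next joint position, then correct it with
# a decoded action value of 1.0.
x_pred, P_pred = kf_predict((0.0, 1.0), [[1.0, 0.0], [0.0, 1.0]], dt=0.5, q=0.0)
x_post, P_post = kf_update(x_pred, P_pred, z=1.0, r=1.25)
```

In this picture the prediction `x_pred` is available without running the model again, which is why a kinematic predictor can stand in for re-inference when a drafted token is rejected.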
(hu)Man vs. Machine: In the Future of Motorsport, can Autonomous Vehicles Compete?
Motorsport has historically driven technological innovation in the automotive industry. Autonomous racing provides a proving ground to push the limits of performance of autonomous vehicle (AV) systems. In principle, AVs could be at least as fast as, if not faster than, humans. However, human-driven racing has so far had broader audience appeal and is more strategically challenging. Both provide opportunities to push each other even further technologically, yet competitions remain separate. This paper evaluates whether the future of motorsport could encompass joint competition between humans and AVs. Analysis of the current state of the art, as well as recent competition outcomes, shows that while technical performance has reached comparable levels, there are substantial challenges in racecraft, strategy and safety that need to be overcome. Outstanding issues involved in mixed human-AI racing, ranging from an initial assessment of critical factors such as system-level latencies to effective planning and risk guarantees, are explored. The crucial non-technical aspect of audience engagement and appeal regarding the changing character of motorsport is addressed. In the wider context of motorsport and AVs, this work outlines a proposed agenda for future research to 'keep pushing the possible', in the true spirit of motorsport.
Pri4R: Learning World Dynamics for Vision-Language-Action Models with Privileged 4D Representation
Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight point track head that predicts 3D point tracks. By injecting VLA features into this head to jointly predict future 3D trajectories, the model learns to incorporate evolving scene geometry within its shared representation space, enabling more physically aware context for precise control. Due to its architectural simplicity, Pri4R is compatible with dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
RoboGPU: Accelerating GPU Collision Detection for Robotics
Autonomous robots are increasingly prevalent in our society, emerging in medical care, transportation vehicles, and home assistance. These robots rely on motion planning and collision detection to identify a sequence of movements allowing them to navigate to an end goal without colliding with the surrounding environment. While many specialized accelerators have been proposed to meet the real-time requirements of robotics planning tasks, they often lack the flexibility to adapt to the rapidly changing landscape of robotics and support future advancements. However, GPUs are well-positioned for robotics and we find that they can also tackle collision detection algorithms with enhancements to existing ray tracing accelerator (RTA) units. Unlike intersection tests in ray tracing, collision queries in robotics require control flow mechanisms to avoid unnecessary computations in each query. In this work, we explore and compare different architectural modifications to address the gaps of existing GPU RTAs. Our proposed RoboGPU architecture introduces a RoboCore that computes collision queries 3.1$\times$ faster than RTA implementations and 14.8$\times$ faster than a CUDA baseline. RoboCore is also useful for other robotics tasks, achieving 3.6$\times$ speedup on a state-of-the-art neural motion planner and 1.1$\times$ speedup on Monte Carlo Localization compared to a baseline GPU. RoboGPU matches the performance of dedicated hardware accelerators while being able to adapt to evolving motion planning algorithms and support classical algorithms.
FATE: Closed-Loop Feasibility-Aware Task Generation with Active Repair for Physically Grounded Robotic Curricula
Recent breakthroughs in generative simulation have harnessed Large Language Models (LLMs) to generate diverse robotic task curricula, yet these open-loop paradigms frequently produce linguistically coherent but physically infeasible goals, stemming from ungrounded task specifications or misaligned objective formulations. To address this critical limitation, we propose FATE (Feasibility-Aware Task gEneration), a closed-loop, self-correcting framework that reimagines task generation as an iterative validation-and-refinement process. Unlike conventional methods that decouple generation and verification into discrete stages, FATE embeds a generalist embodied agent directly into the generation loop to proactively guarantee the physical groundedness of the resulting curriculum. FATE instantiates a sequential auditing pipeline: it first validates static scene attributes (e.g., object affordances, layout compatibility) and subsequently verifies execution feasibility via simulated embodied interaction. Critical to its performance, upon detecting an infeasible task, FATE deploys an active repair module that autonomously adapts scene configurations or policy specifications, converting unworkable proposals into physically valid task instances. Extensive experiments validate that FATE generates semantically diverse, physically grounded task curricula while achieving a substantial reduction in execution failure rates relative to state-of-the-art generative baselines.
comment: 16 Pages, 4 Figures
Bimanual XR Specification of Relative and Absolute Assembly Hierarchies for Teleoperation
We present a bimanual XR interaction approach for specifying remote assembly tasks as hierarchies of relative and absolute object constraints that specify high-level teleoperation goals for robots. Grabbing one object in each hand creates a constraint group (visualized as a hull) and groups can be nested into hierarchies. Each group can be relative (with a robot-specifiable 6DoF pose) or absolute (with an author-specified fixed 6DoF pose) in relation to its parent. A relative group specifies a subassembly that can be constructed at a location chosen by the robot software for efficiency rather than mandated by the user.
Towards Robot Skill Learning and Adaptation with Gaussian Processes
General robot skill adaptation requires expressive representations robust to varying task configurations. While recent learning-based skill adaptation methods refined via Reinforcement Learning (RL) have shown success, existing skill models often lack sufficient representational capacity for anything beyond minor environmental changes. In contrast, Gaussian Process (GP)-based skill modelling provides an expressive representation with useful analytical properties; however, adaptation of GP-based skills remains underexplored. This paper proposes a novel, robust skill adaptation framework that utilises GPs with sparse via-points for compact and expressive modelling. The model considers the trajectory's poses and leverages its first and second analytical derivatives to preserve the skill's kinematic profile. We present three adaptation methods to cater for the variability between initial and observed configurations. First, an optimisation agent adjusts the path's via-points while preserving the demonstration velocity. Second, a behaviour cloning agent is trained to replicate output trajectories from the optimisation agent. Third, an RL agent learns to modify via-points whilst maintaining the kinematic profile and enabling online capabilities. Evaluated across three tasks (drawer opening, cube-pushing and bar manipulation) in both simulation and hardware, our proposed methods outperform every benchmark in success rates. Furthermore, the results demonstrate that the GP-based representation enables all three methods to attain high cosine similarity and low velocity magnitude errors, indicating strong preservation of the kinematic profile. Overall, our formulation provides a compact representation capable of adapting to large deviations from a single demonstrated skill.
Multimodal Adversarial Quality Policy for Safe Grasping
Vision-guided robot grasping based on Deep Neural Networks (DNNs) generalizes well but poses safety risks in Human-Robot Interaction (HRI). Recent works addressed this by designing benign adversarial attacks and patches in the RGB modality, yet their depth-independent characteristics limit their effectiveness on the RGBD modality. In this work, we propose the Multimodal Adversarial Quality Policy (MAQP) to realize multimodal safe grasping. Our framework introduces two key components. First, the Heterogeneous Dual-Patch Optimization Scheme (HDPOS) mitigates the distribution discrepancy between RGB and depth modalities in patch generation by adopting modality-specific initialization strategies, employing a Gaussian distribution for depth patches and a uniform distribution for RGB patches, while jointly optimizing both modalities under a unified objective function. Second, the Gradient-Level Modality Balancing Strategy (GLMBS) is designed to resolve the optimization imbalance between RGB and depth patches in patch shape adaptation by reweighting gradient contributions based on per-channel sensitivity analysis and applying distance-adaptive perturbation bounds. We conduct extensive experiments on benchmark datasets and a cobot, showing the effectiveness of MAQP.
comment: submitted
SFCo-Nav: Efficient Zero-Shot Visual Language Navigation via Collaboration of Slow LLM and Fast Attributed Graph Alignment ICRA
Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to visual language navigation (VLN), where an agent follows natural language instructions using only ego perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns advanced structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaboration zero-shot VLN system supporting asynchronous LLM triggering according to the internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
comment: Accepted by 2026 IEEE International Conference on Robotics and Automation (ICRA)
ROSER: Few-Shot Robotic Sequence Retrieval for Scalable Robot Learning ICLR
A critical bottleneck in robot learning is the scarcity of task-labeled, segmented training data, despite the abundance of large-scale robotic datasets recorded as long, continuous interaction logs. Existing datasets contain vast amounts of diverse behaviors, yet remain structurally incompatible with modern learning frameworks that require cleanly segmented, task-specific trajectories. We address this data utilization crisis by formalizing robotic sequence retrieval: the task of extracting reusable, task-centric segments from unlabeled logs using only a few reference examples. We introduce ROSER, a lightweight few-shot retrieval framework that learns task-agnostic metric spaces over temporal windows, enabling accurate retrieval with as few as 3-5 demonstrations, without any task-specific training. To validate our approach, we establish comprehensive evaluation protocols and benchmark ROSER against classical alignment methods, learned embeddings, and language model baselines across three large-scale datasets (LIBERO, DROID, and nuScenes). Our experiments demonstrate that ROSER consistently outperforms all prior methods in both accuracy and efficiency, achieving sub-millisecond per-match inference while maintaining superior distributional alignment. By reframing data curation as few-shot retrieval, ROSER provides a practical pathway to unlock underutilized robotic datasets, fundamentally improving data availability for robot learning.
comment: 2026 ICLR DATA-FM Workshop
Mean-Flow based One-Step Vision-Language-Action
Recent advances in Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these notable achievements, their practical applications are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this critical bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the generation speed of the proposed Mean-Flow based One-Step VLA is 8.7 times and 83.9 times faster than that of SmolVLA and Diffusion Policy, respectively. These results demonstrate its great potential as a high-efficiency backbone for VLA-based robotic manipulation.
Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining
Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.
Scaling Tasks, Not Samples: Mastering Humanoid Control through Multi-Task Model-Based Reinforcement Learning
Developing generalist robots capable of mastering diverse skills remains a central challenge in embodied AI. While recent progress emphasizes scaling model parameters and offline datasets, such approaches are limited in robotics, where learning requires active interaction. We argue that effective online learning should scale the number of tasks, rather than the number of samples per task. This regime reveals a structural advantage of model-based reinforcement learning (MBRL). Because physical dynamics are invariant across tasks, a shared world model can aggregate multi-task experience to learn robust, task-agnostic representations. In contrast, model-free methods suffer from gradient interference when tasks demand conflicting actions in similar states. Task diversity therefore acts as a regularizer for MBRL, improving dynamics learning and sample efficiency. We instantiate this idea with EfficientZero-Multitask (EZ-M), a sample-efficient multi-task MBRL algorithm for online learning. Evaluated on HumanoidBench, a challenging whole-body control benchmark, EZ-M achieves state-of-the-art performance with significantly higher sample efficiency than strong baselines, without extreme parameter scaling. These results establish task scaling as a critical axis for scalable robotic learning. The project website is available at https://yewr.github.io/ez_m/.
Unifying Language-Action Understanding and Generation for Autonomous Driving
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% of inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
PhysGraph: Physically-Grounded Graph-Transformer Policies for Bimanual Dexterous Hand-Tool-Object Manipulation
Bimanual dexterous manipulation for tool use remains a formidable challenge in robotics due to the high-dimensional state space and complicated contact dynamics. Existing methods naively represent the entire system state as a single configuration vector, disregarding the rich structural and topological information inherent to articulated hands. We present PhysGraph, a physically-grounded graph transformer policy designed explicitly for challenging bimanual hand-tool-object manipulation. Unlike prior works, we represent the bimanual system as a kinematic graph and introduce per-link tokenization to preserve fine-grained local state information. We propose a physically-grounded bias generator that injects structural priors directly into the attention mechanism, including kinematic spatial distance, dynamic contact states, geometric proximity, and anatomical properties. This allows the policy to explicitly reason about physical interactions rather than learning them implicitly from sparse rewards. Extensive experiments show that PhysGraph significantly outperforms the ManipTrans baseline in manipulation precision and task success rates while using only 51% of the parameters of ManipTrans. Furthermore, the inherent topological flexibility of our architecture enables qualitative zero-shot transfer to unseen tool/object geometries, and the architecture is sufficiently general to be trained on three robotic hands (Shadow, Allegro, Inspire).
Jailbreaking Embodied LLMs via Action-level Manipulation
Embodied Large Language Models (LLMs) enable AI agents to interact with the physical world through natural language instructions and actions. However, beyond the language-level risks inherent to LLMs themselves, embodied LLMs with real-world actuation introduce a new vulnerability: instructions that appear semantically benign may still lead to dangerous real-world consequences, revealing a fundamental misalignment between linguistic security and physical outcomes. In this paper, we introduce Blindfold, an automated attack framework that leverages the limited causal reasoning capabilities of embodied LLMs in real-world action contexts. Rather than iterative trial-and-error jailbreaking of black-box embodied LLMs, Blindfold adopts an Adversarial Proxy Planning strategy: it compromises a local surrogate LLM to perform action-level manipulations that appear semantically safe but could result in harmful physical effects when executed. Blindfold further conceals key malicious actions by injecting carefully crafted noise to evade detection by defense mechanisms, and it incorporates a rule-based verifier to improve the attack executability. Evaluations on both embodied AI simulators and a real-world 6DoF robotic arm show that Blindfold achieves up to 53% higher attack success rates than SOTA baselines, highlighting the urgent need to move beyond surface-level language censorship and toward consequence-aware defense mechanisms to secure embodied LLMs.
comment: This paper has been officially accepted for ACM SenSys 2026
D-GVIO: A Buffer-Driven and Efficient Decentralized GNSS-Visual-Inertial State Estimator for Multi-Agent Systems ICRA 2026
Cooperative localization is essential for swarm applications like collaborative exploration and search-and-rescue missions. However, maintaining real-time capability, robustness, and computational efficiency on resource-constrained platforms presents significant challenges. To address these challenges, we propose D-GVIO, a buffer-driven and fully decentralized GNSS-Visual-Inertial Odometry (GVIO) framework that leverages a novel buffering strategy to support efficient and robust distributed state estimation. The proposed framework is characterized by four core mechanisms. Firstly, through covariance segmentation, covariance intersection, and a buffering strategy, we modularize propagation and update steps in distributed state estimation, significantly reducing computational and communication burdens. Secondly, the left-invariant extended Kalman filter (L-IEKF) is adopted for information fusion, which exhibits superior state estimation performance over the traditional extended Kalman filter (EKF) since its state transition matrix is independent of the system state. Thirdly, a buffer-based re-propagation strategy is employed to handle delayed measurements efficiently and accurately by leveraging the L-IEKF, eliminating the need for costly re-computation. Finally, an adaptive buffer-driven outlier detection method is proposed to dynamically cull GNSS outliers, enhancing robustness in GNSS-challenged environments.
comment: Accepted by ICRA 2026
Align and Filter: Improving Performance in Asynchronous On-Policy RL
Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: policy lag, which is the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use the findings to propose total Variation-based Advantage aligned Constrained policy Optimization (\methodacronym) as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag in classic RL tasks and a modern RL for LLM math reasoning task.
A Novel Modular Cable-Driven Soft Robotic Arm with Multi-Segment Reconfigurability
This paper presents a novel, modular, cable-driven soft robotic arm featuring multi-segment reconfigurability. The proposed architecture enables a stackable system with independent segment control, allowing scalable adaptation to diverse structural and application requirements. The system is fabricated from soft silicone material and incorporates embedded tendon-routing channels with a protective dual-helical tendon structure. Experimental results showed that modular stacking substantially expanded the reachable workspace: relative to the single-segment arm, the three-segment configuration achieved up to a 13-fold increase in planar workspace area and a 38.9-fold increase in workspace volume. Furthermore, this study investigated the effect of silicone stiffness on actuator performance. The results revealed a clear trade-off between compliance and stiffness: softer silicone improved bending flexibility, while stiffer silicone improved structural rigidity and load-bearing stability. These results highlight the potential of stiffness tuning to balance compliance and strength for configuring scalable, reconfigurable soft robotic arms.
comment: 6 pages, 8 figures, Submitted to IEEE/ASME International Conference on Advanced Intelligent Mechatronics
Learning Therapist Policy from Therapist-Exoskeleton-Patient Interaction ICRA 2026
Post-stroke rehabilitation is often necessary for patients to regain proper walking gait. However, the typical therapy process can be exhausting and physically demanding for therapists, potentially reducing therapy intensity, duration, and consistency over time. We propose a Patient-Therapist Force Field (PTFF) to visualize therapist responses to patient kinematics and a Synthetic Therapist (ST) machine learning model to support the therapist in dyadic robot-mediated physical interaction therapy. The former encodes patient and therapist stride kinematics into a shared low-dimensional latent manifold using a Variational Autoencoder (VAE) and models their interaction through a Gaussian Mixture Model (GMM), which learns a probabilistic vector field mapping patient latent states to therapist responses. This representation visualizes patient-therapist interaction dynamics to inform therapy strategies and robot controller design. The latter is implemented as a Long Short-Term Memory (LSTM) network trained on patient-therapist interaction data to predict therapist-applied joint torques from patient kinematics. Trained and validated using leave-one-out cross-validation across eight post-stroke patients, the model was integrated into a ROS-based exoskeleton controller to generate real-time torque assistance based on predicted therapist responses. Offline results and preliminary testing indicate the potential of both tools as an alternative approach to post-stroke exoskeleton therapy. The PTFF provides understanding of the therapist's actions, while the ST frees the human therapist from the exoskeleton, allowing them to continuously monitor the patient's nuanced condition.
comment: Accepted at IEEE International Conference on Robotics and Automation (ICRA 2026)
Safe Whole-Body Loco-Manipulation via Combined Model and Learning-based Control ICRA
Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation, while considering safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines a model-based admittance control for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches--such as those applied by a human during physical interaction--into desired end-effector velocities, allowing for compliant behavior. The velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a Kalman filter enhanced with neural networks for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot with a 6-DoF arm and wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), June 2026, in Vienna, Austria
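The admittance mapping from external wrenches to desired end-effector velocities described above can be sketched as a per-axis mass-damper model. This is a generic illustration, not the controller from the paper: the virtual mass, damping, time step, and function names are all assumed values chosen for the example.

```python
def admittance_step(velocity, wrench, mass, damping, dt):
    """One explicit-Euler step of M * dv/dt + D * v = F, applied per axis."""
    return [v + dt * (f - damping * v) / mass
            for v, f in zip(velocity, wrench)]


def simulate(wrench, mass=2.0, damping=10.0, dt=0.01, steps=500):
    """Apply a constant wrench and return the resulting desired velocity."""
    velocity = [0.0] * len(wrench)
    for _ in range(steps):
        velocity = admittance_step(velocity, wrench, mass, damping, dt)
    return velocity


# Hypothetical 6-DoF wrench: 5 N along x and 1 N*m about z.
desired_velocity = simulate([5.0, 0.0, 0.0, 0.0, 0.0, 1.0])
```

Under a constant wrench the desired velocity settles at force divided by damping on each axis, so a lower virtual damping yields a more compliant response to the same push.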
Strategic Shaping of Human Prosociality: A Latent-State POMDP Framework
We propose a decision-theoretic framework in which a robot can strategically shape a human's inferred prosocial state during repeated interactions. Modeling the human's prosociality as a latent state that evolves over time, the robot learns to infer and influence this state through its own actions, including helping and signaling. We formalize this as a latent-state POMDP with limited observations and learn the transition and observation dynamics using expectation maximization. The resulting belief-based policy balances task and social objectives, selecting actions that maximize long-term cooperative outcomes. We evaluate the model using data from user studies and show that the learned policy outperforms baseline strategies in both team performance and observed human cooperative behavior.
comment: This article has been published in IEEE Robotics and Automation Letters. https://ieeexplore.ieee.org/document/11410120
Diffusion-MPC in Discrete Domains: Feasibility Constraints, Horizon Effects, and Critic Alignment: Case study with Tetris
We study diffusion-based model predictive control (Diffusion-MPC) in discrete combinatorial domains using Tetris as a case study. Our planner samples candidate placement sequences with a MaskGIT-style discrete denoiser and selects actions via reranking. We analyze three key factors: (1) feasibility-constrained sampling via logit masking over valid placements, (2) reranking strategies using a heuristic score, a pretrained DQN critic, and a hybrid combination, and (3) compute scaling in candidate count and planning horizon. We find that feasibility masking is necessary in discrete domains, removing invalid action mass (46%) and yielding a 6.8% improvement in score and 5.6% improvement in survival over unconstrained sampling. Naive DQN reranking is systematically misaligned with rollout quality, producing high decision regret (mean 17.6, p90 36.6). Shorter planning horizons outperform longer ones under sparse and delayed rewards, suggesting uncertainty compounding in long imagined rollouts. Overall, compute choices (K, H) determine dominant failure modes: small K limits candidate quality, while larger H amplifies misranking and model mismatch. Our findings highlight structural challenges of diffusion planners in discrete environments and provide practical diagnostics for critic integration.
comment: 7 pages, 3 figures, 2 tables. Includes regret diagnostics and compute-quality frontier analysis. Code and experiment configurations available in the Diffusion-Tetris repository
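The feasibility-constrained sampling via logit masking mentioned in the abstract above can be illustrated with a small sketch. It shows the general idea only (set the logits of invalid actions to negative infinity before the softmax); the function names and the toy logits are assumptions, and the paper's MaskGIT-style denoiser is not reproduced here.

```python
import math
import random


def masked_softmax(logits, valid_mask):
    """Softmax over logits with invalid entries forced to probability zero."""
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid_mask)]
    peak = max(masked)  # finite, because at least one entry is valid
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def sample_valid_action(logits, valid_mask, rng):
    """Draw one action index from the masked distribution."""
    probs = masked_softmax(logits, valid_mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical example: placement 1 is infeasible on the current board.
probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])
rng = random.Random(0)
draws = [sample_valid_action([2.0, 1.0, 0.5], [True, False, True], rng)
         for _ in range(200)]
```

Masking before normalization redistributes the probability mass of invalid actions onto the valid ones, rather than rejecting invalid samples after the fact.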
Goal-Oriented Semantic Communication for ISAC-Enabled Robotic Obstacle Avoidance
We investigate an integrated sensing and communication (ISAC)-enabled base station (BS) for the unmanned aerial vehicle (UAV) obstacle avoidance task and propose a goal-oriented semantic communication (GOSC) framework for the BS to transmit sensing and command and control (C&C) signals efficiently and effectively. Our GOSC framework establishes a closed loop of sensing, C&C generation, and sensing and C&C transmission. For sensing, a Kalman filter (KF) is applied to continuously predict UAV positions, mitigating the reliance of UAV position acquisition on continuous sensing signal transmission and enhancing position estimation accuracy through sensing-prediction fusion. Based on the refined position estimate provided by the KF, we develop a Mahalanobis distance-based dynamic window approach (MD-DWA) to generate precise C&C signals under uncertainty, in which we derive the mathematical expression of the minimum Mahalanobis distance required to guarantee collision avoidance. Finally, for efficient sensing and C&C signal transmission, we propose an effectiveness-aware deep Q-network (E-DQN) to determine the transmission of sensing and C&C signals based on their value of information (VoI). The VoI of sensing signals is quantified by the reduction in the uncertainty entropy of the UAV's position estimate, while the VoI of C&C signals is measured by their contribution to UAV navigation improvement. Extensive simulations validate the effectiveness of our proposed GOSC framework. Compared to the conventional ISAC transmission framework that transmits sensing and C&C signals at every time slot, GOSC achieves the same 100% task success rate while reducing the number of transmitted sensing and C&C signals by 92.4% and the number of transmission time slots by 85.5%.
comment: 13 pages, 15 figures
Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning
We study offline off-dynamics reinforcement learning (RL) to utilize data from an easily accessible source domain to enhance policy learning in a target domain with limited data. Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on Decision Transformer (DT) type frameworks, which can predict actions conditioned on desired return guidance and complete trajectory history. Previous works address the dynamics shift problem by augmenting the reward in the trajectory from the source domain to match the optimal trajectory in the target domain. However, this strategy is not directly applicable to RCSL owing to (1) the unique form of the RCSL policy class, which explicitly depends on the return, and (2) the absence of a straightforward representation of the optimal trajectory distribution. We propose the Return Augmented (REAG) method for DT type frameworks, where we augment the return in the source domain by aligning its distribution with that in the target domain. We provide a theoretical analysis demonstrating that the RCSL policy learned from REAG achieves the same level of suboptimality as would be obtained without a dynamics shift. We introduce two practical implementations, REAG$_\text{Dara}^{*}$ and REAG$_\text{MV}^{*}$. Thorough experiments on D4RL datasets and various DT-type baselines demonstrate that our methods consistently enhance the performance of DT type frameworks in off-dynamics RL.
comment: 26 pages, 11 tables, 8 figures. Published in Transactions on Machine Learning Research (TMLR)
Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning
Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.
UNCLE-Grasp: Uncertainty-Aware Grasping of Leaf-Occluded Strawberries
Robotic strawberry harvesting remains challenging under partial occlusion, where leaf interference introduces significant geometric uncertainty and renders grasp decisions based on a single deterministic shape estimate unreliable. From a single partial observation, multiple incompatible 3D shape completions may be plausible, such that grasps deemed feasible on one completion can fail on another. This paper presents an uncertainty-aware grasping pipeline for partially occluded strawberries that explicitly models geometric uncertainty arising from both occlusion and learned shape completion. The proposed approach employs point cloud completion with Monte Carlo dropout to sample multiple shape hypotheses, generates candidate grasps for each completion, and evaluates grasp feasibility using physically grounded force-closure metrics. Rather than selecting a grasp from a single shape estimate, feasibility is aggregated across completions and a conservative lower confidence bound (LCB) criterion is used to decide whether grasping a strawberry should be attempted or safely abstained. The method is evaluated in simulation and on a physical robot under increasing levels of synthetic and real leaf occlusion. Experimental results demonstrate that uncertainty-aware decision making enables reliable abstention from high-risk grasp attempts under severe occlusion while maintaining robust grasp execution when geometric confidence is sufficient, outperforming deterministic baselines in both simulated and physical robot experiments.
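The attempt-or-abstain decision described above can be sketched with a simple lower confidence bound over per-completion feasibility scores. The specific form used here (mean minus `kappa` times the standard deviation, compared against a threshold) is an assumption for illustration; the abstract does not state the exact LCB formula, and the scores, `kappa`, and `threshold` values are hypothetical.

```python
import statistics


def lower_confidence_bound(scores, kappa=1.0):
    """Mean feasibility minus kappa standard deviations across completions."""
    mean = statistics.fmean(scores)
    spread = statistics.pstdev(scores) if len(scores) > 1 else 0.0
    return mean - kappa * spread


def should_attempt(scores, threshold=0.5, kappa=1.0):
    """Attempt the grasp only if the conservative bound clears the threshold."""
    return lower_confidence_bound(scores, kappa) >= threshold


# Hypothetical feasibility scores of one grasp on sampled shape completions.
agreeing = [0.8, 0.8, 0.8]          # completions agree: grasp looks safe
conflicting = [0.9, 0.1, 0.9, 0.1]  # completions disagree: high risk

attempt_agreeing = should_attempt(agreeing)
attempt_conflicting = should_attempt(conflicting)
```

The point of the conservative bound is that a grasp with a decent average score is still rejected when the sampled shapes disagree strongly about it.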
Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation
Flow matching has emerged as a competitive framework for learning high-quality generative policies in robotics; however, we find that generalisation arises and saturates early along the flow trajectory, in accordance with recent findings in the literature. We further observe that increasing the number of Euler integration steps during inference counter-intuitively and universally degrades policy performance. We attribute this to (i) additional, uniformly spaced integration steps oversampling the late-time region, thereby constraining actions towards the training trajectories and reducing generalisation; and (ii) the learned velocity field becoming non-Lipschitz as integration time approaches 1, causing instability. To address these issues, we propose a novel policy that utilises non-uniform time scheduling (e.g., U-shaped) during training, which emphasises both early and late temporal stages to regularise policy training, and a dense-jump integration schedule at inference, which replaces the multi-step integration beyond a jump point with a single integration step to avoid the unstable region near 1. Essentially, our policy is an efficient one-step learner that still pushes performance forward through multi-step integration, yielding up to 23.7% performance gains over state-of-the-art baselines across diverse robotic tasks.
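The dense-jump inference schedule described above can be sketched as follows: several small Euler steps up to a jump point, then a single step from the jump point to time 1. This is a minimal sketch of the scheduling idea only; the jump point, the step count, and the toy velocity field are assumptions, and a real policy would use a learned, state-dependent velocity network.

```python
def dense_jump_times(num_dense_steps, jump_point):
    """Time grid: uniform dense steps on [0, jump_point], then one jump to 1."""
    if num_dense_steps < 1:
        raise ValueError("need at least one dense step")
    if not 0.0 < jump_point < 1.0:
        raise ValueError("jump_point must lie strictly between 0 and 1")
    step = jump_point / num_dense_steps
    return [i * step for i in range(num_dense_steps)] + [jump_point, 1.0]


def euler_integrate(velocity_fn, x0, times):
    """Explicit Euler integration of dx/dt = velocity_fn(x, t) along times."""
    x = x0
    for t0, t1 in zip(times, times[1:]):
        x = x + (t1 - t0) * velocity_fn(x, t0)
    return x


# Hypothetical schedule: 4 dense steps up to t = 0.5, then one jump to t = 1.
times = dense_jump_times(num_dense_steps=4, jump_point=0.5)
# Toy constant velocity field standing in for the learned one.
action = euler_integrate(lambda x, t: 2.0, 0.0, times)
```

Because the final interval is covered in one step, the velocity field is never evaluated at times close to 1, which is where the abstract reports instability.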
HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments ICRA
Navigating through dense human crowds remains a significant challenge for mobile robots. A key issue is the freezing robot problem, where the robot struggles to find safe motions and becomes stuck within the crowd. To address this, we propose HiCrowd, a hierarchical framework that integrates reinforcement learning (RL) with model predictive control (MPC). HiCrowd leverages surrounding pedestrian motion as guidance, enabling the robot to align with compatible crowd flows. A high-level RL policy generates a follow point to align the robot with a suitable pedestrian group, while a low-level MPC safely tracks this guidance with short-horizon planning. The method combines long-term crowd-aware decision making with safe short-term execution. We evaluate HiCrowd against reactive and learning-based baselines in an offline setting (replaying recorded human trajectories) and an online setting (human trajectories are updated to react to the robot in simulation). Experiments on a real-world dataset and a synthetic crowd dataset show that our method outperforms the baselines in navigation efficiency and safety while reducing freezing behaviors. Our results suggest that leveraging human motion as guidance, rather than treating humans solely as dynamic obstacles, provides a powerful principle for safe and efficient robot navigation in crowds.
comment: 2026 IEEE International Conference on Robotics and Automation (ICRA)
Sample-efficient and Scalable Exploration in Continuous-Time RL ICLR 2026
Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
comment: 28 pages, 8 figures, 6 tables. Published as a conference paper at ICLR 2026
Learning Contact Dynamics through Touching: Action-conditional Graph Neural Networks for Robotic Peg Insertion
We present a learnable physics-based predictive model that provides accurate motion and force-torque prediction of the robot end effector in contact-rich manipulation. The proposed model extends the state-of-the-art GNN-based simulator (FIGNet) with novel node and edge types, enabling action-conditional predictions for control and state estimation in the context of robotic peg insertion. Our model learns in a self-supervised manner, using only joint encoder and force-torque data while the robot is touching the environment. In simulation, the MPC agent using our model matches the performance of the same controller with the ground truth dynamics model in a challenging peg-in-hole task, while in the real-world experiment, our model achieves a 50% improvement in motion prediction accuracy and a 3× increase in force-torque prediction precision over the baseline physics simulator. Finally, we apply the model to track the robot end effector with a particle filter during real-world peg insertion, demonstrating a practical application of its predictive accuracy.
HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering
Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM MatchXSPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.
CAIMAN: Causal Action Influence Detection for Sample-efficient Loco-manipulation
Enabling legged robots to perform non-prehensile loco-manipulation is crucial for enhancing their versatility. Learning behaviors such as whole-body object pushing often requires sophisticated planning strategies or extensive task-specific reward shaping, especially in unstructured environments. In this work, we present CAIMAN, a practical reinforcement learning framework that encourages the agent to gain control over other entities in the environment. CAIMAN leverages causal action influence as an intrinsic motivation objective, allowing legged robots to efficiently acquire object pushing skills even under sparse task rewards. We employ a hierarchical control strategy, combining a low-level locomotion module with a high-level policy that generates task-relevant velocity commands and is trained to maximize the intrinsic reward. To estimate causal action influence, we learn the dynamics of the environment by integrating a kinematic prior with data collected during training. We empirically demonstrate CAIMAN's superior sample efficiency and adaptability to diverse scenarios in simulation, as well as its successful transfer to real-world systems without further fine-tuning. A video demo is available at https://www.youtube.com/watch?v=dNyvT04Cqaw.
Soft Pneumatic Grippers: Topology optimization, 3D-printing and Experimental validation
This paper presents a systematic topology optimization framework for designing a soft pneumatic gripper (SPG), explicitly considering the design-dependent nature of the actuating load. The load is modeled using Darcy's law with an added drainage term. A 2D soft arm unit is optimized by formulating it as a compliant mechanism design problem using the robust formulation. The problem is posed as a min-max optimization, where the output deformations of blueprint and eroded designs are considered. A volume constraint is imposed on the blueprint part, while a strain-energy constraint is enforced on the eroded part. The MMA is employed to solve the optimization problem and obtain the optimized soft unit. Finite element analysis with the Ogden material model confirms that the optimized 2D unit outperforms a conventional rectangular design under pneumatic loading. The optimized 2D unit is extruded to obtain a 3D module, and ten such units are assembled to create a soft arm. Deformation profiles of the optimized arm are analysed under different pressure loads. Four arms are 3D-printed and integrated with a supporting structure to realize the proposed SPG. The gripping performance of the SPG is demonstrated on objects with different weights, sizes, stiffness, and shapes.
comment: 11 Figures
Symphony: A Heuristic Normalized Calibrated Advantage Actor and Critic Algorithm in application for Humanoid Robots
In our work we implicitly suggest that it is a misconception to think that humans learn fast. The learning process takes time. Babies start learning to move in the restricted fluid environment of the womb. Children are often limited by an underdeveloped body. Even adults are not allowed to participate in complex competitions right away. However, with robots, when learning from scratch, we often don't have the privilege of waiting for tens of millions of steps. "Swaddling" regularization restrains an agent during rapid but unstable development by penalizing action strength in a specific way that does not affect actions directly. Symphony, a Transitional-policy Deterministic Actor and Critic algorithm, is a concise combination of different ideas that makes it possible to train humanoid robots from scratch with Sample Efficiency, Sample Proximity and Safety of Actions in mind. It is well known that a continuous increase in Gaussian noise without appropriate smoothing is harmful to motors and gearboxes. In contrast to Stochastic algorithms, we set limited parametric noise and promote a reduced strength of actions, safely increasing entropy, since the actions are submerged in weaker noise. When actions require more extreme values, they rise above the weak noise. Training becomes empirically much safer for both the surrounding environment and the robot's mechanisms. We use a Fading Replay Buffer: using a fixed formula containing the hyperbolic tangent, we adjust the batch sampling probability so that the memory contains a recent memory and a long-term memory trail. The Fading Replay Buffer allows us to use Temporal Advantage when we improve the current Critic Network prediction compared to the exponential moving average. Temporal Advantage allows us to update the Actor and Critic in one pass, as well as to combine the Actor and Critic in one Object and implement their Losses in one line.
comment: https://github.com/SuspensionRailway/symphony
SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios
Autonomous agents operating in the real world must interact continuously with existing physical and semantic infrastructure, track delayed consequences, and verify outcomes over time. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, posing core challenges for lifelong embodied agents, including partial observability, causal reasoning across time, and failure-aware verification under real-world constraints. Yet, current benchmarks rarely consider such long-horizon interaction and causality requirements. We introduce SWITCH (Semantic World Interface Tasks for Control & Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities (task-aware VQA, semantic UI grounding, action generation, state transition prediction, and result verification) under ego-centric RGB video input and device diversity across 351 tasks spanning 98 real devices/appliances. Results from commercial and open LMMs reveal systematic failures, highlighting critical gaps for lifelong agent deployment. SWITCH provides data, code, and held-out splits to enable reproducible non-contaminated evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of relevant training data. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.
Properties of Lyapunov Subcenter Manifolds in Conservative Mechanical Systems
Multi-body mechanical systems have rich internal dynamics, whose solutions can be exploited as efficient control targets. Yet, solutions non-trivially depend on system parameters, obscuring feasible properties for use as target trajectories. For periodic regulation tasks in robotics applications, we investigate properties of nonlinear normal modes (NNMs) collected in Lyapunov subcenter manifolds (LSMs) of conservative mechanical systems. Using a time-symmetry of conservative mechanical systems, we show that mild non-resonance conditions guarantee LSMs to be Eigenmanifolds, in which NNMs are guaranteed to oscillate between two points of zero velocity. We also prove the existence of a unique generator, which is a connected, 1D manifold that collects these points of zero velocity for a given Eigenmanifold. Furthermore, we show that an additional spatial symmetry provides LSMs with yet stronger properties of Rosenberg manifolds. Here all brake trajectories pass through a unique equilibrium configuration, which can be favorable for control applications. These theoretical results are numerically confirmed on two mechanical systems: a double pendulum and a 5-link pendulum.
comment: 20 pages, 27 figures, submitted to Automatica
OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation ICRA'26
Vision-language-action (VLA) models have shown strong generalization for robotic action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this, we present a multisensory vision-language-action model architecture and train the model based on an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception guides the robotic manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming the RGB-only and raw-sensor-input baseline models by 59% and 28%, respectively, while also showing higher learning efficiency and stronger generalization capability.
comment: Accepted by ICRA'26
RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks ICLR 2026
Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration. To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning. RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments demonstrate that RoboPARA significantly outperforms existing planning methods, achieving higher efficiency and reliability, particularly in complex task combinations. Our code is publicly available at https://github.com/AiDuanshiying/RoboPARA.
comment: Accepted to ICLR 2026
UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos ICLR 2026
Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, and accomplishing a 300 m real-world mission with only two interventions.
comment: Accepted to ICLR 2026. Project page: https://urbanverseproject.github.io/
Model Predictive Adversarial Imitation Learning for Planning from Observation ICLR 2026
Human demonstration data is often ambiguous and incomplete, motivating imitation learning approaches that also exhibit reliable planning behavior. A common paradigm to perform planning-from-demonstration involves learning a reward function via Inverse Reinforcement Learning (IRL) then deploying this reward via Model Predictive Control (MPC). Towards unifying these methods, we derive a replacement of the policy in IRL with a planning-based agent. With connections to Adversarial Imitation Learning, this formulation enables end-to-end interactive learning of planners from observation-only demonstrations. In addition to benefits in interpretability, complexity, and safety, we study and observe significant improvements on sample efficiency, out-of-distribution generalization, and robustness. The study includes evaluations in both simulated control benchmarks and real-world navigation experiments using few-to-single observation-only demonstrations.
comment: Accepted at ICLR 2026
Coordinated Control of Multiple Construction Machines Using LLM-Generated Behavior Trees with Flag-Based Synchronization
Earthwork operations face increasing demand, while workforce aging creates a growing need for automation. ROS2-TMS for Construction, a Cyber-Physical System framework for construction machinery automation, has been proposed; however, its reliance on manually designed Behavior Trees (BTs) limits scalability in cooperative operations. Recent advances in Large Language Models (LLMs) offer new opportunities for automated task planning, yet most existing studies remain limited to simple robotic systems. This paper proposes an LLM-based workflow for automatic generation of BTs toward coordinated operation of construction machines. The method introduces synchronization flags managed through a Global Blackboard, enabling multiple BTs to share execution states and represent inter-machine dependencies. The workflow consists of Action Sequence generation and BT generation using LLMs. Simulation experiments on 30 construction instruction scenarios achieved up to a 93% success rate in coordinated multi-machine tasks. Real-world experiments using an excavator and a dump truck further demonstrate successful cooperative execution, indicating the potential to reduce manual BT design effort in construction automation. These results highlight the feasibility of applying LLM-driven task planning to practical earthwork automation.
comment: 9 pages, 7 figures
Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents' movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance. An edge with a high weight only discourages agents from using it, instead of prohibiting agents from traversing it. In this paper, we explore the need to incorporate edge directions optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two phases separately. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.
Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts
We humans rely on a wide range of commonsense knowledge to interact with an extensive number and categories of objects in the physical world. Likewise, such commonsense knowledge is also crucial for robots to successfully develop generalized object manipulation skills. While recent advancements in Multi-modal Large Language Models (MLLMs) have showcased their impressive capabilities in acquiring commonsense knowledge and conducting commonsense reasoning, effectively grounding this semantic-level knowledge produced by MLLMs to the physical world to thoroughly guide robots in generalized articulated object manipulation remains a challenge that has not been sufficiently addressed. To this end, we introduce analytic concepts, procedurally defined upon mathematical symbolism that can be directly computed and simulated by machines. By leveraging the analytic concepts as a bridge between the semantic-level knowledge inferred by MLLMs and the physical world where real robots operate, we can figure out the knowledge of object structure and functionality with physics-informed representations, and then use the physically grounded knowledge to instruct robot control policies for generalized and accurate articulated object manipulation. Extensive experiments in both real world and simulation demonstrate the superiority of our approach.
Automated Action Generation based on Action Field for Robotic Garment Smoothing and Alignment
Garment manipulation using robotic systems is a challenging task due to the diverse shapes and deformable nature of fabric. In this paper, we propose a novel method for robotic garment smoothing and alignment that significantly improves the accuracy while reducing computational time compared to previous approaches. Our method features an action generator that directly interprets scene images and generates pixel-wise end-effector action vectors using a neural network. The network also predicts a manipulation score map that ranks potential actions, allowing the system to select the most effective action. Extensive simulation experiments demonstrate that our method achieves higher smoothing and alignment performances and faster computation time than previous approaches. Real-world experiments show that the proposed method generalizes well to different garment types and successfully flattens garments.
comment: Accepted by IEEE Transactions on Automation Science and Engineering
Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation
Enhancing the spatial perception capabilities of mobile robots is crucial for achieving embodied Vision-and-Language Navigation (VLN). Although significant progress has been made in simulated environments, directly transferring these capabilities to real-world scenarios often results in severe hallucination phenomena, causing robots to lose effective spatial awareness. To address this issue, we propose BrainNav, a bio-inspired spatial cognitive navigation framework based on biological spatial cognition theories and cognitive map theory. BrainNav integrates dual-map (coordinate map and topological map) and dual-orientation (relative orientation and absolute orientation) strategies, enabling real-time navigation through dynamic scene capture and path planning. Its five core modules (Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, and Cerebellar Motion Execution Unit) mimic biological cognitive functions to reduce spatial hallucinations and enhance adaptability. Validated in a zero-shot real-world lab environment using the Limo Pro robot, BrainNav, compatible with GPT-4, outperforms existing State-of-the-Art (SOTA) Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods without fine-tuning.
AoE: Always-on Egocentric Human Video Collection for Embodied AI
Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. In fact, humans themselves are ideal physically embodied agents. Therefore, obtaining egocentric real-world interaction data from globally distributed "human agents" offers advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which aims to simplify hardware dependencies by leveraging humans themselves and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile APP that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed Ego video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.
ULC: A Unified and Fine-Grained Controller for Humanoid Loco-Manipulation
Loco-Manipulation for humanoid robots aims to enable robots to integrate mobility with upper-body tracking capabilities. Most existing approaches adopt hierarchical architectures that decompose control into isolated upper-body (manipulation) and lower-body (locomotion) policies. While this decomposition reduces training complexity, it inherently limits coordination between subsystems and contradicts the unified whole-body control exhibited by humans. We demonstrate that a single unified policy can achieve a combination of tracking accuracy, large workspace, and robustness for humanoid loco-manipulation. We propose the Unified Loco-Manipulation Controller (ULC), a single-policy framework that simultaneously tracks root velocity, root height, torso rotation, and dual-arm joint positions in an end-to-end manner, proving the feasibility of unified control without sacrificing performance. We achieve this unified control through key technologies: sequence skill acquisition for progressive learning complexity, residual action modeling for fine-grained control adjustments, command polynomial interpolation for smooth motion transitions, random delay release for robustness to deployment variations, load randomization for generalization to external disturbances, and center-of-gravity tracking for providing explicit policy gradients to maintain stability. We validate our method on the Unitree G1 humanoid robot with a 3-DOF (degrees-of-freedom) waist. Compared with strong baselines, ULC shows better tracking performance than disentangled methods and demonstrates larger workspace coverage. The unified dual-arm tracking enables precise manipulation under external loads while maintaining coordinated whole-body control for complex loco-manipulation tasks.
V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space
Reachability analysis has become increasingly important in robotics to distinguish safe from unsafe states. Unfortunately, existing reachability and safety analysis methods often fall short, as they typically require known system dynamics or large datasets to estimate accurate system models, are computationally expensive, and assume full state information. A recent method, called MORALS, aims to address these shortcomings by using topological tools to estimate Regions of Attraction (ROA) in a low-dimensional latent space. However, MORALS still relies on full state knowledge and has not been studied when only sensor measurements are available. This paper presents Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space (V-MORALS). V-MORALS takes in a dataset of image-based trajectories of a system under a given controller, and learns a latent space for reachability analysis. Using this learned latent space, our method is able to generate well-defined Morse Graphs, from which we can compute ROAs for various systems and controllers. V-MORALS provides capabilities similar to the original MORALS architecture without relying on state knowledge, and using only high-level sensor data. Our project website is at: https://v-morals.onrender.com.
Viability-Preserving Passive Torque Control
Conventional passivity-based torque controllers for manipulators are typically unconstrained, which can lead to safety violations under external perturbations. In this paper, we employ viability theory to pre-compute safe sets in the state-space of joint positions and velocities. These viable sets, constructed via data-driven and analytical methods for self-collision avoidance, external object collision avoidance and joint-position and joint-velocity limits, provide constraints on joint accelerations and thus joint torques via the robot dynamics. A quadratic programming-based control framework enforces these constraints on a passive controller tracking a dynamical system, ensuring the robot states remain within the safe set in an infinite time horizon. We validate the proposed approach through simulations and hardware experiments on a 7-DoF Franka Emika manipulator. In comparison to a baseline constrained passive controller, our method operates at higher control-loop rates and yields smoother trajectories.
comment: 8 pages, 7 figures, Project Website: https://vpp-tc.github.io/webpage/
Multimodal Sensing for Robot-Assisted Sub-Tissue Feature Detection in Physiotherapy Palpation
Robotic palpation relies on force sensing, but force signals in soft-tissue environments are variable and cannot reliably reveal subtle subsurface features. We present a compact multimodal sensor that integrates high-resolution vision-based tactile imaging with a 6-axis force-torque sensor. In experiments on silicone phantoms with diverse subsurface tendon geometries, force signals alone frequently produce ambiguous responses, while tactile images reveal clear structural differences in presence, diameter, depth, crossings, and multiplicity. Yet accurate force tracking remains essential for maintaining safe, consistent contact during physiotherapeutic interaction. Preliminary results show that combining tactile and force modalities enables robust subsurface feature detection and controlled robotic palpation.
comment: Accepted by AMSE Design of Medical Device 2026
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io
Towards an Adaptive Social Game-Playing Robot: An Offline Reinforcement Learning-Based Framework
HRI research increasingly demands robots that go beyond task execution to respond meaningfully to user emotions. This is especially needed when supporting students with learning difficulties in game-based learning scenarios. Here, the objective of these robots is to train users in game-playing skills, which requires the robots to obtain input about users' interests and engagement. In this paper, we present a system for an adaptive social game-playing robot. However, creating such an agent through online RL requires extensive real-world training data and can potentially be uncomfortable for users. To address this, we investigate offline RL as a safe and efficient alternative. We introduce a system architecture that integrates multimodal emotion recognition and adaptive robotic responses. We also evaluate the performance of various offline RL algorithms using a dataset collected from a real-world human-robot game-playing scenario. Our results indicate that BCQ and DDQN offer the greatest robustness to hyperparameter variations, whereas CQL is the most effective at mitigating overestimation bias. Through this research, we aim to inform the selection and design of reliable offline RL policies for real-world social robotics. Ultimately, this work provides a foundational step toward creating socially intelligent agents that can learn complex and emotion-adaptive behaviors entirely from offline datasets, ensuring both human comfort and practical scalability.
comment: Submitted to conference
REFLEX: Metacognitive Reasoning for Reflective Zero-Shot Robotic Planning with Large Language Models
While large language models (LLMs) have shown great potential across various domains, their applications in robotics remain largely limited to static prompt-based behaviors and still face challenges in complex tasks under zero-shot or few-shot settings. Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing their ability to perform robotic tasks with minimal demonstrations? In this paper, we present REFLEX, a framework that integrates metacognitive learning into LLM-powered multi-robot collaboration. The system equips the LLM-powered robotic agents with a skill decomposition and self-reflection mechanism that identifies modular skills from prior tasks, reflects on failures in unseen task scenarios, and synthesizes effective new solutions. We propose a more challenging robotic benchmark task and evaluate our framework on the existing benchmark and the novel task. Experimental results show that our metacognitive learning framework significantly outperforms existing baselines. Moreover, we observe that our framework can generate solutions that differ from the ground truth yet still successfully complete the tasks. These findings support our hypothesis that metacognitive learning can foster creativity in robotic planning.
SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning
Interpreting visual observations and natural language instructions for complex task execution remains a key challenge in robotics and AI. Despite recent advances, language-driven navigation is still difficult, particularly for UAVs in small-scale 3D environments. Existing Vision-Language Navigation (VLN) approaches are mostly designed for ground robots and struggle to generalize to aerial tasks that require full 3D spatial reasoning. The emergence of large Vision-Language Models (VLMs), such as GPT and Claude, enables zero-shot semantic reasoning from visual and textual inputs. However, these models lack spatial grounding and are not directly applicable to navigation. To address these limitations, SoraNav is introduced, an adaptive UAV navigation framework that integrates zero-shot VLM reasoning with geometry-aware decision-making. Geometric priors are incorporated into image annotations to constrain the VLM action space and improve decision quality. A hybrid switching strategy leverages navigation history to alternate between VLM reasoning and geometry-based exploration, mitigating dead-ends and redundant revisits. A PX4-based hardware-software platform, comprising both a digital twin and a physical micro-UAV, enables reproducible evaluation. Experimental results show that in 2.5D scenarios, our method improves Success Rate (SR) by 25.7% and Success weighted by Path Length (SPL) by 17%. In 3D scenarios, it improves SR by 29.5% and SPL by 18.5% relative to the baseline.
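For readers unfamiliar with the two metrics reported above, the sketch below computes Success Rate (SR) and Success weighted by Path Length (SPL) using the standard definition from the embodied navigation literature: each successful episode contributes the ratio of the shortest-path length to the longer of the actual and shortest path lengths. The function names and episode values are hypothetical and are not taken from the SoraNav evaluation; all path lengths are assumed to be positive.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (successes are 0 or 1)."""
    return sum(successes) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        0/1 success indicator per episode
    shortest_lengths: shortest-path length from start to goal per episode
    path_lengths:     length of the path the agent actually travelled
    """
    total = 0.0
    for s, shortest, travelled in zip(successes, shortest_lengths, path_lengths):
        # A failed episode contributes 0; a success is discounted by any detour.
        total += s * shortest / max(travelled, shortest)
    return total / len(successes)


if __name__ == "__main__":
    # Three hypothetical episodes: a detour, a failure, and an optimal path.
    successes = [1, 0, 1]
    shortest = [10.0, 5.0, 4.0]
    travelled = [20.0, 5.0, 4.0]
    print(success_rate(successes), spl(successes, shortest, travelled))
```

SPL can never exceed SR, since each successful episode is weighted by a factor of at most 1.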
Multiagent Systems
Boltzmann-based Exploration for Robust Decentralized Multi-Agent Planning ICAPS 2026
Decentralized Monte Carlo Tree Search (Dec-MCTS) is widely used for cooperative multi-agent planning but struggles in sparse or skewed reward environments. We introduce Coordinated Boltzmann MCTS (CB-MCTS), which replaces deterministic UCT with a stochastic Boltzmann policy and a decaying entropy bonus for sustained yet focused exploration. While Boltzmann exploration has been studied in single-agent MCTS, applying it in multi-agent systems poses unique challenges. CB-MCTS is the first to address this. We analyze CB-MCTS in the simple-regret setting and show in simulations that it outperforms Dec-MCTS in deceptive scenarios and remains competitive on standard benchmarks, providing a robust solution for multi-agent planning.
comment: To appear in ICAPS 2026
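The core contrast in the abstract above is between a deterministic argmax-style rule such as UCT and a stochastic Boltzmann (softmax) policy over child values. The following minimal sketch shows only generic Boltzmann action selection; it omits the decaying entropy bonus and the multi-agent coordination, so it should not be read as the CB-MCTS algorithm. The function names, the value estimates, and the temperature are illustrative assumptions.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over value estimates; lower temperature means greedier choices."""
    best = max(q_values)
    # Subtracting the maximum keeps exp() numerically stable.
    weights = [math.exp((q - best) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_child(q_values, temperature, rng=random):
    """Sample a child index in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


if __name__ == "__main__":
    q = [0.2, 0.5, 0.1]  # hypothetical value estimates for three children
    print(boltzmann_probs(q, temperature=0.5))
    print(select_child(q, temperature=0.5, rng=random.Random(0)))
```

Unlike an argmax rule, every child keeps a nonzero selection probability, which is what lets a sampler of this kind keep visiting branches whose current estimates look poor.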
GenDB: The Next Generation of Query Processing -- Synthesized, Not Engineered
Traditional query processing relies on engines that are carefully optimized and engineered by many experts. However, new techniques and user requirements evolve rapidly, and existing systems often cannot keep pace. At the same time, these systems are difficult to extend due to their internal complexity, and developing new systems requires substantial engineering effort and cost. In this paper, we argue that recent advances in Large Language Models (LLMs) are starting to shape the next generation of query processing systems. We propose using LLMs to synthesize execution code for each incoming query, instead of continuously building, extending, and maintaining complex query processing engines. As a proof of concept, we present GenDB, an LLM-powered agentic system that generates instance-optimized and customized query execution code tailored to specific data, workloads, and hardware resources. We implemented an early prototype of GenDB that uses Claude Code Agent as the underlying component in the multi-agent system, and we evaluate it on OLAP workloads. We use queries from the well-known TPC-H benchmark and also construct a new benchmark designed to reduce potential data leakage from LLM training data. We compare GenDB with state-of-the-art query engines, including DuckDB, Umbra, MonetDB, ClickHouse, and PostgreSQL. GenDB achieves significantly better performance than these systems. Finally, we discuss the current limitations of GenDB and outline future extensions and related research challenges.
Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
When automating plan generation for a real-world sequential decision problem, the goal is often not to replace the human planner, but to facilitate an iterative reasoning and elicitation process, where the human's role is to guide the AI planner according to their preferences and expertise. In this context, explanations that respond to users' questions are crucial to improve their understanding of potential solutions and increase their trust in the system. To enable natural interaction with such a system, we present a multi-agent Large Language Model (LLM) architecture that is agnostic to the explanation framework and enables user- and context-dependent interactive explanations. We also describe an instantiation of this framework for goal-conflict explanations, which we use to conduct a user study comparing the LLM-powered interaction with a baseline template-based explanation interface.
comment: Preprint
Selection as Power: Constrained Reinforcement for Bounded Decision Authority
Selection as Power argued that upstream selection authority, rather than internal objective misalignment, constitutes a primary source of risk in high-stakes agentic systems. However, the original framework was static: governance constraints bounded selection power but did not adapt over time. In this work, we extend the framework to dynamic settings by introducing incentivized selection governance, where reinforcement updates are applied to scoring and reducer parameters under externally enforced sovereignty constraints. We formalize selection as a constrained reinforcement process in which parameter updates are projected onto governance-defined feasible sets, preventing concentration beyond prescribed bounds. Across multiple regulated financial scenarios, unconstrained reinforcement consistently collapses into deterministic dominance under repeated feedback, especially at higher learning rates. In contrast, incentivized governance enables adaptive improvement while maintaining bounded selection concentration. Projection-based constraints transform reinforcement from irreversible lock-in into controlled adaptation, with governance debt quantifying the tension between optimization pressure and authority bounds. These results demonstrate that learning dynamics can coexist with structural diversity when sovereignty constraints are enforced at every update step, offering a principled approach to integrating reinforcement into high-stakes agentic systems without surrendering bounded selection authority.
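One way to picture "updates projected onto a feasible set that bounds concentration" is a selection distribution whose largest share may not exceed a cap. The sketch below applies a hypothetical reinforcement update and then maps the result back into the capped set by clipping and proportionally redistributing the excess. This is a simple feasibility-restoring rescaling, not a Euclidean projection and not the operator used in the paper; `cap_concentration`, `reinforce_then_cap`, the cap value, and the multiplicative update rule are all assumptions made for illustration.

```python
def cap_concentration(weights, cap):
    """Return a distribution with every share <= cap (illustrative rescaling).

    Shares above the cap are clipped; the removed mass is spread over the
    remaining entries in proportion to their current shares.
    """
    n = len(weights)
    if cap * n < 1.0:
        raise ValueError("cap too small: the shares could not sum to 1")
    total = sum(weights)
    p = [w / total for w in weights]
    capped = set()
    while True:
        over = [i for i in range(n) if i not in capped and p[i] > cap]
        if not over:
            return p
        capped.update(over)
        free = [i for i in range(n) if i not in capped]
        free_mass = sum(p[i] for i in free)
        budget = 1.0 - cap * len(capped)  # mass left for the uncapped entries
        for i in capped:
            p[i] = cap
        for i in free:
            p[i] = budget * p[i] / free_mass if free_mass > 0 else budget / len(free)


def reinforce_then_cap(p, chosen, reward, lr, cap):
    """Hypothetical multiplicative update followed by the capping step."""
    updated = list(p)
    updated[chosen] *= 1.0 + lr * reward
    return cap_concentration(updated, cap)


if __name__ == "__main__":
    p = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(50):  # option 0 is rewarded on every round
        p = reinforce_then_cap(p, chosen=0, reward=1.0, lr=0.5, cap=0.5)
    print(p)
```

Without the capping step, repeated rewards for one option would push its share toward 1; with it, the share stays at or below the cap at every update.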
A speciation simulation that partly passes open-endedness tests
One of the main goals of artificial life research is to recreate in artificial systems the trends for ever more complex and novel entities, interactions and processes that we see in Earth's biosphere, that is, to create open-ended systems. In this paper, we test for Tokyo type 1 open-ended evolution (OEE) of the Tree of Life Simulation (ToLSim), an artificial life software created by Lana Sinapayen. To do so, we conducted an experiment to measure evolutionary activity statistics. These require us to define the notion of components. Here, we define components as the agent's genes. The results show that ToLSim is capable of exhibiting unbounded total cumulative evolutionary activity. However, total and median normalized cumulative evolutionary activity appear bounded and new evolutionary activity is persistently null, suggesting that ToLSim is not open-ended. Further studies on ToLSim could repeat this experiment with individuals or even species, rather than genes, to test whether the present results are valid.
comment: 12 pages, 4 figures
The Observer-Situation Lattice: A Unified Formal Basis for Perspective-Aware Cognition AAMAS 2026
Autonomous agents operating in complex, multi-agent environments must reason about what is true from multiple perspectives. Existing approaches often struggle to integrate the reasoning of different agents, at different times, and in different contexts, typically handling these dimensions in separate, specialized modules. This fragmentation leads to a brittle and incomplete reasoning process, particularly when agents must understand the beliefs of others (Theory of Mind). We introduce the Observer-Situation Lattice (OSL), a unified mathematical structure that provides a single, coherent semantic space for perspective-aware cognition. OSL is a finite complete lattice where each element represents a unique observer-situation pair, allowing for a principled and scalable approach to belief management. We present two key algorithms that operate on this lattice: (i) Relativized Belief Propagation, an incremental update algorithm that efficiently propagates new information, and (ii) Minimal Contradiction Decomposition, a graph-based procedure that identifies and isolates contradiction components. We prove the theoretical soundness of our framework and demonstrate its practical utility through a series of benchmarks, including classic Theory of Mind tasks and a comparison with established paradigms such as assumption-based truth maintenance systems. Our results show that OSL provides a computationally efficient and expressive foundation for building robust, perspective-aware autonomous agents.
comment: Extended version of the AAMAS 2026 paper with the same title
Exploration enhances cooperation in the multi-agent communication system
Designing protocols enhancing cooperation for multi-agent systems remains a grand challenge. Cheap talk, defined as costless, non-binding communication before formal action, serves as a pivotal solution. However, existing theoretical frameworks often exclude random exploration, or noise, for analytical tractability, leaving its functional impact on system performance largely unexplored. To bridge this gap, we propose a two-stage evolutionary game-theoretical model, integrating signalling with a donation game, with exploration explicitly incorporated into the decision-making. Our agent-based simulations across topologies reveal a universal optimal exploration rate that maximises system-wide cooperation. Mechanistically, moderate exploration undermines the stability of defection and catalyses the self-organised cooperative alliances, facilitating their cyclic success. Moreover, the cooperation peak is enabled by the delicate balance between oscillation period and amplification. Our findings suggest that rather than pursuing deterministic rigidity, embracing strategic exploration, as a form of engineered randomness, is essential to sustain cooperation and realise optimal performance in communication-based intelligent systems.
CUCo: An Agentic Framework for Compute and Communication Co-design
Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.
RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection
Infrastructure as code (IaC) tools automate cloud provisioning, but verifying that deployed systems remain consistent with the IaC specifications remains challenging. Such configuration drift occurs because of bugs in the IaC specification, manual changes, or system updates. Large language model (LLM)-based agentic AI systems can automate the analysis of large volumes of telemetry data, making them suitable for the detection of configuration drift. However, existing agentic systems implicitly assume that the tools they invoke always return correct outputs, making them vulnerable to erroneous tool responses. Since agents cannot distinguish whether an anomalous tool output reflects a real infrastructure problem or a broken tool, such errors may cause missed drift or false alarms, reducing reliability precisely when it is most needed. We introduce RIVA (Robust Infrastructure by Verification Agents), a novel multi-agent system that performs robust IaC verification even when tools produce incorrect or misleading outputs. RIVA employs two specialized agents, a verifier agent and a tool generation agent, that collaborate through iterative cross-validation, multi-perspective verification, and tool call history tracking. Evaluation on the AIOpsLab benchmark demonstrates that RIVA, in the presence of erroneous tool responses, recovers task accuracy from 27.3% when using a baseline ReAct agent to 50.0% on average. RIVA also improves task accuracy from 28% to 43.8% without erroneous tool responses. Our results show that cross-validation of diverse tool calls enables more reliable autonomous infrastructure verification in production cloud environments.
TritonDFT: Automating DFT with a Multi-Agent Framework
Density Functional Theory (DFT) is a cornerstone of materials science, yet executing DFT in practice requires coordinating a complex, multi-step workflow. Existing tools and LLM-based solutions automate some of the steps, but lack support for full workflow automation, diverse task adaptation, and accuracy-cost trade-off optimization in DFT configuration. To this end, we present TritonDFT, a multi-agent framework that enables efficient and accurate DFT execution through an expert-curated, extensible workflow design, Pareto-aware parameter inference, and multi-source knowledge augmentation. We further introduce DFTBench, a benchmark for evaluating the agent's multi-dimensional capabilities, spanning science expertise, trade-off optimization, HPC knowledge, and cost efficiency. TritonDFT provides an open user interface for real-world usage. Our website is at https://www.tritondft.com. Our source code and benchmark suite are available at https://github.com/Leo9660/TritonDFT.git.
Graphon Mean-Field Control for Cooperative Multi-Agent Reinforcement Learning
The marriage between mean-field theory and reinforcement learning has shown a great capacity to solve large-scale control problems with homogeneous agents. To break the homogeneity restriction of mean-field theory, recent work has introduced graphon theory into the mean-field paradigm. In this paper, we propose a graphon mean-field control (GMFC) framework to approximate cooperative multi-agent reinforcement learning (MARL) with nonuniform interactions and show that the approximation error is of order $\mathcal{O}(\frac{1}{\sqrt{N}})$, with $N$ the number of agents. By discretizing the graphon index of GMFC, we further introduce a smaller class of GMFC called block GMFC, which is shown to approximate cooperative MARL well. Our empirical studies on several examples demonstrate that our GMFC approach is comparable with state-of-the-art MARL algorithms while enjoying better scalability.
Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents' movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance: an edge with a high weight only discourages agents from using it, instead of prohibiting agents from traversing it. In this paper, we explore the need to incorporate edge-direction optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two separate phases. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.
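The difference between soft guidance (a high edge weight) and strict guidance (a removed edge direction) can be illustrated with a toy example. The sketch below is not the MGGO method from the abstract; it is a minimal single-agent Dijkstra search over a hypothetical guidance graph, with node names and weights chosen purely for illustration.

```python
import heapq

def shortest_path_cost(edges, start, goal):
    """Dijkstra over a directed weighted graph given as {(u, v): cost}."""
    adjacency = {}
    for (u, v), cost in edges.items():
        adjacency.setdefault(u, []).append((v, cost))
    best = {start: 0.0}
    frontier = [(0.0, start)]
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in adjacency.get(node, []):
            cand = dist + cost
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(frontier, (cand, nxt))
    return float("inf")  # goal unreachable

# Hypothetical corridor A <-> B with a detour through C.
# Soft guidance: the B -> A direction is discouraged by a high weight.
soft_with_detour = {("A", "B"): 1, ("B", "A"): 5,
                    ("A", "C"): 1, ("C", "A"): 1,
                    ("B", "C"): 1, ("C", "B"): 1}
# Same corridor without the detour.
soft_corridor = {("A", "B"): 1, ("B", "A"): 5}
# Strict guidance: the B -> A direction is removed entirely.
strict_corridor = {("A", "B"): 1}

detour_cost = shortest_path_cost(soft_with_detour, "B", "A")   # detour B -> C -> A
soft_cost = shortest_path_cost(soft_corridor, "B", "A")        # still traversable
strict_cost = shortest_path_cost(strict_corridor, "B", "A")    # prohibited
```

With a detour available, the high weight steers the agent away from B -> A. Without one, the soft-guided agent still traverses the discouraged edge at a finite cost, whereas removing the direction makes the move impossible.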
Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems ICLR 2026
While Multi-Agent Systems (MAS) excel at complex tasks, their growing autonomy with operational complexity often leads to critical inefficiencies, such as excessive token consumption and failures arising from misinformation. Existing methods primarily focus on post-hoc failure attribution, lacking proactive, real-time interventions to enhance robustness and efficiency. To this end, we introduce SupervisorAgent, a lightweight and modular framework for runtime, adaptive supervision that operates without altering the base agent's architecture. Triggered by an LLM-free adaptive filter, SupervisorAgent intervenes at critical junctures to proactively correct errors, guide inefficient behaviors, and purify observations. On the challenging GAIA benchmark, SupervisorAgent reduces the token consumption of the Smolagent framework by an average of 29.68% without compromising its success rate. Extensive experiments across five additional benchmarks (math reasoning, code generation, and question answering) and various SoTA foundation models validate the broad applicability and robustness of our approach.
comment: Accepted to ICLR 2026. The code is available at https://github.com/LINs-lab/SupervisorAgent
NeuroWise: A Multi-Agent LLM "Glass-Box" System for Practicing Double-Empathy Communication with Autistic Partners
The double empathy problem frames communication difficulties between neurodivergent and neurotypical individuals as arising from mutual misunderstanding, yet most interventions focus on autistic individuals. We present NeuroWise, a multi-agent LLM-based coaching system that supports neurotypical users through stress visualization, interpretation of internal experiences, and contextual guidance. In a between-subjects study (N=30), NeuroWise was rated as helpful by all participants and showed a significant condition-time effect on deficit-based attributions (p=0.02): NeuroWise users reduced deficit framing, while baseline users shifted toward blaming autistic "deficits" after difficult interactions. NeuroWise users also completed conversations more efficiently (37% fewer turns, p=0.03). These findings suggest that AI-based interpretation can support attributional change by helping users recognize communication challenges as mutual.
comment: Accepted to ACM CHI 2026
Reuse, Don't Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration
Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.
Personalized Collaborative Learning with Affinity-Based Variance Reduction ICLR 2026
Multi-agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels -- gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of $\max\{n^{-1}, δ\}$, where $n$ is the number of agents and $δ\in[0,1]$ measures their heterogeneity. This affinity-based acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.
comment: Published as a conference paper at ICLR 2026
Systems and Control (EESS)
Characterizing Information Accuracy in Timeliness-Based Gossip Networks
We investigate information accuracy in timeliness-based gossip networks where the source evolves according to a continuous-time Markov chain (CTMC) with $M$ states and disseminates status updates to a network of $n$ nodes. In addition to direct source updates, nodes exchange their locally stored packets via gossip and accept incoming packets solely based on whether the incoming packet is fresher than their local copy. As a result, a node can possess the freshest packet in the network while still not having the current source state. To quantify the amount of accurate information flowing in the network under such a gossiping scheme, we introduce two accuracy metrics, average accuracy, defined as the expected fraction of nodes carrying accurate information in any given subset, and freshness-based accuracy, defined as the accuracy of the freshest node in any given subset. Using a stochastic hybrid systems (SHS) framework, we first derive steady-state balance equations and obtain matrix-valued recursions that characterize these metrics in fully connected gossip networks under binary CTMCs. We then extend our analysis to the general multi-state information source using a joint CTMC approach. Finally, we quantify the fraction of nodes whose information is accurate due to direct source pushes versus gossip exchanges. We verify our findings with numerical analyses and provide asymptotic insights.
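The observation that a node can hold the freshest packet while not holding the current source state can be made concrete with a small example. The sketch below is an assumption-laden illustration, not the paper's SHS analysis: the `Packet` type and function names are hypothetical, and the two metrics are computed for a single snapshot rather than as steady-state expectations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    state: int        # source state carried by the packet
    timestamp: float  # generation time at the source

def gossip_accept(local, incoming):
    """Timeliness-based rule: keep the fresher packet; content is never compared."""
    return incoming if incoming.timestamp > local.timestamp else local

def average_accuracy(packets, source_state):
    """Fraction of nodes whose stored state equals the current source state."""
    return sum(p.state == source_state for p in packets) / len(packets)

def freshness_based_accuracy(packets, source_state):
    """Accuracy (0 or 1) of the node holding the freshest packet."""
    freshest = max(packets, key=lambda p: p.timestamp)
    return float(freshest.state == source_state)

# Binary source: state 0 at t=1, state 1 at t=2, then back to 0 (no new push yet).
current_source_state = 0
node_a = Packet(state=0, timestamp=1.0)  # stale packet, but happens to be accurate
node_b = Packet(state=1, timestamp=2.0)  # fresher packet, now inaccurate

before = average_accuracy([node_a, node_b], current_source_state)
node_a_after = gossip_accept(node_a, node_b)  # node a adopts the fresher packet
after = average_accuracy([node_a_after, node_b], current_source_state)
freshest_ok = freshness_based_accuracy([node_a_after, node_b], current_source_state)
```

In this snapshot a gossip exchange lowers accuracy: node a replaces an old but accurate packet with a fresher one that no longer matches the source.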
Resilient Chaotic Cross-Layer Routing for Smart Grid IoT Networks
This paper presents the Distributed Adaptive Multi-Radio Cross-Layer Routing (DAMCR) protocol, designed to enhance reliability, adaptability, and energy efficiency in smart grid and industrial Internet of Things (IoT) communication networks. DAMCR integrates Chaotic Frequency-Hopping Spread Spectrum (C-FHSS) to improve physical-layer security and jamming resilience with Link-Adaptive Quality Power Control (LAQPC) to dynamically regulate transmission power based on instantaneous link quality and residual node energy. To meet heterogeneous traffic requirements, the protocol incorporates priority-aware message classification that differentiates between periodic monitoring data and time-critical fault and protection messages. The proposed framework is implemented and evaluated in MATLAB using a heterogeneous network composed of LoRa, Wi-Fi, and dual-radio nodes operating under AWGN, Rayleigh, and Rician fading environments. Extensive simulation results demonstrate that DAMCR consistently achieves a Packet Delivery Ratio (PDR) exceeding 95% across all evaluated scenarios, while maintaining end-to-end latency between 17 and 23 ms, even in the presence of controlled jamming attacks. These results confirm that the tight integration of chaos-based spectrum agility, cross-technology routing, and energy-aware cross-layer adaptation significantly improves communication reliability, latency stability, and resilience compared to conventional single-radio and static-routing protocols.
A System-of-Systems Convergence Paradigm for Societal Challenges of the Anthropocene
Modern societal challenges, such as climate change, urbanization, and water resource management, demand integrated, multi-discipline, multi-problem approaches to frame and address their complexity. Unfortunately, current methodologies often operate within disciplinary silos, leading to fragmented insights and missed opportunities for convergence. A critical barrier to cross-disciplinary integration lies in the disparate ontologies that shape how different fields conceptualize and communicate knowledge. To address these limitations, this paper proposes a system-of-systems (SoS) convergence paradigm grounded in a meta-cognition map, a framework that integrates five complementary domains: real-world observations, systems thinking, visual modeling, mathematics, and computing. The paradigm is based on the Systems Modeling Language (SysML), offering a standardized, domain-neutral approach for representing and analyzing complex systems. The proposed methodology is demonstrated through a case study of the Chesapeake Bay Watershed, a socio-environmental system requiring coordination across land use, hydrology, economic and policy domains. By modeling this system with SysML, the study illustrates practical strategies for navigating interdisciplinary challenges and highlights the potential of agile SoS modeling to support large-scale, multi-dimensional decision-making.
A Hetero-functional Graph State Estimator for Watershed Systems: Application to the Chesapeake Bay
Regional watersheds are complex systems of systems encompassing hydrology, land-use decision-making, estuarine ecological feedbacks, and overlapping governance jurisdictions. Their effective management underlies many modern societal challenges and therefore requires models that capture interdependencies between natural and institutional systems. Regional-specific models such as the Chesapeake Assessment Scenario Tool, used in this paper's case study, provide valuable nutrient estimates but rely on structurally opaque watershed routing that limits integration into broader systems-level analyses. This paper introduces a modeling framework for watershed systems. First, a region-independent reference architecture is developed. Second, the Weighted Least Squares Error Hetero-functional Graph State Estimator, an extension of Hetero-functional Graph Theory (HFGT), is adapted to estimate nutrient flows from uncertain data. The framework is demonstrated through instantiation in the Chesapeake Bay Watershed. By establishing a shared ontology grounded in Systems Modeling Language and HFGT, the approach enables integration of economic and governance systems to support sustainable watershed management.
PAC Finite-Time Safety Guarantees for Stochastic Systems with Unknown Disturbance Distributions SC
We investigate the problem of establishing finite-time probabilistic safety guarantees for discrete-time stochastic dynamical systems subject to unknown disturbance distributions, using barrier certificate methods. Our approach develops a data-driven safety certification framework that relies only on a finite collection of independent and identically distributed (i.i.d.) disturbance samples. Within this framework, we propose a certification procedure such that, with confidence at least $1-δ$ over the sampled disturbances, if the output of the certification procedure is accepted, the probability that the system remains within a prescribed safe set over a finite horizon is at least $1-ε$. A key challenge lies in formally characterizing the probably approximately correct (PAC) generalization behavior induced by finite samples. To address this, we derive PAC generalization bounds using tools from VC dimension, scenario optimization, and Rademacher complexity. These results illuminate the fundamental trade-offs between sample size, model complexity, and safety tolerance, providing both theoretical insight and practical guidance for designing reliable, data-driven safety certificates in discrete-time stochastic systems.
comment: To appear in HSCC 2026
Dynamic Connectivity and Local Frequency Strength under Stochastic Variations
This paper introduces a novel metric, termed the Generalized Fiedler Vector (GFV), to evaluate the \textit{dynamic connectivity} in power systems. The proposed metric leverages the network connectivity, represented by the system Laplacian matrix, together with the nodal inertia distribution, following a formulation previously developed by the first author. By capturing the interplay between system topology and dynamic properties, the GFV provides valuable insights for the optimal siting of stochastic generation to mitigate its impact on local and system-wide frequency variability. The effectiveness of the proposed approach is demonstrated through Monte Carlo simulations performed on the IEEE 68-bus test system.
Tiny-DroNeRF: Tiny Neural Radiance Fields aboard Federated Learning-enabled Nano-drones ICRA 2026
Sub-30g nano-sized aerial robots can leverage their agility and form factor to autonomously explore cluttered and narrow environments, like in industrial inspection and search and rescue missions. However, the price for their tiny size is a strong limit in their resources, i.e., sub-100 mW microcontroller units (MCUs) delivering $\sim$100 GOps/s at best, and memory budgets well below 100 MB. Despite these strict constraints, we aim to enable complex vision-based tasks aboard nano-drones, such as dense 3D scene reconstruction: a key robotic task underlying fundamental capabilities like spatial awareness and motion planning. Top-performing 3D reconstruction methods leverage neural radiance fields (NeRF) models, which require GBs of memory and massive computation, usually delivered by high-end GPUs consuming 100s of Watts. Our work introduces Tiny-DroNeRF, a lightweight NeRF model, based on Instant-NGP, and optimized for running on a GAP9 ultra-low-power (ULP) MCU aboard our nano-drones. Then, we further empower our Tiny-DroNeRF by leveraging a collaborative federated learning scheme, which distributes the model training among multiple nano-drones. Our experimental results show a 96% reduction in Tiny-DroNeRF's memory footprint compared to Instant-NGP, with only a 5.7 dB drop in reconstruction accuracy. Finally, our federated learning scheme allows Tiny-DroNeRF to train with an amount of data otherwise impossible to keep in a single drone's memory, increasing the overall reconstruction accuracy. Ultimately, our work combines, for the first time, NeRF training on an ULP MCU with federated learning on nano-drones.
comment: This paper has been accepted for publication in the IEEE ICRA 2026 conference. ©2026 IEEE
Critical Clearing Time Enhancement of Droop-Controlled Grid-Forming Inverters with Adaptive Function-Based Parameters
With the increasing penetration of renewable energy sources, grid-forming (GFM) inverters are becoming essential for voltage and frequency regulation. However, the transient stability of GFM inverters is critically affected by the current limiters embedded in standard control schemes. This paper proposes a novel adaptive function to enhance the transient stability of droop-controlled GFM inverters. The proposed method autonomously adjusts the active power reference and the droop gain based on the terminal voltage of the inverter. In addition, acceleration of the phase angle is prevented, which maximizes the critical clearing time (CCT). The proposed method is benchmarked against two state-of-the-art GFM inverter CCT enhancement methods. The effectiveness of the proposed method is validated through electromagnetic transient (EMT) simulations in MATLAB/Simulink®.
comment: 5 pages, 7 figures
Shape-Interpretable Visual Self-Modeling Enables Geometry-Aware Continuum Robot Control
Continuum robots possess high flexibility and redundancy, making them well suited for safe interaction in complex environments, yet their continuous deformation and nonlinear dynamics pose fundamental challenges to perception, modeling, and control. Existing vision-based control approaches often rely on end-to-end learning, achieving shape regulation without explicit awareness of robot geometry or its interaction with the environment. Here, we introduce a shape-interpretable visual self-modeling framework for continuum robots that enables geometry-aware control. Robot shapes are encoded from multi-view planar images using a Bezier-curve representation, transforming visual observations into a compact and physically meaningful shape space that uniquely characterizes the robot's three-dimensional configuration. Based on this representation, neural ordinary differential equations are employed to self-model both shape and end-effector dynamics directly from data, enabling hybrid shape-position control without analytical models or dense body markers. The explicit geometric structure of the learned shape space allows the robot to reason about its body and surroundings, supporting environment-aware behaviors such as obstacle avoidance and self-motion while maintaining end-effector objectives. Experiments on a cable-driven continuum robot demonstrate accurate shape-position regulation and tracking, with shape errors within 1.56% of image resolution and end-effector errors within 2% of robot length, as well as robust performance in constrained environments. By elevating visual shape representations from two-dimensional observations to an interpretable three-dimensional self-model, this work establishes a principled alternative to vision-based end-to-end control and advances autonomous, geometry-aware manipulation for continuum robots.
Predictive Lane-Change and Routing Coordination in Bus-Priority Mixed Traffic Corridors
In this paper, we investigate the coordination of vehicle maneuvers in mixed-traffic corridors where connected and automated vehicles, human-driven vehicles, and buses interact under dedicated bus lane operations. We develop a segment-based network coordination framework that jointly optimizes lane-change and routing decisions of connected and automated vehicles to improve dedicated lane utilization while preserving bus priority. The proposed framework incorporates a predictive bus-protection mechanism that restricts vehicle access to protected lane segments within a monitoring horizon, together with a utility-driven lane-change strategy that accounts for anticipated travel time gains, downstream routing feasibility, and lane-change stability. By explicitly coupling network-level routing decisions with lane-level interaction control, the method proactively mitigates conflicts on dedicated lanes before congestion effects materialize. The proposed approach is evaluated through microscopic traffic simulations in SUMO using a realistic urban corridor. Simulation results demonstrate that the framework enhances bus schedule adherence and reduces average travel times for both automated and human-driven vehicles, while maintaining stable lane-change behavior without increasing maneuver frequency.
Battery Discharge Modeling for Electric Vehicles: A Hybrid Physics-based Residual Learning Approach
The growing integration of electric vehicle (EV) fleets into transportation services and energy systems requires accurate modeling of battery discharge and state-of-charge (SoC) evolution to ensure reliable vehicle operation and grid coordination. Existing approaches face a trade-off between interpretable but simplified physics-based models and data-driven methods that demand large datasets and may lack physical consistency. In this paper, we propose a hybrid physics-based residual learning framework for EV battery discharge modeling. A vehicle dynamics model based on force-balance equations provides an interpretable baseline estimate of energy consumption and SoC evolution, capturing aerodynamic drag, rolling resistance, and regenerative braking. A neural network residual learner then corrects discrepancies caused by complex factors such as traffic conditions and driver behavior. Experimental results on $1,500$ trip scenarios demonstrate that the proposed approach reduces the mean absolute percentage error to approximately $0.8\%$, significantly outperforming physics-only models while preserving physical interpretability and computational efficiency.
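A force-balance baseline with an additive residual correction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' model: every parameter value (mass, drag area, rolling coefficient, efficiencies, battery capacity) is hypothetical, road grade is ignored, and `residual_model` stands in for the trained neural network, which is not reproduced here.

```python
def traction_force(speed, accel, mass=1800.0, drag_area=0.65,
                   rolling_coeff=0.01, air_density=1.2, gravity=9.81):
    """Force balance on a flat road: inertia + aerodynamic drag + rolling resistance (N)."""
    aero = 0.5 * air_density * drag_area * speed ** 2
    rolling = rolling_coeff * mass * gravity if speed > 0 else 0.0
    return mass * accel + aero + rolling

def physics_soc_drop(speeds, dt, capacity_kwh=60.0, drive_eff=0.9,
                     regen_eff=0.6, **vehicle):
    """Baseline SoC drop (fraction of capacity) for a speed trace sampled every dt seconds."""
    energy_j = 0.0
    for v_prev, v_next in zip(speeds, speeds[1:]):
        accel = (v_next - v_prev) / dt
        v_mid = 0.5 * (v_prev + v_next)
        power = traction_force(v_mid, accel, **vehicle) * v_mid
        # Positive power is drawn through drivetrain losses;
        # negative power is partially recovered by regenerative braking.
        energy_j += (power / drive_eff if power >= 0 else power * regen_eff) * dt
    return energy_j / (capacity_kwh * 3.6e6)

def hybrid_soc_drop(speeds, dt, residual_model, features):
    """Physics baseline plus a learned correction (residual_model is a placeholder)."""
    return physics_soc_drop(speeds, dt) + residual_model(features)

cruise = [20.0] * 101                        # 100 s at a constant 20 m/s
braking = [20.0 - 2.0 * k for k in range(11)]  # 20 m/s to standstill in 10 s
cruise_drop = physics_soc_drop(cruise, dt=1.0)
braking_drop = physics_soc_drop(braking, dt=1.0)
corrected = hybrid_soc_drop(cruise, 1.0, lambda feats: 0.001, features=None)
```

The residual term is where effects outside the force balance, such as traffic conditions and driver behavior, would enter; here it is a constant only to show the additive structure.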
On the Stability Connection Between Discrete-Time Algorithms and Their Resolution ODEs: Applications to Min-Max Optimisation
This work establishes a rigorous connection between stability properties of discrete-time algorithms (DTAs) and corresponding continuous-time dynamical systems derived through $ O(s^r) $-resolution ordinary differential equations (ODEs). We show that for discrete- and continuous-time dynamical systems satisfying a mild error assumption, exponential stability of a common equilibrium with respect to the continuous time dynamics implies exponential stability of the corresponding equilibrium for the discrete-time dynamics, provided that the step size is chosen sufficiently small. We extend this result to common compact invariant sets. We prove that if an equilibrium is exponentially stable for the $ O(s^r) $-resolution ODE, then it is also exponentially stable for the associated DTA. We apply this framework to analyse the limit point properties of several prominent optimisation algorithms, including Two-Timescale Gradient Descent--Ascent (TT-GDA), Generalised Extragradient (GEG), Two-Timescale Proximal Point (TT-PPM), Damped Newton (DN), Regularised Damped Newton (RDN), and the Jacobian method (JM), by studying their $ O(1) $- and $ O(s) $-resolution ODEs. We show that under a proper choice of hyperparameters, the set of saddle points of the objective function is a subset of the set of exponentially stable equilibria of GEG, TT-PPM, DN, and RDN. We relax the common Hessian invariance assumption through direct analysis of the resolution ODEs, broadening the applicability of our results. Numerical examples illustrate the theoretical findings.
Koopman-based Estimation of Lyapunov Functions: Theory on a Reproducing Kernel Hilbert Space
The Koopman operator provides a general linear description of nonlinear systems, whose estimation from data (via extended dynamic mode decomposition) has been extensively studied. However, the elusive relation between the Koopman spectrum and the stability of the equilibrium point poses a challenge to utilizing the Koopman operator for stability analysis, which further hinders the construction of a universal theory of Koopman-based control. In our prior work, we defined the Koopman operator on a reproducing kernel Hilbert space (RKHS) using a linear--radial product kernel, and proved that the Koopman spectrum is confined in the unit disk of the complex plane when the origin is an asymptotically stable equilibrium point. Building on this fundamental spectrum--stability relation, here we consider the problem of Koopman operator-based Lyapunov function estimation with a given decay rate function. The decay rate function and the Lyapunov function are both specified by positive operators on the RKHS and are related by an operator algebraic Lyapunov equation (ALE), whose solution exists uniquely. Error bounds for such a Lyapunov function estimate, obtained via kernel extended dynamic mode decomposition (kEDMD), are established based on statistical learning theory and verified by a numerical study.
comment: 6 pages, 3 figures
Align and Filter: Improving Performance in Asynchronous On-Policy RL
Distributed training and increasing the gradient update frequency are practical strategies to accelerate learning and improve performance, but both exacerbate a central challenge: \textit{policy lag}, which is the mismatch between the behavior policy generating data and the learning policy being updated. Policy lag can hinder the scaling of on-policy learning algorithms to larger problems. In this paper, we identify the sources of policy lag caused by distributed learning and high update frequency. We use the findings to propose \textit{total Variation-based Advantage aligned Constrained policy Optimization (VACO)} as a practical approach to mitigate policy lag. We empirically validate our method and show that it offers better robustness to policy lag in classic RL tasks and a modern RL-for-LLM math reasoning task.
Operational Modal Analysis of Aeronautical Structures via Tangential Interpolation
Over the last decades, progress in modal analysis has enabled increasingly routine use of modal parameters, including those extracted from in-situ measurements, for applications such as structural health monitoring and finite element model updating. For output-only identification, or Operational Modal Analysis (OMA), widely adopted approaches include Stochastic Subspace Identification (SSI) methods and the Natural Excitation Technique combined with the Eigensystem Realization Algorithm (NExT-ERA). Nevertheless, SSI-based techniques may become cumbersome on large systems, while NExT-ERA fitting can struggle when measurements are contaminated by noise. To alleviate these issues, this work investigates an OMA frequency-domain formulation for aeronautical structures by coupling the Loewner Framework (LF) with NExT, yielding the proposed NExT-LF method. The method exploits the computational efficiency of LF together with the impulse response function retrieval enabled by NExT. NExT-LF is assessed on two experimental benchmarks: the eXperimental BeaRDS 2 high-aspect-ratio wing main spar and an Airbus Helicopters H135 bearingless main rotor blade. The identified modal parameters are compared against available experimental references and results obtained via SSI with Canonical Variate Analysis and NExT-ERA. The results show that the modes identified by NExT-LF correlate well with benchmark data, particularly for high-amplitude tests and in the low-frequency range.
Safe Whole-Body Loco-Manipulation via Combined Model and Learning-based Control ICRA
Simultaneous locomotion and manipulation enables robots to interact with their environment beyond the constraints of a fixed base. However, coordinating legged locomotion with arm manipulation while considering safety and compliance during contact interaction remains challenging. To this end, we propose a whole-body controller that combines a model-based admittance controller for the manipulator arm with a Reinforcement Learning (RL) policy for legged locomotion. The admittance controller maps external wrenches--such as those applied by a human during physical interaction--into desired end-effector velocities, allowing for compliant behavior. The velocities are tracked jointly by the arm and leg controllers, enabling a unified 6-DoF force response. The model-based design permits accurate force control and safety guarantees via a Reference Governor (RG), while robustness is further improved by a Kalman filter enhanced with neural networks for reliable base velocity estimation. We validate our approach in both simulation and hardware using the Unitree Go2 quadruped robot with a 6-DoF arm and wrist-mounted 6-DoF Force/Torque sensor. Results demonstrate accurate tracking of interaction-driven velocities, compliant behavior, and safe, reliable performance in dynamic settings.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA), June 2026, in Vienna, Austria
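The mapping from an external wrench to a desired end-effector velocity can be sketched with a standard textbook admittance law, $M\dot{v} + Dv = F_{ext}$, applied independently per axis. This is an illustrative assumption rather than the controller from the paper: the virtual mass and damping values are hypothetical, and the Reference Governor, RL locomotion policy, and velocity estimator are not modeled.

```python
def admittance_step(velocity, wrench, dt, virtual_mass, damping):
    """One explicit-Euler step of M * dv/dt + D * v = F_ext, per axis."""
    return [v + dt * (f - d * v) / m
            for v, f, m, d in zip(velocity, wrench, virtual_mass, damping)]

def simulate_admittance(wrench, steps, dt=0.01,
                        virtual_mass=(2.0,) * 6, damping=(20.0,) * 6):
    """Desired 6-DoF velocity after holding a constant external wrench."""
    velocity = [0.0] * 6
    for _ in range(steps):
        velocity = admittance_step(velocity, wrench, dt, virtual_mass, damping)
    return velocity

# A constant 10 N push along x; all other wrench components are zero.
push = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0]
desired_velocity = simulate_admittance(push, steps=1000)
```

Under a constant force the commanded velocity settles at $F/D$ (here 10 / 20 = 0.5 m/s along x), so lower damping yields a more compliant response to the same push.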
Ill-Conditioned Power Flow Analysis Using a Quantized State-Based Approach
This paper focuses on power flow analysis through the lens of the Newton flow, a continuous-time formulation of Newton's method. Within this framework, we explore how quantized-state concepts, originally developed as an alternative to time discretization, can be incorporated to govern the evolution of the Newton flow toward the power flow solution. This approach provides a novel perspective on adaptive step-size control and shows how state quantization can enhance robustness in ill-conditioned cases. The performance of the proposed approach is discussed with the ACTIVSg70k synthetic test system.
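The Newton flow is the ODE $\dot{x} = -J(x)^{-1} f(x)$, whose forward-Euler discretization with unit step recovers the classical Newton iteration; smaller steps give a damped variant. The sketch below shows only this time-discretized view on a hypothetical single-line power flow equation $b \sin\theta = P$. It does not implement the quantized-state scheme described in the abstract, and the values of $b$ and $P$ are illustrative.

```python
import math

def newton_flow_solve(f, jac, x0, step=1.0, tol=1e-10, max_iter=200):
    """Forward-Euler integration of dx/dt = -f(x) / f'(x) for a scalar equation.

    step=1.0 is the classical Newton iteration; 0 < step < 1 is a damped version.
    Returns the solution and the number of iterations used.
    """
    x = x0
    for k in range(max_iter):
        residual = f(x)
        if abs(residual) < tol:
            return x, k
        x -= step * residual / jac(x)
    raise RuntimeError("Newton flow integration did not converge")

# Hypothetical two-bus example: transfer P over a line with susceptance b.
b, p_load = 10.0, 5.0
mismatch = lambda theta: b * math.sin(theta) - p_load
mismatch_jac = lambda theta: b * math.cos(theta)

theta_newton, iters_newton = newton_flow_solve(mismatch, mismatch_jac, x0=0.0)
theta_damped, iters_damped = newton_flow_solve(mismatch, mismatch_jac, x0=0.0, step=0.5)
```

Both runs reach the same angle, $\theta = \pi/6$. The damped run takes more iterations, which reflects the trade-off between step size and speed that an adaptive step-size rule would manage.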
Strategic Shaping of Human Prosociality: A Latent-State POMDP Framework
We propose a decision-theoretic framework in which a robot can strategically shape a human's inferred prosocial state during repeated interactions. Modeling the human's prosociality as a latent state that evolves over time, the robot learns to infer and influence this state through its own actions, including helping and signaling. We formalize this as a latent-state POMDP with limited observations and learn the transition and observation dynamics using expectation maximization. The resulting belief-based policy balances task and social objectives, selecting actions that maximize long-term cooperative outcomes. We evaluate the model using data from user studies and show that the learned policy outperforms baseline strategies both in team performance and in increasing observed human cooperative behavior.
comment: This article has been published in IEEE Robotics and Automation Letters. https://ieeexplore.ieee.org/document/11410120
A Passivity-Agnostic Framework for Distributed Adaptive Synchronization under Unknown Leader Dynamics
We present a passivity-agnostic framework for distributed adaptive synchronization under position-only communication, bounded disturbances, and unknown leader dynamics. By passivity-agnostic we mean the design does not require the closed-loop system to be strictly positive real (SPR) a priori: it certifies SPR when present and recovers it by frequency shaping when absent. Followers are heterogeneous second-order systems with unknown (possibly unstable) dynamics. In the SPR regime, a structured reparameterization yields gradient-based adaptive error dynamics; Lyapunov analysis guarantees global asymptotic synchronization in the disturbance-free case, exact rejection of constant disturbances, and bounded responses to time-varying disturbances, with parameter convergence under persistent excitation. In the non-SPR regime, frequency shaping recovers effective passivity of the unshaped transfer function, enabling the same stability guarantees via standard passivity/Lyapunov arguments using the Meyer-Kalman-Yakubovich (MKY) lemma. Simulations across star, cyclic, path, and arbitrary graphs demonstrate scalable synchronization, robust tracking, and parameter adaptation under multiple disturbance profiles, confirming that the frequency-shaped non-SPR designs match the performance of the SPR case.
comment: This extended version is accepted for publication in The 2026 American Control Conference (ACC)
MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination
With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which cannot fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for multi-building flexibility coordination, was developed. MuFlex enables synchronous information exchange and co-simulation across multiple detailed building models programmed in EnergyPlus and Modelica, and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform's physics-based capabilities and workflow were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic (SAC) algorithm. The results show that, when coordinating the four buildings, SAC reduced the aggregated peak demand by nearly 12% while maintaining indoor comfort and keeping the power demand below the threshold. Additionally, the platform's scalability was investigated through computational benchmarking on building clusters with varying sizes, model types, and simulation programs.
comment: The platform is released open-source on GitHub: https://github.com/BuildNexusX/MuFlex
Novel Stability Criteria for Discrete and Hybrid Systems via Ramanujan Inner Products
This paper introduces a Ramanujan inner product and its corresponding norm, establishing a novel framework for the stability analysis of hybrid and discrete-time systems as an alternative to traditional Euclidean metrics. We establish new $ε$-$δ$ stability conditions that utilize the unique properties of Ramanujan summations and their relationship with number-theoretic concepts. The proposed approach provides enhanced robustness guarantees and reveals fundamental connections between system stability and arithmetic properties of the system dynamics. Theoretical results are rigorously proven, and simulation results on numerical examples are presented to validate the efficacy of the proposed approach.
comment: 14 pages, 2 figures
Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning
Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.
Universal Dynamics with Globally Controlled Analog Quantum Simulators
Analog quantum simulators with global control fields have emerged as powerful platforms for exploring complex quantum phenomena. Despite these advances, a fundamental theoretical question remains unresolved: to what extent can such systems realize universal quantum dynamics under global control? Here we establish a necessary and sufficient condition for universal quantum computation using only global pulse control, proving that a broad class of analog quantum simulators is, in fact, universal. We further extend this framework to fermionic and bosonic systems, including modern platforms such as ultracold atoms in optical superlattices. Moreover, we observe that analog simulators driven by random global pulses exhibit information scrambling comparable to random unitary circuits. In a dual-species neutral-atom array setup, the measurement outcomes anti-concentrate on a $\log N$ timescale despite the presence of only temporal randomness, opening opportunities for efficient randomness generation. To bridge theoretical possibility with experimental reality, we introduce \emph{direct quantum optimal control}, a control framework that enables the synthesis of complex effective Hamiltonians while incorporating realistic hardware constraints. Using this approach, we experimentally engineer three-body interactions outside the blockade regime and demonstrate topological dynamics on a Rydberg-atom array. Experimental measurements reveal dynamical signatures of symmetry-protected-topological edge modes, confirming both the expressivity and feasibility of our method. Our work opens a new avenue for quantum simulation beyond native hardware Hamiltonians, enabling the engineering of effective multi-body interactions and advancing the frontier of quantum information processing with globally-controlled analog platforms.
comment: The updated version adds new applications and discussions on information scrambling with globally controlled analog quantum systems. 11 pages, 6 figures with Methods. HYH, AMG, and LC contributed equally to this work. Updated acknowledgement
TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks
Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point (FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present TAO: a Tolerance-Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. TAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement TAO as a PyTorch-compatible runtime and a contract layer currently deployed on the Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers, and diffusion models on A100, H100, RTX6000, and RTX4090, empirical thresholds are $10^2$ to $10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. Together, TAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
comment: 18 pages, 8 figures
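To see why bitwise equality is a fragile acceptance criterion for floating-point results, consider the toy sketch below. It shows that reordering a sum changes the bits of the result while a tolerance check still accepts it. The relative tolerance is a hypothetical placeholder and does not reproduce TAO's operator-level bounds or empirical thresholds.

```python
import math

# The same three numbers summed in two different orders, as might happen
# when a reduction is scheduled differently on different hardware.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

bitwise_equal = (a == b)

# Hypothetical acceptance rule: agree up to a small relative tolerance.
within_tolerance = math.isclose(a, b, rel_tol=1e-12, abs_tol=0.0)
```

Here `bitwise_equal` is false because floating-point addition is not associative, while `within_tolerance` is true.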
Data-Driven Prediction and Control of Hammerstein-Wiener Systems with Implicit Gaussian Processes
This work investigates data-driven prediction and control of Hammerstein-Wiener systems using physics-informed Gaussian process (GP) models that encode the block-oriented model structure. Data-driven prediction algorithms have been developed for structured nonlinear systems based on Willems' fundamental lemma. However, existing frameworks do not apply to output nonlinearities in Wiener systems and rely on a finite-dimensional dictionary of basis functions for Hammerstein systems. In this work, an implicit predictor structure is considered, leveraging the linearity for the dynamical part of the model. This implicit function is learned by GP regression, utilizing carefully designed structured kernel functions from linear model parameters and GP priors for the nonlinearities. Virtual derivative points are added to the regression by expectation propagation to encode monotonicity information of the nonlinearities. The linear model parameters are estimated as hyperparameters by assuming a stable spline hyperprior. The implicit GP model provides explicit output prediction by optimizing selected optimality criteria. The implicit model is also applied to receding horizon control with the expected control cost and chance constraint satisfaction guarantee. Numerical results demonstrate that the proposed prediction and control algorithms are superior to black-box GP models without model structure knowledge.
Edge-based Synchronization over Signed Digraphs with Multiple Leaders
This work addresses the edge-based synchronization problem in first-order multi-agent systems containing both cooperative and antagonistic interactions with one or multiple leader groups. The presence of multiple leaders and antagonistic interactions means that the multi-agent system typically does not achieve consensus, unless specific conditions (on the number of leaders and on the signed graph) are met, in which case the agents reach a trivial form of consensus. In general, we show that the multi-agent system exhibits a more general form of synchronization, including bipartite consensus and containment. Our approach proposes a signed edge-based agreement protocol for signed networks described by signed edge-Laplacian matrices. In particular, in this work, we present new spectral properties of signed edge-Laplacian matrices containing multiple zero eigenvalues and establish global exponential stability of the synchronization errors. Moreover, we explicitly compute the equilibrium to which all edge states converge, thereby characterizing the resulting synchronization behavior. Numerical simulations validate our theoretical results.
Customising Electricity Contracts at Scale with Large Language Models
The electricity system is becoming more complex, connecting massive numbers of end-users and distributed generators. Adding or removing grid connections requires expert studies to align technical constraints with user requests. In times of labour shortages, these studies take up a significant share of the time that engineers spend in the planning departments of system operators. As time is limited, only standard block connectivity contracts can be offered to end-users, or the requests pile up. Even when offers are made, they often do not perfectly match the user's requirements, leading to overpaying for or underusing the grid capacity. This paper investigates whether end-users can negotiate individual, flexible time-of-use contracts directly with the grid using Large Language Models (LLMs) in chats at scale. This work addresses system-level technical challenges in automating contract design under grid constraints, integrating LLMs with power system models, and ensuring secure, reliable interaction. We develop a chat system using functional programs for power system analysis, enabling users to request customised, technically feasible contracts at scale. We demonstrate high accuracy in executing engineering studies, robustness to user input variations, self-assessment of connection requests by small and medium enterprises, and potential for secure, chat-enabled maintenance planning. This initial study paves the way toward developing a tailored LLM system, with potentially high efficiency gains for grid planning and customer management. The code is available at: https://github.com/TU-Delft-AI-Energy-Lab/LLM-Electricity-Contracts
comment: 14 pages, 21 figures
A Neural Network-Based Real-time Casing Collar Recognition System for Downhole Instruments
Casing collar locator (CCL) measurements are widely used as reliable depth markers for positioning downhole instruments in cased-hole operations, enabling accurate depth control for operations such as perforation. However, autonomous collar recognition in downhole environments remains challenging because CCL signals are often corrupted by toolstring- or casing-induced magnetic interference, while stringent size and power budgets limit the use of computationally intensive algorithms and specific operations require real-time, in-situ processing. To address these constraints, we propose Collar Recognition Nets (CRNs), a family of domain-specific lightweight 1-D convolutional neural networks for collar signature recognition from streaming CCL waveforms. With depthwise separable convolutions and input pooling, CRNs optimize efficiency without sacrificing accuracy. Our most compact model achieves an F1-score of 0.972 on field data with only 1,985 parameters and 8,208 MACs. Deployed on an ARM Cortex-M7-based embedded system using the TensorFlow Lite for Microcontrollers (TFLM) library, the model demonstrates a throughput of 1,000 inferences per second and a latency of 343.2 μs, confirming the feasibility of robust, autonomous, and real-time collar recognition under stringent downhole constraints.
Federated Nonlinear System Identification
We consider federated learning of linearly-parameterized nonlinear systems. We establish theoretical guarantees on the effectiveness of federated nonlinear system identification compared to centralized approaches, demonstrating that the convergence rate improves as the number of clients increases. Although the convergence rates in the linear and nonlinear cases differ only by a constant, this constant depends on the feature map $φ$, which can be carefully chosen in the nonlinear setting to increase excitation and improve performance. We experimentally validate our theory in physical settings where client devices are driven by i.i.d. control inputs and control policies exhibiting i.i.d. random perturbations, ensuring non-active exploration. Experiments use trajectories from nonlinear dynamical systems characterized by real-analytic feature functions, including polynomial and trigonometric components, representative of physical systems including pendulum and quadrotor dynamics. We analyze the convergence behavior of the proposed method under varying noise levels and data distributions. Results show that federated learning consistently improves convergence of any individual client as the number of participating clients increases.
comment: 8 pages. Accepted at American Control Conference 2026
Accurate Small-Signal Modeling of Digitally Controlled Buck Converters with ADC-PWM Synchronization
Digital control has become increasingly widespread in modern power electronic converters. When acquiring feedback signals such as the inductor current, synchronizing the analog-to-digital converter (ADC) with the digital pulse-width modulator (DPWM) is commonly employed to accurately track their steady-state average. However, the small-signal implications of such synchronization have not been investigated. This paper presents an exact small-signal model for digitally controlled buck converters operating in forced continuous-conduction mode (FCCM) under constant-frequency current-mode control, explicitly accounting for DPWM-ADC synchronization. Using a sampled-data framework, the proposed model captures all sideband effects introduced by the sampling process, yielding precise predictions of both analog and digital loop gains, even at frequencies beyond the switching and sampling frequencies. Both asymmetrical and symmetrical carrier modulations are considered. Furthermore, the digital loop gain is derived in closed form using the modified z-transform, enabling low-complexity compensator design and stability assessment. Within this framework, the analog loop gain can be directly obtained from the digital loop gain, thereby eliminating the need for computationally intensive infinite series evaluations. The validity of the proposed model is confirmed through both simulation and experimental results.
Rapid Boundary Stabilization of Two-Dimensional Elastic Plates with In-Domain Aeroelastic Instabilities
Motivated by active wing flutter suppression in high-Mach-number flight, this paper presents a rapid boundary stabilization strategy for a two-dimensional PDE-modeled elastic plate with in-domain instabilities, where exponential stability is achieved with a decay rate that can be arbitrarily assigned by the user. First, the aeroelastic system is modeled as two-dimensional coupled wave PDEs with internal anti-damping terms, derived from piston theory and Hamilton's principle. Using Fourier series expansion, the 2-D problem is decomposed into a large-scale 1-D system, based on which full-state boundary feedback control is designed via PDE backstepping transformation. To enable output-feedback implementation, a state observer is further designed to estimate the distributed states over the two-dimensional spatial domain using the available boundary measurements. Through Lyapunov analysis, the exponential stability of the 2-D elastic plate PDE under the proposed boundary control is established with a designer-tunable decay rate. Numerical simulations verify the effectiveness of the control strategy in suppressing flow-induced vibrations in a 2-D elastic plate.
The Waterbed Effect on Quasiperiodic Disturbance Observer: Avoidance of Sensitivity Tradeoff with Time Delays
In linear time-invariant systems, the sensitivity function to disturbances is designed under a sensitivity tradeoff known as the waterbed effect. To compensate for a quasiperiodic disturbance, a quasiperiodic disturbance observer using time delays was proposed. Its sensitivity function avoids the sensitivity tradeoff, achieving wideband harmonic suppression without amplifying aperiodic disturbances or shifting harmonic suppression frequencies. However, its open-loop transfer function is not rational and does not satisfy the assumptions of existing Bode sensitivity integrals due to its time delays. This paper provides Bode-like sensitivity integrals for the quasiperiodic disturbance observer in both continuous-time and discrete-time representations and clarifies the avoided sensitivity tradeoff with time delays.
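For context, the classical Bode sensitivity integral behind the waterbed effect can be stated as follows. It assumes a rational open-loop transfer function $L(s)$ with relative degree at least two and a stable closed loop, which is the textbook continuous-time setting. The paper's Bode-like integrals for the delay-based observer are not reproduced here.

```latex
% Classical Bode sensitivity integral (textbook form, rational L(s)).
% S(s) = 1 / (1 + L(s)) is the sensitivity function and p_k are the
% open-loop poles of L(s) in the open right half-plane.
\[
  \int_{0}^{\infty} \ln \lvert S(j\omega) \rvert \, d\omega
  \;=\; \pi \sum_{k} \operatorname{Re}(p_k)
\]
% With no unstable open-loop poles the right-hand side is zero, so any
% frequency band with |S| < 1 must be balanced by a band with |S| > 1.
```

The rationality assumption is exactly what fails for an open-loop transfer function containing time delays.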
Training with Hard Constraints: Learning Neural Certificates and Controllers for SDEs
Due to their expressive power, neural networks (NNs) are promising templates for functional optimization problems, particularly for reach-avoid certificate generation for systems governed by stochastic differential equations (SDEs). However, ensuring hard-constraint satisfaction remains a major challenge. In this work, we propose two constraint-driven training frameworks with guarantees for supermartingale-based neural certificate construction and controller synthesis for SDEs. The first approach enforces certificate inequalities via domain discretization and a bound-based loss, guaranteeing global validity once the loss reaches zero. We show that this method also enables joint NN controller-certificate synthesis with hard guarantees. For high-dimensional systems where discretization becomes prohibitive, we introduce a partition-free, scenario-based training method that provides arbitrarily tight PAC guarantees for certificate constraint satisfaction. Benchmarks demonstrate scalability of the bound-based method up to 5D, outperforming the state of the art, and scalability of the scenario-based approach to at least 10D with high-confidence guarantees.
comment: Paper under review
Viability-Preserving Passive Torque Control
Conventional passivity-based torque controllers for manipulators are typically unconstrained, which can lead to safety violations under external perturbations. In this paper, we employ viability theory to pre-compute safe sets in the state-space of joint positions and velocities. These viable sets, constructed via data-driven and analytical methods for self-collision avoidance, external object collision avoidance and joint-position and joint-velocity limits, provide constraints on joint accelerations and thus joint torques via the robot dynamics. A quadratic programming-based control framework enforces these constraints on a passive controller tracking a dynamical system, ensuring the robot states remain within the safe set in an infinite time horizon. We validate the proposed approach through simulations and hardware experiments on a 7-DoF Franka Emika manipulator. In comparison to a baseline constrained passive controller, our method operates at higher control-loop rates and yields smoother trajectories.
comment: 8 pages, 7 figures, Project Website: https://vpp-tc.github.io/webpage/
Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system's context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.
Experimental Demonstration of a Decentralized Electromagnetic Formation Flying Control Using Alternating Magnetic Field Forces
Electromagnetic formation flying (EMFF) is challenging due to the complex coupling between the electromagnetic fields generated by each satellite in the formation. To address this challenge, this article uses alternating magnetic field forces (AMFF) to decouple the electromagnetic forces between each pair of satellites. The key idea of AMFF is that a pair of alternating (e.g., sinusoidal) magnetic moments results in a nonzero time-averaged interaction force if and only if those alternating magnetic moments have the same frequency. Hence, the approach in this article is to drive each satellite's electromagnetic actuation system with a sum of sinusoids, where each frequency is common to only a pair of satellites. Then, the amplitudes of each sinusoid are modulated (i.e., controlled) to achieve the desired forces between each pair of satellites. The main contribution of this article is a 3-satellite experimental demonstration of decentralized closed-loop EMFF using AMFF. To the authors' knowledge, this is the first demonstration of AMFF with at least 3 satellites in open or closed loop. This is noteworthy because the coupling challenges of EMFF are only present with more than 2 satellites, and thus, a formation of at least 3 is necessary to evaluate the effectiveness of AMFF. The experiments are conducted on a ground-based testbed consisting of 3 electromagnetically actuated satellites on linear air tracks. The closed-loop experiments demonstrate decentralized EMFF with AMFF where the mean steady-state formation error is less than 0.005 m, the maximum steady-state formation error is less than $\pm$0.01 m, and the settling time is less than 30 s. The closed-loop experimental results are compared with behavior from numerical simulations.
comment: Preprint submitted to Aerospace Science and Technology (Elsevier)
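The frequency-matching idea behind AMFF can be checked numerically with a scalar toy model. The sketch below averages the product of two unit-amplitude sinusoids over a window containing whole periods of both. The frequencies, window length, and function name are hypothetical, and the real interaction force also depends on dipole geometry, which is ignored here.

```python
import numpy as np

def time_averaged_product(f1_hz, f2_hz, duration_s=10.0, n_samples=200_000):
    """Average of sin(2*pi*f1*t) * sin(2*pi*f2*t) over the window."""
    t = np.linspace(0.0, duration_s, n_samples, endpoint=False)
    return float(np.mean(np.sin(2 * np.pi * f1_hz * t) *
                         np.sin(2 * np.pi * f2_hz * t)))

same_freq = time_averaged_product(5.0, 5.0)   # matched pair of moments
diff_freq = time_averaged_product(5.0, 7.0)   # mismatched pair of moments
```

The matched pair averages to about 0.5 and the mismatched pair averages to about zero, which is why assigning a unique frequency to each satellite pair decouples the pairwise forces.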
Personalized Collaborative Learning with Affinity-Based Variance Reduction
Multi-agent learning faces a fundamental tension: leveraging distributed collaboration without sacrificing the personalization needed for diverse agents. This tension intensifies when aiming for full personalization while adapting to unknown heterogeneity levels -- gaining collaborative speedup when agents are similar, without performance degradation when they are different. Embracing the challenge, we propose personalized collaborative learning (PCL), a novel framework for heterogeneous agents to collaboratively learn personalized solutions with seamless adaptivity. Through carefully designed bias correction and importance correction mechanisms, our method AffPCL robustly handles both environment and objective heterogeneity. We prove that AffPCL reduces sample complexity over independent learning by a factor of $\max\{n^{-1}, δ\}$, where $n$ is the number of agents and $δ\in[0,1]$ measures their heterogeneity. This affinity-based acceleration automatically interpolates between the linear speedup of federated learning in homogeneous settings and the baseline of independent learning, without requiring prior knowledge of the system. Our analysis further reveals that an agent may obtain linear speedup even by collaborating with arbitrarily dissimilar agents, unveiling new insights into personalization and collaboration in the high heterogeneity regime.
comment: Published as a conference paper at ICLR 2026
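The interpolation described by the factor $\max\{n^{-1}, δ\}$ can be made concrete with a few values. The helper below only evaluates the factor stated in the abstract; the function name and the example inputs are hypothetical.

```python
def affinity_factor(n_agents: int, delta: float) -> float:
    """Evaluate max(1/n, delta), the stated sample-complexity factor."""
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return max(1.0 / n_agents, delta)

homogeneous = affinity_factor(10, 0.0)     # identical agents: 1/n speedup
intermediate = affinity_factor(10, 0.25)   # heterogeneity dominates 1/n
heterogeneous = affinity_factor(10, 1.0)   # no gain over independent learning
```

With ten agents, the factor moves from 0.1 in the homogeneous case to 1.0 in the fully heterogeneous case, matching the two endpoints named in the abstract.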
Distributed AC Optimal Power Flow: A Scalable Solution for Large-Scale Problems
This paper introduces a novel distributed optimization framework for large-scale AC Optimal Power Flow (OPF) problems, offering both theoretical convergence guarantees and rapid convergence in practice. By integrating smoothing techniques and the Schur complement, the proposed approach addresses the scalability challenges and reduces communication overhead in distributed AC OPF. Additionally, optimal network decomposition enables efficient parallel processing under the single program multiple data (SPMD) paradigm. Extensive simulations on large-scale benchmarks across various operating scenarios indicate that the proposed framework outperforms the state-of-the-art centralized solver IPOPT on modest hardware. This paves the way for more scalable and efficient distributed optimization in future power system applications.
Robotics
Hybrid TD3: Overestimation Bias Analysis and Stable Policy Optimization for Hybrid Action Space
Reinforcement learning in discrete-continuous hybrid action spaces presents fundamental challenges for robotic manipulation, where high-level task decisions and low-level joint-space execution must be jointly optimized. Existing approaches either discretize continuous components or relax discrete choices into continuous approximations, which suffer from scalability limitations and training instability in high-dimensional action spaces and under domain randomization. In this paper, we propose Hybrid TD3, an extension of Twin Delayed Deep Deterministic Policy Gradient (TD3) that natively handles parameterized hybrid action spaces in a principled manner. We conduct a rigorous theoretical analysis of overestimation bias in hybrid action settings, deriving formal bounds under twin-critic architectures and establishing a complete bias ordering across five algorithmic variants. Building on this analysis, we introduce a weighted clipped Q-learning target that marginalizes over the discrete action distribution, achieving equivalent bias reduction to standard clipped minimization while improving policy smoothness. Experimental results demonstrate that Hybrid TD3 achieves superior training stability and competitive performance against state-of-the-art hybrid action baselines.
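One plausible reading of a clipped target that marginalizes over the discrete action distribution is sketched below. This is an assumption made for illustration, not the paper's formula: the twin-critic minimum is taken per discrete action and then averaged with the discrete policy's probabilities. All names and numbers are hypothetical.

```python
import numpy as np

def weighted_clipped_target(reward, gamma, done, probs, q1_next, q2_next):
    """Assumed target: r + gamma * sum_k pi(k|s') * min(Q1_k, Q2_k).

    probs[k] is the probability of discrete action k at the next state;
    q1_next[k] and q2_next[k] are the twin-critic values for action k
    paired with its continuous parameters.
    """
    probs = np.asarray(probs, dtype=float)
    clipped = np.minimum(q1_next, q2_next)       # per-action clipped value
    expected = float(np.dot(probs, clipped))     # marginalize over k
    return reward + gamma * (1.0 - done) * expected

y = weighted_clipped_target(
    reward=1.0, gamma=0.9, done=0.0,
    probs=[0.5, 0.5], q1_next=[2.0, 4.0], q2_next=[3.0, 1.0],
)
```

In this example the clipped values are 2.0 and 1.0, their weighted average is 1.5, and the target is 1.0 + 0.9 * 1.5 = 2.35.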
Spherical Latent Motion Prior for Physics-Based Simulated Humanoid Control
Learning motion priors for physics-based humanoid control is an active research topic. Existing approaches mainly include variational autoencoders (VAEs) and adversarial motion priors (AMP). VAEs introduce information loss, and random latent sampling may sometimes produce invalid behaviors. AMP suffers from mode collapse and struggles to capture diverse motion skills. We present the Spherical Latent Motion Prior (SLMP), a two-stage method for learning motion priors. In the first stage, we train a high-quality motion tracking controller. In the second stage, we distill the tracking controller into a spherical latent space. A combination of distillation, a discriminator, and a discriminator-guided local semantic consistency constraint shapes a structured latent action space, allowing stable random sampling without information loss. To evaluate SLMP, we collect a two-hour human combat motion capture dataset and show that SLMP preserves fine motion detail without information loss, and that random sampling yields semantically valid and stable behaviors. When applied to a two-agent physics-based combat task, SLMP produces human-like and physically plausible combat behaviors using only simple rule-based rewards. Furthermore, SLMP generalizes across different humanoid robot morphologies, demonstrating its transferability beyond a single simulated avatar.
Integrating LTL Constraints into PPO for Safe Reinforcement Learning
This paper proposes Proximal Policy Optimization with Linear Temporal Logic Constraints (PPO-LTL), a framework that integrates safety constraints written in LTL into PPO for safe reinforcement learning. LTL constraints offer rigorous representations of complex safety requirements, such as regulations that broadly exist in robotics, enabling systematic monitoring of safety requirements. Violations against LTL constraints are monitored by limit-deterministic Büchi automata, and then translated by a logic-to-cost mechanism into penalty signals. The signals are further employed for guiding the policy optimization via the Lagrangian scheme. Extensive experiments on the Zones and CARLA environments show that our PPO-LTL can consistently reduce safety violations, while maintaining competitive performance, against the state-of-the-art methods. The code is at https://github.com/EVIEHub/PPO-LTL.
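The Lagrangian scheme mentioned in this abstract can be sketched generically as follows. This is a minimal illustration of a standard dual-ascent multiplier update and a cost-penalized reward, assuming a scalar cost signal and a cost limit; the function names, the learning rate, and the cost limit are hypothetical and are not taken from the PPO-LTL code.

```python
def update_multiplier(lmbda, episode_cost, cost_limit, lr):
    """Dual ascent step: raise the multiplier when cost exceeds the limit."""
    # Projection onto [0, inf) keeps the multiplier non-negative.
    return max(0.0, lmbda + lr * (episode_cost - cost_limit))

def penalized_reward(reward, cost, lmbda):
    """Reward seen by the policy optimizer after subtracting the weighted cost."""
    return reward - lmbda * cost

# Example: the cost (3.0) exceeds the limit (1.0), so the multiplier grows.
new_lmbda = update_multiplier(lmbda=0.5, episode_cost=3.0, cost_limit=1.0, lr=0.1)
shaped = penalized_reward(reward=1.0, cost=2.0, lmbda=0.25)
```

In this generic form the multiplier shrinks toward zero once violations fall below the limit, so the penalty fades when the policy behaves safely.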
Information-Theoretic Framework for Self-Adapting Model Predictive Controllers
Model Predictive Control (MPC) is a vital technique for autonomous systems, like Unmanned Aerial Vehicles (UAVs), enabling optimized motion planning. However, traditional MPC struggles to adapt to real-time changes such as dynamic obstacles and shifting system dynamics, lacking inherent mechanisms for self-monitoring and adaptive optimization. Here, we introduce Entanglement Learning (EL), an information-theoretic framework that enhances MPC adaptability through an Information Digital Twin (IDT). The IDT monitors and quantifies, in bits, the information flow between MPC inputs, control actions, and UAV behavior. By introducing new information-theoretic metrics we call entanglement metrics, it tracks variations in these dependencies. These metrics measure the mutual information between the optimizer's input, its control actions, and the resulting UAV dynamics, enabling a deeper understanding of their interrelationships. This allows the IDT to detect performance deviations and generate real-time adaptive signals to recalibrate MPC parameters, preserving stability. Unlike traditional MPC, which relies on error-based feedback, this dual-feedback approach leverages information flow for proactive adaptation to evolving conditions. Scalable and leveraging existing infrastructure, this framework improves MPC reliability and robustness across diverse scenarios, extending beyond UAV control to any MPC implementation requiring adaptive performance.
comment: 9 pages, 5 figures
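To make the idea of measuring dependencies "in bits" concrete, the sketch below computes a plug-in estimate of mutual information between two discrete signals from their empirical joint distribution. It is a generic estimator under the assumption that the signals have already been discretized; it is not the entanglement metric defined in the paper, and the function name is hypothetical.

```python
import numpy as np

def mutual_information_bits(x, y):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    # Map each observed symbol to an integer index.
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    # Empirical joint distribution from co-occurrence counts.
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()
    # Marginals and their outer product.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    indep = px @ py
    # Sum only over cells with non-zero probability to avoid log(0).
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

# Fully dependent binary signals share one bit; independent ones share none.
mi_dependent = mutual_information_bits([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information_bits([0, 0, 1, 1], [0, 1, 0, 1])
```

A drop in such a quantity over a sliding window would be one simple way to flag that the relationship between controller inputs and actions has changed.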
Certifiable Estimation with Factor Graphs
Factor graphs provide a convenient modular modeling language that enables practitioners to design and deploy high-performance robotic state estimation systems by composing simple, reusable building blocks. However, inference in these models is typically performed using local optimization methods that can converge to suboptimal solutions, a serious reliability concern in safety-critical applications. Conversely, certifiable estimators based on convex relaxation can recover verifiably globally optimal solutions in many practical settings, but the computational cost of solving their large-scale relaxations necessitates specialized, structure-exploiting solvers that require substantial expertise to implement, significantly hampering practical deployment. In this paper, we show that these two paradigms, which have thus far been treated as independent in the literature, can be naturally synthesized into a unified framework for certifiable factor graph optimization. The key insight is that factor graph structure is preserved under Shor's relaxation and Burer-Monteiro factorization: applying these transformations to a QCQP with an associated factor graph representation yields a lifted problem admitting a factor graph model with identical connectivity, in which variables and factors are simple one-to-one algebraic transformations of those in the original QCQP. This structural preservation enables the Riemannian Staircase methodology for certifiable estimation to be implemented using the same mature, highly-performant factor graph libraries and workflows already ubiquitously employed throughout robotics and computer vision, making certifiable estimation as straightforward to design and deploy as conventional factor graph inference.
RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design
Robotic manipulation policies have made rapid progress in recent years, yet most existing approaches give limited consideration to memory capabilities. Consequently, they struggle to solve tasks that require reasoning over historical observations and maintaining task-relevant information over time, which are common requirements in real-world manipulation scenarios. Although several memory-aware policies have been proposed, systematic evaluation of memory-dependent manipulation remains underexplored, and the relationship between architectural design choices and memory performance is still not well understood. To address this gap, we introduce RMBench, a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity, enabling systematic evaluation of policy memory capabilities. We further propose Mem-0, a modular manipulation policy with explicit memory components designed to support controlled ablation studies. Through extensive simulation and real-world experiments, we identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance. The website is available at https://rmbench.github.io/.
comment: website: https://rmbench.github.io/
Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction ICIP
Pre-trained general-purpose Vision-Language Models (VLMs) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinate detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.
comment: Accepted at Workshop on Integrating Image Processing with Large-Scale Language/Vision Models for Advanced Visual Understanding (LVLM) at IEEE International Conference on Image Processing (ICIP) 2025
Agent-Based Simulation of Trust Development in Human-Robot Teams: An Empirically-Validated Framework
This paper presents an empirically grounded agent-based model capturing trust dynamics, workload distribution, and collaborative performance in human-robot teams. The model, implemented in NetLogo 6.4.0, simulates teams of 2--10 agents performing tasks of varying complexity. We validate against Hancock et al.'s (2021) meta-analysis, achieving interval validity for 4 of 8 trust antecedent categories and strong ordinal validity (Spearman $\rho = 0.833$). Sensitivity analysis using OFAT and full factorial designs ($n = 50$ replications per condition) reveals robot reliability exhibits the strongest effect on trust ($\eta^2 = 0.35$) and dominates task success ($\eta^2 = 0.93$) and productivity ($\eta^2 = 0.89$), consistent with meta-analytic findings. Trust asymmetry ratios ranged from 0.07 to 0.55 -- below the meta-analytic benchmark of 1.50 -- revealing that per-event asymmetry does not guarantee cumulative asymmetry when trust repair mechanisms remain active. Scenario analysis uncovered trust-performance decoupling: the Trust Recovery scenario achieved the highest productivity (4.29) despite the lowest trust (38.2), while the Unreliable Robot scenario produced the highest trust (73.2) despite the lowest task success (33.4%), establishing calibration error as a critical diagnostic distinct from trust magnitude. Factorial ANOVA confirmed significant main effects for reliability, transparency, communication, and collaboration ($p < .001$), explaining 45.4% of trust variance. The open-source implementation provides an evidence-based tool for identifying overtrust and undertrust conditions prior to deployment.
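For readers unfamiliar with the eta-squared effect sizes reported in this abstract, the sketch below shows the standard one-way definition: the between-group sum of squares divided by the total sum of squares. The data are made up for illustration and the function name is hypothetical; the paper's factorial analysis may use a different variant (for example, partial eta-squared).

```python
import numpy as np

def eta_squared(groups):
    """One-way eta-squared: SS_between / SS_total (assumes SS_total > 0)."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    all_vals = np.concatenate(arrays)
    grand_mean = all_vals.mean()
    # Total variability of every observation around the grand mean.
    ss_total = np.sum((all_vals - grand_mean) ** 2)
    # Variability of group means around the grand mean, weighted by group size.
    ss_between = sum(len(a) * (a.mean() - grand_mean) ** 2 for a in arrays)
    return float(ss_between / ss_total)

# Groups that differ only in their means give 1.0; identical groups give 0.0.
effect_full = eta_squared([[1, 1], [3, 3]])
effect_none = eta_squared([[1, 3], [1, 3]])
```

Read this way, a value such as 0.93 means that nearly all of the observed variance in the outcome is associated with the factor in question.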
riMESA: Consensus ADMM for Real-World Collaborative SLAM
Collaborative Simultaneous Localization and Mapping (C-SLAM) is a fundamental capability for multi-robot teams as it enables downstream tasks like planning and navigation. However, existing C-SLAM back-end algorithms struggle to address the practical realities of real-world deployments (i.e., communication limitations, outlier measurements, and online operation). In this paper we propose Robust Incremental Manifold Edge-based Separable ADMM (riMESA) -- a robust, incremental, and distributed C-SLAM back-end that is resilient to outliers, reliable in the face of limited communication, and can compute accurate state estimates for a multi-robot team in real-time. Through the development of riMESA, we more broadly argue for the use of Consensus Alternating Direction Method of Multipliers as a theoretical foundation for distributed optimization tasks in robotics like C-SLAM due to its flexibility, accuracy, and fast convergence. We conclude this work with an in-depth evaluation of riMESA on a variety of C-SLAM problem scenarios and communication network conditions using both synthetic and real-world C-SLAM data. These experiments demonstrate that riMESA is able to generalize across conditions, produce accurate state estimates, operate in real-time, and outperform the accuracy of prior works by a factor of more than 7x on real-world datasets.
Path Integral Particle Filtering for Hybrid Systems via Saltation Matrices
We present an optimal-control-based particle filtering method for state estimation in hybrid systems that undergo intermittent contact with their environments. We follow the path integral filtering framework that exploits the duality between the smoothing problem and optimal control. We leverage saltation matrices to map out the uncertainty propagation during contact events for hybrid systems. The resulting path integral optimal control problem allows for a state estimation algorithm robust to outlier effects, flexible to non-Gaussian noise distributions, that also handles the challenging contact dynamics in hybrid systems. This work offers a computationally efficient and reliable estimation algorithm for hybrid systems with stochastic dynamics. We also present extensive experimental results demonstrating that our approach consistently outperforms strong baselines across multiple settings.
RAG-RUSS: A Retrieval-Augmented Robotic Ultrasound for Autonomous Carotid Examination ICRA
Robotic ultrasound (US) has recently attracted increasing attention as a means to overcome the limitations of conventional US examinations, such as the strong operator dependence. However, the decision-making process of existing methods is often either rule-based or relies on end-to-end learning models that operate as black boxes. This is regarded as a major obstacle to clinical acceptance and raises safety concerns for widespread adoption in routine practice. To tackle this challenge, we introduce RAG-RUSS, an interpretable framework capable of performing a full carotid examination in accordance with the clinical workflow while explicitly explaining both the current stage and the next planned action. Furthermore, given the scarcity of medical data, we incorporate retrieval-augmented generation to enhance generalization and reduce dependence on large-scale training datasets. The method was trained on data acquired from 28 volunteers, while an additional four volumetric scans recorded from previously unseen volunteers were reserved for testing. The results demonstrate that the method can explain the current scanning stage and autonomously plan probe motions to complete the carotid examination, encompassing both transverse and longitudinal planes.
comment: Accepted by ICRA
D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping ICLR 2026
Simulation provides a cost-effective and flexible platform for data generation and policy learning to develop robotic systems. However, bridging the gap between simulation and real-world dynamics remains a significant challenge, especially in physical parameter identification. In this work, we introduce a real-to-sim-to-real engine that leverages Gaussian Splat representations to build a differentiable engine, enabling object mass identification from real-world visual observations and robot control signals while simultaneously enabling grasping policy learning. By optimizing the mass of the manipulated object, our method automatically builds high-fidelity and physically plausible digital twins. Additionally, we propose a novel approach to train force-aware grasping policies from limited data by transferring feasible human demonstrations into simulated robot demonstrations. Through comprehensive experiments, we demonstrate that our engine achieves accurate and robust performance in mass identification across various object geometries and mass values. These optimized mass values facilitate force-aware policy learning, achieving superior performance in object grasping and effectively reducing the sim-to-real gap.
comment: ICLR 2026 Poster
A Deployable Bio-inspired Compliant Leg Design for Enhanced Leaping in Quadruped Robots
Quadruped robots are becoming increasingly essential for various applications, including industrial inspection and catastrophe search and rescue. These scenarios require robots to possess enhanced agility and obstacle-navigation skills. Nonetheless, the performance of current platforms is often constrained by insufficient peak motor power, limiting their ability to perform explosive jumps. To address this challenge, this paper proposes a bio-inspired method that emulates the energy-storage mechanism found in froghopper legs. We designed a Deployable Compliant Leg (DCL) utilizing a specialized 3D-printed elastic material, Polyether block amide (PEBA), featuring a lightweight internal lattice structure. This structure functions analogously to biological tendons, storing elastic energy during the robot's squatting phase and rapidly releasing it to augment motor output during the leap. The proposed mechanical design significantly enhances the robot's vertical jumping capability. Through finite element analysis (FEA) and experimental validation, we demonstrate a relative performance improvement of 17.1% in vertical jumping height.
Pro-HOI: Perceptive Root-guided Humanoid-Object Interaction
Executing reliable Humanoid-Object Interaction (HOI) tasks for humanoid robots is hindered by the lack of generalized control interfaces and robust closed-loop perception mechanisms. In this work, we introduce Perceptive Root-guided Humanoid-Object Interaction, Pro-HOI, a generalizable framework for robust humanoid loco-manipulation. First, we collect box-carrying motions that are suitable for real-world deployment and optimize penetration artifacts through a Signed Distance Field loss. Second, we propose a novel training framework that conditions the policy on a desired root-trajectory while utilizing reference motion exclusively as a reward. This design not only eliminates the need for intricate reward tuning but also establishes root trajectory as a universal interface for high-level planners, enabling simultaneous navigation and loco-manipulation. Furthermore, to ensure operational reliability, we incorporate a persistent object estimation module. By fusing real-time detection with Digital Twin, this module allows the robot to autonomously detect slippage and trigger re-grasping maneuvers. Empirical validation on a Unitree G1 robot demonstrates that Pro-HOI significantly outperforms baselines in generalization and robustness, achieving reliable long-horizon execution in complex real-world scenarios.
Fast Confidence-Aware Human Prediction via Hardware-accelerated Bayesian Inference for Safe Robot Navigation
As robots increasingly integrate into everyday environments, ensuring their safe navigation around humans becomes imperative. Efficient and safe motion planning requires robots to account for human behavior, particularly in constrained spaces such as grocery stores or care homes, where interactions with multiple individuals are common. Prior research has employed Bayesian frameworks to model human rationality based on navigational intent, enabling the prediction of probabilistic trajectories for planning purposes. In this work, we present a simple yet novel approach for confidence-aware prediction that treats future predictions as particles. This framework is highly parallelized and accelerated on a graphics processing unit (GPU). As a result, it enables longer-term predictions at a frequency of 125 Hz and can be easily extended for multi-human predictions. Compared to existing methods, our implementation supports finer prediction time steps, yielding more granular trajectory forecasts. This enhanced resolution allows motion planners to respond effectively to subtle changes in human behavior. We validate our approach through real-world experiments, demonstrating a robot safely navigating among multiple humans with diverse navigational goals. Our results highlight the method's potential for robust and efficient human-robot coexistence in dynamic environments.
From Dialogue to Execution: Mixture-of-Agents Assisted Interactive Planning for Behavior Tree-Based Long-Horizon Robot Execution
Interactive task planning with large language models (LLMs) enables robots to generate high-level action plans from natural language instructions. However, in long-horizon tasks, such approaches often require many questions, increasing user burden. Moreover, flat plan representations become difficult to manage as task complexity grows. We propose a framework that integrates Mixture-of-Agents (MoA)-based proxy answering into interactive planning and generates Behavior Trees (BTs) for structured long-term execution. The MoA consists of multiple LLM-based expert agents that answer general or domain-specific questions when possible, reducing unnecessary human intervention. The resulting BT hierarchically represents task logic and enables retry mechanisms and dynamic switching among multiple robot policies. Experiments on cocktail-making tasks show that the proposed method reduces human response requirements by approximately 27% while maintaining structural and semantic similarity to fully human-answered BTs. Real-robot experiments on a smoothie-making task further demonstrate successful long-horizon execution with adaptive policy switching and recovery from action failures. These results indicate that MoA-assisted interactive planning improves dialogue efficiency while preserving execution quality in real-world robotic tasks.
Compact Task-Aligned Imitation Learning for Laboratory Automation
Robotic laboratory automation has traditionally relied on carefully engineered motion pipelines and task-specific hardware interfaces, resulting in high design cost and limited flexibility. While recent imitation learning techniques can generate general robot behaviors, their large model sizes often require high-performance computational resources, limiting applicability in practical laboratory environments. In this study, we propose a compact imitation learning framework for laboratory automation using small foundation models. The proposed method, TVF-DiT, aligns a self-supervised vision foundation model with a vision-language model through a compact adapter, and integrates them with a Diffusion Transformer-based action expert. The entire model consists of fewer than 500M parameters, enabling inference on low-VRAM GPUs. Experiments on three real-world laboratory tasks - test tube cleaning, test tube arrangement, and powder transfer - demonstrate an average success rate of 86.6%, significantly outperforming alternative lightweight baselines. Furthermore, detailed task prompts improve vision-language alignment and task performance. These results indicate that small foundation models, when properly aligned and integrated with diffusion-based policy learning, can effectively support practical laboratory automation with limited computational resources.
SMR-Net: Robot Snap Detection Based on Multi-Scale Features and Self-Attention Network
In robot automated assembly, snap assembly precision and efficiency directly determine overall production quality. As a core prerequisite, snap detection and localization critically affect subsequent assembly success. Traditional visual methods suffer from poor robustness and large localization errors when handling complex scenarios (e.g., transparent or low-contrast snaps), failing to meet high-precision assembly demands. To address this, this paper designs a dedicated sensor and proposes SMR-Net, a self-attention-based multi-scale object detection algorithm, to synergistically enhance detection and localization performance. SMR-Net adopts an attention-enhanced multi-scale feature fusion architecture: raw sensor data is encoded via an attention-embedded feature extractor to strengthen key snap features and suppress noise; three multi-scale feature maps are processed in parallel with standard and dilated convolution for dimension unification while preserving resolution; an adaptive reweighting network dynamically assigns weights to fused features, generating fine representations integrating details and global semantics. Experimental results on Type A and Type B snap datasets show SMR-Net outperforms traditional Faster R-CNN significantly: Intersection over Union (IoU) improves by 6.52% and 5.8%, and mean Average Precision (mAP) increases by 2.8% and 1.5%, respectively. This demonstrates the method's superiority in complex snap detection and localization tasks.
comment: snap assembly, snap detection and localization, object detection, multi-scale feature fusion, self-attention
An Open-Source Modular Benchmark for Diffusion-Based Motion Planning in Closed-Loop Autonomous Driving
Diffusion-based motion planners have achieved state-of-the-art results on benchmarks such as nuPlan, yet their evaluation within closed-loop production autonomous driving stacks remains largely unexplored. Existing evaluations abstract away ROS 2 communication latency and real-time scheduling constraints, while monolithic ONNX deployment freezes all solver parameters at export time. We present an open-source modular benchmark that addresses both gaps: using ONNX GraphSurgeon, we decompose a monolithic 18,398 node diffusion planner into three independently executable modules and reimplement the DPM-Solver++ denoising loop in native C++. Integrated as a ROS 2 node within Autoware, the open-source AD stack deployed on real vehicles worldwide, the system enables runtime-configurable solver parameters without model recompilation and per-step observability of the denoising process, breaking the black box of monolithic deployment. Unlike evaluations in standalone simulators such as CARLA, our benchmark operates within a production-grade stack and is validated through AWSIM closed-loop simulation. Through systematic comparison of DPM-Solver++ (first- and second-order) and DDIM across six step-count configurations (N in {3, 5, 7, 10, 15, 20}), we show that encoder caching yields a 3.2x latency reduction, and that second-order solving reduces FDE by 41% at N=3 compared to first-order. The complete codebase will be released as open-source, providing a direct path from simulation benchmarks to real-vehicle deployment.
comment: 8 pages, 5 figures
MiniUGV$_2$: A Compact UAV-Deployable Tracked Ground Vehicle with Manipulation Capabilities
Exploring and inspecting Hidden Spaces, defined as environments whose entrances are accessible only to aerial robots but remain unexplored due to geometric constraints, limited flight time, and communication loss, remains a major challenge. We present miniUGV$_2$, a compact UAV-deployable tracked ground vehicle that extends UAV capabilities into confined environments. The system introduces dual articulated arms, integrated LiDAR and depth sensing, and modular electronics for enhanced autonomy. A novel tether module with an electro-permanent magnetic head enables safe deployment, retrieval, and optional detachment, thereby overcoming prior entanglement issues. Experiments demonstrate robust terrain navigation, self-righting, and manipulation of objects up to 3.5 kg, validating miniUGV$_2$ as a versatile platform for hybrid aerial-ground robotics.
HierKick: Hierarchical Reinforcement Learning for Vision-Guided Soccer Robot Control
Controlling soccer robots involves multi-time-scale decision-making, which requires balancing long-term tactical planning and short-term motion execution. Traditional end-to-end reinforcement learning (RL) methods face challenges in complex dynamic environments. This paper proposes HierKick, a vision-guided soccer robot control framework based on dual-frequency hierarchical RL. The framework adopts a hierarchical control architecture featuring a 5 Hz high-level policy that integrates YOLOv8 for real-time detection and selects tasks via a coach model, and a pre-trained 50 Hz low-level controller for precise joint control. Through this architecture, the framework achieves the four steps of approaching, aligning, dribbling, and kicking. Experimental results show that the success rates of this framework are 95.2% in IsaacGym, 89.8% in MuJoCo, and 80% in the real world. HierKick provides an effective hierarchical paradigm for robot control in complex environments, extendable to multi-time-scale tasks, with its modular design and skill reuse offering a new path for intelligent robot control.
comment: 15 pages, 6 figures
DRIFT: Diffusion-based Rule-Inferred For Trajectories
Trajectory generation for mobile robots in unstructured environments faces a critical dilemma: balancing kinematic smoothness for safe execution with terminal precision for fine-grained tasks. Existing generative planners often struggle with this trade-off, yielding either smooth but imprecise paths or geometrically accurate but erratic motions. To address these shortcomings, this article proposes DRIFT (Diffusion-based Rule-Inferred for Trajectories), a conditional diffusion framework designed to generate high-fidelity reference trajectories by integrating two complementary inductive biases. First, a Relational Inductive Bias, realized via a GNN-based Structured Scene Perception (SSP) module, encodes global topological constraints to ensure holistic smoothness. Second, a Temporal Attention Bias, implemented through a novel Graph-Conditioned Time-Aware GRU (GTGRU), dynamically attends to sparse obstacles and targets for precise local maneuvering. Quantitative results demonstrate that DRIFT reconciles these conflicting objectives, achieving centimeter-level imitation fidelity (0.041 m FDE) and competitive smoothness (27.19 Jerk). This balance yields highly executable reference plans for downstream control.
DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation ICRA2026
In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulations to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
comment: Accepted to ICRA2026
DriveCode: Domain Specific Numerical Encoding for LLM-Based Autonomous Driving
Large language models (LLMs) have shown great promise for autonomous driving. However, discretizing numbers into tokens limits precise numerical reasoning, fails to reflect the positional significance of digits in the training objective, and makes it difficult to achieve both decoding efficiency and numerical precision. These limitations affect both the processing of sensor measurements and the generation of precise control commands, creating a fundamental barrier for deploying LLM-based autonomous driving systems. In this paper, we introduce DriveCode, a novel numerical encoding method that represents numbers as dedicated embeddings rather than discrete text tokens. DriveCode employs a number projector to map numbers into the language model's hidden space, enabling seamless integration with visual and textual features in a unified multimodal sequence. Evaluated on OmniDrive, DriveGPT4, and DriveGPT4-V2 datasets, DriveCode demonstrates superior performance in trajectory prediction and control signal generation, confirming its effectiveness for LLM-based autonomous driving systems.
comment: The project page is available at https://shiftwilliam.github.io/DriveCode
Minimalist Compliance Control
Compliance control is essential for safe physical interaction, yet its adoption is limited by hardware requirements such as force torque sensors. While recent reinforcement learning approaches aim to bypass these constraints, they often suffer from sim-to-real gaps, lack safety guarantees, and add system complexity. We propose Minimalist Compliance Control, which enables compliant behavior using only motor current or voltage signals readily available in modern servos and quasi-direct-drive motors, without force sensors, current control, or learning. External wrenches are estimated from actuator signals and Jacobians and incorporated into a task-space admittance controller, preserving sufficient force measurement accuracy for stable and responsive compliance control. Our method is embodiment-agnostic and plug-and-play with diverse high-level planners. We validate our approach on a robot arm, a dexterous hand, and two humanoid robots across multiple contact-rich tasks, using vision-language models, imitation learning, and model-based planning. The results demonstrate robust, safe, and compliant interaction across embodiments and planning paradigms.
comment: Project website: https://minimalist-compliance-control.github.io/
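The pipeline outlined in this abstract (external wrench estimated from actuator signals and Jacobians, then fed to a task-space admittance controller) can be sketched in textbook form. The code below assumes that an external joint-torque estimate is already available (for example, derived from motor current), uses the standard relation tau_ext = J^T F, and integrates a one-dimensional mass-damper-spring admittance law. All names, gains, and the integration scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def estimate_wrench(jacobian, tau_ext):
    """Recover the task-space wrench F from tau_ext = J^T F (least squares)."""
    return np.linalg.pinv(jacobian.T) @ tau_ext

def admittance_step(x, v, f_ext, m, d, k, dt):
    """One semi-implicit Euler step of m*a + d*v + k*x = f_ext.

    x is the displacement from the nominal pose, v its velocity.
    """
    a = (f_ext - d * v - k * x) / m
    v_new = v + a * dt
    x_new = x + v_new * dt
    return x_new, v_new

# Example: identity Jacobian, so the wrench equals the joint torques.
wrench = estimate_wrench(np.eye(2), np.array([1.0, 2.0]))
# A unit force on a unit virtual mass with no damping or stiffness.
x1, v1 = admittance_step(x=0.0, v=0.0, f_ext=1.0, m=1.0, d=0.0, k=0.0, dt=0.1)
```

The virtual mass, damping, and stiffness set how readily the commanded pose yields to contact: lower values give softer behavior, at the cost of larger deviations from the nominal trajectory.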
A Novel Reconfigurable Dexterous Hand Based on Triple-Symmetric Bricard Parallel Mechanism
This paper introduces a novel design for a robotic hand based on parallel mechanisms. The proposed hand uses a triple-symmetric Bricard linkage as its reconfigurable palm, enhancing adaptability to objects of varying shapes and sizes. Through topological and dimensional synthesis, the mechanism achieves a well-balanced degree of freedom and link configuration suitable for reconfigurable palm motion, balancing dexterity, stability, and load capacity. Furthermore, kinematic analysis is performed using screw theory and closed-loop constraints, and performance is evaluated based on workspace, stiffness, and motion/force transmission efficiency. Finally, a prototype is developed and tested through a series of grasping experiments, demonstrating the ability to perform stable and efficient manipulation across a wide range of objects. The results validate the effectiveness of the design in improving grasping versatility and operational precision, offering a promising solution for advanced robotic manipulation tasks.
comment: 8 pages, 14 figures, 2026 IEEE International Conference on Robotics & Automation
Hippo: High-performance Interior-Point and Projection-based Solver for Generic Constrained Trajectory Optimization
Trajectory optimization is the core of modern model-based robotic control and motion planning. Existing trajectory optimizers, based on sequential quadratic programming (SQP) or differential dynamic programming (DDP), are often limited by low computational efficiency, low modeling flexibility, and poor convergence on complex tasks requiring hard constraints. In this paper, we introduce Hippo, a solver that handles inequality constraints using the interior-point method (IPM) with an adaptive barrier update strategy, and hard equality constraints via projection or IPM. Through extensive numerical benchmarks, we show that Hippo is a robust and efficient alternative to existing state-of-the-art solvers for difficult robotic trajectory optimization problems requiring high-quality solutions, such as locomotion and manipulation.
CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives ICLR 2026
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically-plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically-valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
comment: Published at ICLR 2026. Project page: https://crisp-real2sim.github.io/CRISP-Real2Sim/
Neuro-Symbolic Skill Discovery for Conditional Multi-Level Planning
This paper proposes a novel learning architecture for acquiring generalizable high-level symbolic skills from a few unlabeled low-level skill trajectory demonstrations. The architecture involves neural networks for symbol discovery and low-level controller acquisition and a multi-level planning pipeline that utilizes the discovered symbols and the learned low-level controllers. The discovered action symbols are automatically interpreted using visual language models that are also responsible for generating high-level plans. While extracting high-level symbols, our model preserves the low-level information so that low-level action planning can be carried out by using gradient-based planning. To assess the efficacy of our method, we tested the high and low-level planning performance of our architecture by using simulated and real-world experiments across various tasks. The experiments have shown that our method is able to manipulate objects in unseen locations and plan and execute long-horizon tasks by using novel action sequences, even in highly cluttered environments when cued by only a few demonstrations that cover small regions of the environment.
comment: 18 pages, 4 figures
Safe and Optimal Variable Impedance Control via Certified Reinforcement Learning ICRA 2026
Reinforcement learning (RL) offers a powerful approach for robots to learn complex, collaborative skills by combining Dynamic Movement Primitives (DMPs) for motion and Variable Impedance Control (VIC) for compliant interaction. However, this model-free paradigm often risks instability and unsafe exploration due to the time-varying nature of impedance gains. This work introduces Certified Gaussian Manifold Sampling (C-GMS), a novel trajectory-centric RL framework that learns combined DMP and VIC policies while guaranteeing Lyapunov stability and actuator feasibility by construction. Our approach reframes policy exploration as sampling from a mathematically defined manifold of stable gain schedules. This ensures every policy rollout is guaranteed to be stable and physically realizable, thereby eliminating the need for reward penalties or post-hoc validation. Furthermore, we provide a theoretical guarantee that our approach ensures bounded tracking error even in the presence of bounded model errors and deployment-time uncertainties. We demonstrate the effectiveness of C-GMS in simulation and verify its efficacy on a real robot, paving the way for reliable autonomous interaction in complex environments.
comment: Accepted at ICRA 2026
BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
Deploying powerful Vision-Language-Action (VLA) models on edge devices is limited by their massive size. In this paper, we take a deployment-oriented view of VLA training: we target efficiency through model design and optimization, rather than relying solely on post-hoc compression. Thus, we propose BitVLA, a fully native 1-bit VLA model for robotic manipulation, where every parameter is ternary, i.e., {-1,0,1}. BitVLA is built on the publicly available 1-bit LLM BitNet b1.58 2B4T, and is trained as a vision-language-action policy that inherits the compactness of 1-bit pretraining while retaining strong task performance. To further reduce the memory footprint of the vision backbone, we introduce Quantize-then-Distill, a post-training quantization-aware training strategy that compresses a full-precision vision encoder to 1.58-bit weights, while a full-precision teacher guides representation alignment during training. Across simulation benchmarks and real-world tasks, BitVLA matches the performance of the full-precision OpenVLA-OFT baseline, while reducing model memory by 11.0x and end-to-end latency by 4.4x. These results suggest a practical path toward training-time efficiency-accuracy co-design for embodied policies, enabling competitive manipulation capability on memory-constrained edge robotic platforms. We release the code at https://github.com/ustcwhy/BitVLA and the model weights at https://huggingface.co/lxsy/bitvla-bf16.
comment: Work in progress
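To make the ternary weight constraint {-1,0,1} concrete, here is a minimal sketch of absmean ternarization, the per-tensor scheme commonly associated with BitNet b1.58: divide by the mean absolute weight, round, and clip to [-1, 1]. Whether BitVLA uses exactly this rule, or a different granularity, is not stated in the abstract, so the function names and the per-tensor scale below are assumptions.

```python
"""Sketch: absmean ternarization of a weight tensor (flattened to a list).
Hypothetical helper names; not the BitVLA implementation."""

def absmean_ternarize(weights, eps=1e-8):
    """Return (ternary_weights, scale) with ternary values in {-1, 0, 1}."""
    # The scale is the mean absolute value of the tensor.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale

def dequantize(ternary, scale):
    """Approximate reconstruction of the original weights."""
    return [t * scale for t in ternary]

if __name__ == "__main__":
    q, s = absmean_ternarize([0.9, -0.05, -1.2, 0.3])
    print(q, round(s, 4))
```

A ternary value carries log2(3), about 1.58, bits of information, which is where the "1.58-bit" label comes from.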
VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search ICRA 2026
Vision-Language-Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short-sighted next-action prediction and struggle with long-horizon trajectory tasks due to incremental deviations. To address this problem, we propose a plug-in framework named VLA-Reasoner that effectively empowers off-the-shelf VLAs with the capability of foreseeing future states via test-time scaling. Specifically, VLA-Reasoner samples and rolls out possible action trajectories, where the sampled actions serve as rationales for generating future states via a world model, which enables VLA-Reasoner to foresee and reason about potential outcomes and search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, where step-wise VLA predictions seed the root. Meanwhile, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE) to enable efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline value estimation strategy, to score predicted futures and correct deviations with long-term feedback. We conducted extensive experiments in both simulators and the real world, demonstrating that our proposed VLA-Reasoner achieves significant improvements over the state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation of robotic manipulation. The project website is available at: https://vla-reasoner.github.io/.
comment: 8 pages, 6 figures, Accepted by ICRA 2026
Openfly: A comprehensive platform for aerial vision-language navigation ICLR 2026
Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. The potential reason is that outdoor aerial view encompasses vast areas, making data collection more challenging, which results in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. Firstly, we integrate diverse rendering engines and advanced techniques for environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports real-to-sim rendering, further enhancing the realism of our environments. Secondly, we develop a highly automated toolchain for aerial VLN data collection, streamlining point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Thirdly, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. Moreover, we propose OpenFly-Agent, a keyframe-aware VLN model emphasizing key observations during flight. For benchmarking, extensive experiments and analyses are conducted, evaluating several recent VLN methods and showcasing the superiority of our OpenFly platform and agent. The toolchain, dataset, and codes will be open-sourced.
comment: accepted by ICLR 2026
OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
General-purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities. However, recent dual-system approaches, which separate high-level reasoning from low-level acting, often suffer from challenges such as limited mutual understanding of capabilities between systems and latency issues. This paper introduces OneTwoVLA, a single unified vision-language-action model that can perform both acting (System One) and reasoning (System Two). Crucially, OneTwoVLA adaptively switches between two modes: explicitly reasoning at critical moments during task execution, and generating actions based on the most recent reasoning at other times. To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, used for co-training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding, enabling the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.
DA-MMP: Learning Coordinated and Accurate Throwing with Dynamics-Aware Motion Manifold Primitives ICRA 2026
Dynamic manipulation is a key capability for advancing robot performance, enabling skills such as tossing. While recent learning-based approaches have pushed the field forward, most methods still rely on manually designed action parameterizations, limiting their ability to produce the highly coordinated motions required in complex tasks. Motion planning can generate feasible trajectories, but the dynamics gap, stemming from control inaccuracies, contact uncertainties, and aerodynamic effects, often causes large deviations between planned and executed trajectories. In this work, we propose Dynamics-Aware Motion Manifold Primitives (DA-MMP), a motion generation framework for goal-conditioned dynamic manipulation, and instantiate it on a challenging real-world ring-tossing task. Our approach extends motion manifold primitives to variable-length trajectories through a compact parameterization and learns a high-quality manifold from a large-scale dataset of planned motions. Building on this manifold, a conditional flow matching model is trained in the latent space with a small set of real-world trials, enabling the generation of throwing trajectories that account for execution dynamics. Experiments show that our method can generate coordinated and smooth motion trajectories for the ring-tossing task. In real-world evaluations, it achieves high success rates and even surpasses the performance of trained human experts. Moreover, it generalizes to novel targets beyond the training range, indicating that it successfully learns the underlying trajectory-dynamics mapping.
comment: Accepted to ICRA 2026. Project page: https://cc299792458.github.io/da-mmp/
TinyIO: Lightweight Reparameterized Inertial Odometry
Inertial odometry (IO) is a widely used approach for localization on mobile devices; however, obtaining a lightweight IO model that also achieves high accuracy remains challenging. To address this issue, we propose TinyIO, a lightweight IO method. During training, we adopt a multi-branch architecture to extract diverse motion features more effectively. At inference time, the trained multi-branch model is converted into an equivalent single-path architecture to reduce computational complexity. We further propose a Dual-Path Adaptive Attention mechanism (DPAA), which enhances TinyIO's perception of contextual motion along both channel and temporal dimensions with negligible additional parameters. Extensive experiments on public datasets demonstrate that our method attains a favorable trade-off between accuracy and model size. On the RoNIN dataset, TinyIO reduces the ATE by 23.53% compared with R-ResNet and decreases the parameter count by 3.68%.
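The conversion from a multi-branch training model to an equivalent single-path inference model can be illustrated with a minimal sketch of structural reparameterization. The branch layout below (a 3-tap convolution, a 1-tap convolution, and an identity shortcut on a single-channel 1D signal) is an assumption in the style of RepVGG; the abstract does not specify TinyIO's actual branches, and normalization-layer folding is omitted.

```python
"""Sketch: merging parallel linear branches into one convolution kernel.
Branch layout and names are hypothetical, not TinyIO's architecture."""

def conv1d_same(signal, kernel):
    """Cross-correlate with an odd-length kernel using zero padding."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(signal))]

def multi_branch(signal, k3, k1):
    """Training-time form: 3-tap branch + 1-tap branch + identity branch."""
    a = conv1d_same(signal, k3)
    b = conv1d_same(signal, [k1])
    return [u + v + s for u, v, s in zip(a, b, signal)]

def merge_branches(k3, k1):
    """Inference-time form: fold all branches into a single 3-tap kernel.

    The 1-tap kernel and the identity (a 1-tap kernel equal to 1.0) both
    act on the centre sample, so they are added to the centre tap.
    """
    merged = list(k3)
    merged[1] += k1 + 1.0
    return merged

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    k3, k1 = [0.5, -1.0, 0.25], 2.0
    print(multi_branch(x, k3, k1))
    print(conv1d_same(x, merge_branches(k3, k1)))
```

The merge is exact because every branch is linear, so the two printed outputs agree and the single-path model performs one convolution instead of three operations.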
ISS Policy: Scalable Diffusion Policy with Implicit Scene Supervision
Vision-based imitation learning has enabled impressive robotic manipulation skills, but its reliance on object appearance while ignoring the underlying 3D scene structure leads to low training efficiency and poor generalization. To address these challenges, we introduce Implicit Scene Supervision (ISS) Policy, a 3D visuomotor DiT-based diffusion policy that predicts sequences of continuous actions from point cloud observations. We extend DiT with a novel implicit scene supervision module that encourages the model to produce outputs consistent with the scene's geometric evolution, thereby improving the performance and robustness of the policy. Notably, ISS Policy achieves state-of-the-art performance on both single-arm manipulation tasks (MetaWorld) and dexterous hand manipulation (Adroit). In real-world experiments, it also demonstrates strong generalization and robustness. Additional ablation studies show that our method scales effectively with both data and parameters. Code and videos will be released.
Large Scale Robotic Material Handling: Learning, Planning, and Control
Bulk material handling involves the efficient and precise moving of large quantities of materials, a core operation in many industries, including cargo ship unloading, waste sorting, construction, and demolition. These repetitive, labor-intensive, and safety-critical operations are typically performed using large hydraulic material handlers equipped with underactuated grippers. In this work, we present a comprehensive framework for the autonomous execution of large-scale material handling tasks. The system integrates specialized modules for environment perception, pile attack point selection, path planning, and motion control. The main contributions of this work are two reinforcement learning-based modules: an attack point planner that selects optimal grasping locations on the material pile to maximize removal efficiency and minimize the number of scoops, and a robust trajectory following controller that addresses the precision and safety challenges associated with underactuated grippers in movement, while utilizing their free-swinging nature to release material through dynamic throwing. We validate our framework through real-world experiments on a 40 t material handler in a representative worksite, focusing on two key tasks: high-throughput bulk pile management and high-precision truck loading. Comparative evaluations against human operators demonstrate the system's effectiveness in terms of precision, repeatability, and operational safety. To the best of our knowledge, this is the first complete automation of material handling tasks on a full scale.
comment: Final version published in IEEE Transactions on Field Robotics. It includes additional experiments and comparisons with classical methods
DA-VPC: Disturbance-Aware Visual Predictive Control Scheme of Docking Maneuvers for Autonomous Trolley Collection
Service robots have demonstrated significant potential for autonomous trolley collection and redistribution in public spaces like airports or warehouses to improve efficiency and reduce cost. Usually, a fully autonomous system for the collection and transportation of multiple trolleys is based on a Leader-Follower formation of mobile manipulators, where reliable docking maneuvers of the mobile base are essential to align trolleys into organized queues. However, developing a vision-based robotic docking system faces significant challenges: high precision requirements, environmental disturbances, and inherent robot constraints. To address these challenges, we propose a Disturbance-Aware Visual Predictive Control (DA-VPC) scheme that incorporates active infrared markers for robust feature extraction across diverse lighting conditions. This framework explicitly models nonholonomic kinematics and visibility constraints for image-based visual servoing (IBVS), solving the predictive control problem through optimization. It is augmented with an extended state observer (ESO) designed to counteract disturbances during trolley pushing, ensuring precise and stable docking. Experimental results across diverse environments demonstrate the robustness of this system, with quantitative evaluations confirming high docking accuracy.
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels. Both of these processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by enabling policies to roll out within imagination space. However, a key challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies. This requires a world model compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interactions, which is not achieved by previous works. In this paper, we make a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach can improve policy success by 44.7%.
comment: 17 pages
Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs
Task planning under uncertainty is essential for home-service robots operating in the real world. Tasks involve ambiguous human instructions, hidden or unknown object locations, and open-vocabulary object types, leading to significant open-ended uncertainty and a boundlessly large planning space. To address these challenges, we propose Tru-POMDP, a planner that combines structured belief generation using Large Language Models (LLMs) with principled POMDP planning. Tru-POMDP introduces a hierarchical Tree of Hypotheses (TOH), which systematically queries an LLM to construct high-quality particle beliefs over possible world states and human goals. We further formulate an open-ended POMDP model that enables rigorous Bayesian belief tracking and efficient belief-space planning over these LLM-generated hypotheses. Experiments on complex object rearrangement tasks across diverse kitchen environments show that Tru-POMDP significantly outperforms state-of-the-art LLM-based and LLM-tree-search hybrid planners, achieving higher success rates with significantly better plans, stronger robustness to ambiguity and occlusion, and greater planning efficiency.
BiNoMaP: Learning Category-Level Bimanual Non-Prehensile Manipulation Primitives
Non-prehensile manipulation, encompassing ungraspable actions such as pushing, poking, pivoting, and wrapping, remains underexplored due to its contact-rich and analytically intractable nature. We revisit this problem from two perspectives. First, instead of relying on single-arm setups or favorable environmental supports (e.g., walls or edges), we advocate a generalizable dual-arm configuration and establish a suite of Bimanual Non-prehensile Manipulation Primitives (BiNoMaP). Second, departing from prevailing RL-based approaches, we propose a three-stage, RL-free framework for learning structured non-prehensile skills. We begin by extracting bimanual hand motion trajectories from video demonstrations. Since these coarse trajectories suffer from perceptual noise and morphological discrepancies, we introduce a geometry-aware post-optimization algorithm to refine them into executable manipulation primitives consistent with predefined motion patterns. To enable category-level generalization, the learned primitives are further parameterized by object-relevant geometric attributes, primarily size, allowing adaptation to unseen instances with significant shape variations. Importantly, BiNoMaP supports cross-embodiment transfer: the same primitives can be deployed on two real-world dual-arm platforms with distinct kinematic configurations, without redesigning skill structures. Extensive real-robot experiments across diverse objects and spatial configurations demonstrate the effectiveness, efficiency, and strong generalization capability of our approach.
comment: Under review. The project link is https://hnuzhy.github.io/projects/BiNoMaP
Large Language Model-Assisted UAV Operations and Communications: A Multifaceted Survey and Tutorial
Uncrewed Aerial Vehicles (UAVs) are widely deployed across diverse applications due to their mobility and agility. Recent advances in Large Language Models (LLMs) offer a transformative opportunity to enhance UAV intelligence beyond conventional optimization-based and learning-based approaches. By integrating LLMs into UAV systems, advanced environmental understanding, swarm coordination, mobility optimization, and high-level task reasoning can be achieved, thereby allowing more adaptive and context-aware aerial operations. This survey systematically explores the intersection of LLMs and UAV technologies and proposes a unified framework that consolidates existing architectures, methodologies, and applications for UAVs. We first present a structured taxonomy of LLM adaptation techniques for UAVs, including pretraining, fine-tuning, Retrieval-Augmented Generation (RAG), and prompt engineering, along with key reasoning capabilities such as Chain-of-Thought (CoT) and In-Context Learning (ICL). We then examine LLM-assisted UAV communications and operations, covering navigation, mission planning, swarm control, safety, autonomy, and network management. After that, the survey further discusses Multimodal LLMs (MLLMs) for human-swarm interaction, perception-driven navigation, and collaborative control. Finally, we address ethical considerations, including bias, transparency, accountability, and Human-in-the-Loop (HITL) strategies, and outline future research directions. Overall, this work positions LLM-assisted UAVs as a foundation for intelligent and adaptive aerial systems.
comment: 40 pages, 10 figures, 13 tables
AssemMate: Graph-Based LLM for Robotic Assembly Assistance
Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention. It requires the injection of domain-specific knowledge to guide the assembly process through natural language interaction with humans. Despite some progress, existing methods represent knowledge in the form of natural language text. Due to the long context and redundant content, they struggle to meet the robots' requirements for real-time and precise reasoning. In order to bridge this gap, we present AssemMate, which utilizes the graph, a concise and accurate form of knowledge representation, as input. This graph-based LLM enables knowledge graph question answering (KGQA), supporting human-robot interaction and assembly task planning for specific products. Beyond interactive QA, AssemMate also supports sensing stacked scenes and executing grasping to assist with assembly. Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with the LLM's representation, enabling the LLM to understand graph information. In addition, a vision-enhanced strategy is employed to address stacked scenes in grasping. Through training and evaluation, AssemMate outperforms existing methods, achieving 6.4% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs. Our approach further demonstrates superiority through robotic grasping experiments in both simulated and real-world settings. More details can be found on the project page: https://github.com/cristina304/AssemMate.git
Bridging Perception and Planning: Towards End-to-End Planning for Signal Temporal Logic Tasks
We investigate the task and motion planning problem for Signal Temporal Logic (STL) specifications in robotics. Existing STL methods rely on pre-defined maps or mobility representations, which are ineffective in unstructured real-world environments. We propose the Structured-MoE STL Planner (S-MSP), a differentiable framework that maps synchronized multi-view camera observations and an STL specification directly to a feasible trajectory. S-MSP integrates STL constraints within a unified pipeline, trained with a composite loss that combines trajectory reconstruction and STL robustness. A structure-aware Mixture-of-Experts (MoE) model enables horizon-aware specialization by projecting sub-tasks into temporally anchored embeddings. We evaluate S-MSP using a high-fidelity simulation of factory-logistics scenarios with temporally constrained tasks. Experiments show that S-MSP outperforms single-expert baselines in STL satisfaction and trajectory feasibility. A rule-based safety filter at inference improves physical executability without compromising logical correctness, showcasing the practicality of the approach.
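For readers unfamiliar with the STL robustness term in the loss, the following sketch shows the standard quantitative semantics for two basic operators over a discrete-time scalar signal: "always" takes the worst-case margin and "eventually" takes the best-case margin, so a positive value means the specification is satisfied. The function names and the index-based time window are illustrative assumptions; S-MSP's actual loss, which must be differentiable and may use smooth approximations of min and max, is not given in the abstract.

```python
"""Sketch: STL robustness for 'always' and 'eventually' over x > threshold.
Hypothetical helpers; not the S-MSP implementation."""

def always_greater(signal, threshold, start=0, end=None):
    """Robustness of G_[start,end] (x > threshold): the minimum margin."""
    window = signal[start:end]
    return min(x - threshold for x in window)

def eventually_greater(signal, threshold, start=0, end=None):
    """Robustness of F_[start,end] (x > threshold): the maximum margin."""
    window = signal[start:end]
    return max(x - threshold for x in window)

if __name__ == "__main__":
    clearance = [0.2, 0.5, 1.5, 0.8]  # e.g. distance to an obstacle per step
    print(always_greater(clearance, 1.0))      # negative: violated at some step
    print(eventually_greater(clearance, 1.0))  # positive: satisfied at some step
```

A loss of the kind the abstract describes could, for instance, add a penalty on negative robustness to a trajectory reconstruction error, though the exact combination is not specified there.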
DDP-WM: Disentangled Dynamics Prediction for Efficient World Models
World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformer-based models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a cross-attention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to 98% compared to state-of-the-art dense models. The results establish a promising path for developing efficient, high-fidelity world models. Code will be available at https://github.com/HCPLab-SYSU/DDP-WM.
comment: Efficient and high-fidelity world model. Code is available at https://github.com/HCPLab-SYSU/DDP-WM
COMRES-VLM: Coordinated Multi-Robot Exploration and Search using Vision Language Models
Autonomous exploration and object search in unknown indoor environments remain challenging for multi-robot systems (MRS). Traditional approaches often rely on greedy frontier assignment strategies with limited inter-robot coordination. In this work, we present Coordinated Multi-Robot Exploration and Search using Vision Language Models (COMRES-VLM), a novel framework that leverages Vision Language Models (VLMs) for intelligent coordination of MRS tasked with efficient exploration and target object search. COMRES-VLM integrates real-time frontier cluster extraction and topological skeleton analysis with VLM reasoning over shared occupancy maps, robot states, and optional natural language priors, in order to generate globally consistent waypoint assignments. Extensive experiments in large-scale simulated indoor environments with up to six robots demonstrate that COMRES-VLM consistently outperforms state-of-the-art coordination methods, including Capacitated Vehicle Routing Problem (CVRP) and Voronoi-based planners, achieving 10.2% faster exploration completion and 55.7% higher object search efficiency. Notably, COMRES-VLM enables natural language-based object search capabilities, allowing human operators to provide high-level semantic guidance that traditional algorithms cannot interpret.
Multiagent Systems
Epistemic Gain, Aleatoric Cost: Uncertainty Decomposition in Multi-Agent Debate for Math Reasoning
Multi-Agent Debate (MAD) has shown promise in leveraging collective intelligence to improve reasoning and reduce hallucinations, yet it remains unclear how information exchange shapes the underlying ability. Empirically, MAD exhibits paradoxical phenomena, such as accuracy improvement accompanied by substantial increase in token entropy, and remarkable divergence between homogeneous and heterogeneous model combinations. In this paper, we propose a Bayesian uncertainty analysis framework for MAD, which decomposes total predictive uncertainty into epistemic uncertainty reducible by debate context and aleatoric uncertainty induced by internal model noise. Across multiple model configurations, we find that effective debate hinges on achieving high epistemic gain under controlled aleatoric cost. Building on this insight, we design an uncertainty-guided multi-agent reinforcement learning (MARL) algorithm that explicitly optimizes aleatoric noise reduction and epistemic information utilization. Experiments show that our training significantly improves post-debate accuracy and stability, and enhances individual reasoning beyond single-agent RL, providing a unified Bayesian uncertainty perspective for understanding and improving MAD.
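As a reference point for the decomposition mentioned in this abstract, the sketch below implements the standard entropy-based split of predictive uncertainty: total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual predictions, and the epistemic part is their difference (a mutual information term). How the paper assigns debate context and model noise to these terms is not specified in the abstract, so treating each input distribution as one sampled answer distribution is an assumption, as are the function names.

```python
"""Sketch: entropy-based decomposition of predictive uncertainty.
Generic Bayesian formulation; not necessarily the paper's estimator."""
import math

def entropy(p):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(predictions):
    """Return (total, aleatoric, epistemic) for a list of distributions.

    total     = H[mean of predictions]
    aleatoric = mean of H[prediction]
    epistemic = total - aleatoric (non-negative by Jensen's inequality)
    """
    n = len(predictions)
    k = len(predictions[0])
    mean = [sum(p[c] for p in predictions) / n for c in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in predictions) / n
    return total, aleatoric, total - aleatoric

if __name__ == "__main__":
    # Confident but disagreeing predictions: uncertainty is all epistemic.
    print(decompose([[1.0, 0.0], [0.0, 1.0]]))
    # Agreeing but individually unsure predictions: all aleatoric.
    print(decompose([[0.5, 0.5], [0.5, 0.5]]))
```

The two printed cases show the extremes: disagreement between confident predictions registers as epistemic uncertainty, while shared indecision registers as aleatoric uncertainty.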
Can AI Agents Agree?
Large language models are increasingly deployed as cooperating agents, yet their behavior in adversarial consensus settings has not been systematically studied. We evaluate LLM-based agents on a Byzantine consensus game over scalar values using a synchronous all-to-all simulation. We test consensus in a no-stake setting where agents have no preferences over the final value, so evaluation focuses on agreement rather than value optimality. Across hundreds of simulations spanning model sizes, group sizes, and Byzantine fractions, we find that valid agreement is not reliable even in benign settings and degrades as group size grows. Introducing a small number of Byzantine agents further reduces success. Failures are dominated by loss of liveness, such as timeouts and stalled convergence, rather than subtle value corruption. Overall, the results suggest that reliable agreement is not yet a dependable emergent capability of current LLM-agent groups even in no-stake settings, raising caution for deployments that rely on robust coordination.
MedCollab: Causal-Driven Multi-Agent Collaboration for Full-Cycle Clinical Diagnosis via IBIS-Structured Argumentation
Large language models (LLMs) have shown promise in healthcare applications; however, their use in clinical practice is still limited by diagnostic hallucinations and insufficiently interpretable reasoning. We present MedCollab, a novel multi-agent framework that emulates the hierarchical consultation workflow of modern hospitals to autonomously navigate the full-cycle diagnostic process. The framework incorporates a dynamic specialist recruitment mechanism that adaptively assembles clinical and examination agents according to patient-specific symptoms and examination results. To ensure the rigor of clinical work, we adopt a structured Issue-Based Information System (IBIS) argumentation protocol that requires agents to provide ``Positions'' backed by traceable evidence from medical knowledge and clinical data. Furthermore, the framework constructs a Hierarchical Disease Causal Chain that transforms flattened diagnostic predictions into a structured model of pathological progression through explicit logical operators. A multi-round Consensus Mechanism iteratively filters low-quality reasoning through logic auditing and weighted voting. Evaluated on real-world clinical datasets, MedCollab significantly outperforms pure LLMs and medical multi-agent systems in Accuracy and RaTEScore, demonstrating a marked reduction in medical hallucinations. These findings indicate that MedCollab provides an extensible, transparent, and clinically compliant approach to medical decision-making.
Silo-Bench: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems
Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. Yet whether agents can reliably compute with distributed information -- rather than merely exchange it -- remains an open question. We introduce Silo-Bench, a role-agnostic benchmark of 30 algorithmic tasks across three communication complexity levels, evaluating 54 configurations over 1,620 experiments. Our experiments expose a fundamental Communication-Reasoning Gap: agents spontaneously form task-appropriate coordination topologies and exchange information actively, yet systematically fail to synthesize distributed state into correct answers. The failure is localized to the reasoning-integration stage -- agents often acquire sufficient information but cannot integrate it. This coordination overhead compounds with scale, eventually eliminating parallelization gains entirely. These findings demonstrate that naively scaling agent count cannot circumvent context limitations, and Silo-Bench provides a foundation for tracking progress toward genuinely collaborative multi-agent systems.
comment: 19 pages, 7 figures
SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents for Rapid Design Evaluation
A/B testing is a standard method for validating design decisions, yet its reliance on real user traffic limits iteration speed and makes certain experiments impractical. We present SimAB, a system that reframes A/B testing as a fast, privacy-preserving simulation using persona-conditioned AI agents. Given design screenshots and a conversion goal, SimAB generates user personas, deploys them as agents that state their preference, aggregates results, and synthesizes rationales. Through a formative study with experimentation practitioners, we identified scenarios where traffic constraints hinder testing, including low-traffic pages, multi-variant comparisons, micro-optimizations, and privacy-sensitive contexts. Our design emphasizes speed, early feedback, actionable rationales, and audience specification. We evaluate SimAB against 47 historical A/B tests with known outcomes, achieving 67% overall accuracy, increasing to 83% for high-confidence cases. Additional experiments show robustness to naming and positional bias and demonstrate accuracy gains from personas. Practitioner feedback suggests that SimAB supports faster evaluation cycles and rapid screening of designs difficult to assess with traditional A/B tests.
comment: 18 pages
BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning
Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect, but also cause equipment damage or experimental failure. To address this, we propose BioProAgent, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous Design-Verify-Rectify workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by Semantic Symbol Grounding, reducing token consumption by $\sim$6$\times$ through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6\% physical compliance (compared to 21.0\% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. Code is available at https://github.com/YuyangSunshine/bioproagent and the project page at https://yuyangsunshine.github.io/BioPro-Project/
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Ensuring reliable data-driven decisions is crucial in domains where analytical accuracy directly impacts safety, compliance, or operational outcomes. Decision support in such domains relies on large tabular datasets, where manual analysis is slow, costly, and error-prone. While Large Language Models (LLMs) offer promising automation potential, they face challenges in analytical reasoning, structured data handling, and ambiguity resolution. This paper introduces GateLens, an LLM-based architecture for reliable analysis of complex tabular data. Its key innovation is the use of Relational Algebra (RA) as a formal intermediate representation between natural-language reasoning and executable code, addressing the reasoning-to-code gap that can arise in direct generation approaches. In our automotive instantiation, GateLens translates natural language queries into RA expressions and generates optimized Python code. Unlike traditional multi-agent or planning-based systems that can be slow, opaque, and costly to maintain, GateLens emphasizes speed, transparency, and reliability. We validate the architecture in automotive software release analytics, where experimental results show that GateLens outperforms the existing Chain-of-Thought (CoT) + Self-Consistency (SC) based system on real-world datasets, particularly in handling complex and ambiguous queries. Ablation studies confirm the essential role of the RA layer. Industrial deployment demonstrates over 80% reduction in analysis time while maintaining high accuracy across domain-specific tasks. GateLens operates effectively in zero-shot settings without requiring few-shot examples or agent orchestration. This work advances deployable LLM system design by identifying key architectural features--intermediate formal representations, execution efficiency, and low configuration overhead--crucial for domain-specific analytical applications.
Sample-Efficient Distributionally Robust Multi-Agent Reinforcement Learning via Online Interaction ICLR 2026
Well-trained multi-agent systems can fail when deployed in real-world environments due to model mismatches between the training and deployment environments, caused by environment uncertainties including noise or adversarial attacks. Distributionally Robust Markov Games (DRMGs) enhance system resilience by optimizing for worst-case performance over a defined set of environmental uncertainties. However, current methods are limited by their dependence on simulators or large offline datasets, which are often unavailable. This paper pioneers the study of online learning in DRMGs, where agents learn directly from environmental interactions without prior data. We introduce the Multiplayer Optimistic Robust Nash Value Iteration (MORNAVI) algorithm and provide the first provable guarantees for this setting. Our theoretical analysis demonstrates that the algorithm achieves low regret and efficiently finds the optimal robust policy for uncertainty sets measured by Total Variation divergence and Kullback-Leibler divergence. These results establish a new, practical path toward developing truly robust multi-agent systems.
comment: Accepted by ICLR 2026. The first two authors contributed equally
Multi-Agent Reinforcement Learning with Communication-Constrained Priors
Communication is an effective means of improving the learning of cooperative policies in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing multi-agent reinforcement learning methods with communication, owing to their limited scalability and robustness, struggle to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. We then use this model as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimator, and introduce a communication-constrained multi-agent reinforcement learning framework that quantifies the impact of communication messages in the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.
When Is Diversity Rewarded in Cooperative Multi-Agent Learning?
The success of teams in robotics, nature, and society often depends on the division of labor among diverse specialists; however, a principled explanation for when such diversity surpasses a homogeneous team is still missing. Focusing on multi-agent task allocation problems, we study this question from the perspective of reward design: what kinds of objectives are best suited for heterogeneous teams? We first consider an instantaneous, non-spatial setting where the global reward is built by two generalized aggregation operators: an inner operator that maps the $N$ agents' effort allocations on individual tasks to a task score, and an outer operator that merges the $M$ task scores into the global team reward. We prove that the curvature of these operators determines whether heterogeneity can increase reward, and that for broad reward families this collapses to a simple convexity test. Next, we ask what incentivizes heterogeneity to emerge when embodied, time-extended agents must learn an effort allocation policy. To study heterogeneity in such settings, we use multi-agent reinforcement learning (MARL) as our computational paradigm, and introduce Heterogeneity Gain Parameter Search (HetGPS), a gradient-based algorithm that optimizes the parameter space of underspecified MARL environments to find scenarios where heterogeneity is advantageous. Across different environments, we show that HetGPS rediscovers the reward regimes predicted by our theory to maximize the advantage of heterogeneity, both validating HetGPS and connecting our theoretical insights to reward design in MARL. Together, these results help us understand when behavioral diversity delivers a measurable benefit.
UFO3: Weaving the Digital Agent Galaxy
Large language model (LLM)-powered agents are transforming digital devices from passive tools into proactive intelligent collaborators. However, most existing frameworks remain confined to a single OS or device, making cross-device workflows brittle and largely manual. We present UFO$^3$, a system that unifies heterogeneous endpoints (desktops, servers, mobile devices, and edge) into a single orchestration fabric. UFO$^3$ models each user request as a mutable TaskConstellation: a distributed DAG of atomic subtasks (TaskStars) with explicit control and data dependencies (TaskStarLines). The TaskConstellation continuously evolves as results stream in from distributed devices, enabling asynchronous execution, adaptive recovery, and dynamic optimization. A Constellation Orchestrator executes tasks safely and asynchronously while applying dynamic DAG updates, and the Agent Interaction Protocol (AIP) provides persistent, low-latency channels for reliable task dispatch and result streaming. These designs dissolve the traditional boundaries between devices and platforms, allowing agents to collaborate seamlessly and amplify their collective intelligence. We evaluate UFO$^3$ on NebulaBench, a benchmark of 55 cross-device tasks across 5 machines and 10 categories. UFO$^3$ achieves 83.3% subtask completion, 70.9% task success, exposes parallelism with an average width of 1.72, and reduces end-to-end latency by 31% relative to a sequential baseline. Fault-injection experiments demonstrate graceful degradation and recovery under transient and permanent agent failures. These results show that UFO$^3$ achieves accurate, efficient, and resilient task orchestration across heterogeneous devices, uniting isolated agents into a coherent, adaptive computing fabric that extends across the landscape of ubiquitous computing.
comment: We developed UFO$^3$ as a fully engineered system with over 73K lines of code, encompassing agent implementations and integrations for Windows, Linux, and Android mobile devices. The entire project is open-sourced at https://github.com/microsoft/UFO/, accompanied by detailed documentation and tutorials at https://microsoft.github.io/UFO/
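To make the idea of a DAG of subtasks with explicit dependencies concrete, the sketch below groups tasks into layers that could run in parallel, using Kahn's algorithm. It is a generic illustration, not UFO$^3$'s orchestrator: the task names, the `deps` mapping, and the "average layer size" statistic are all assumptions introduced here, and the latter may differ from how the paper defines average width.

```python
from collections import defaultdict


def schedule_layers(deps):
    """Group DAG nodes into layers whose members can run in parallel.

    deps: maps each task name to the list of tasks it depends on
    (hypothetical format, chosen for this example).
    """
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    # Start with tasks that have no unmet dependencies.
    ready = sorted(t for t, n in indegree.items() if n == 0)
    layers = []
    while ready:
        layers.append(ready)
        released = []
        for task in ready:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    released.append(child)
        ready = sorted(released)

    if sum(len(layer) for layer in layers) != len(deps):
        raise ValueError("dependency cycle or undefined task")
    return layers


# Hypothetical cross-device request: two fetches, then a merge, then a report.
example = {
    "fetch_logs": [],
    "fetch_metrics": [],
    "merge": ["fetch_logs", "fetch_metrics"],
    "report": ["merge"],
}
example_layers = schedule_layers(example)
average_layer_size = len(example) / len(example_layers)
```

A dynamic orchestrator would additionally need to re-run this kind of scheduling whenever the graph is edited at run time, which this static sketch does not attempt.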
Adaptive Context Length Optimization with Low-Frequency Truncation for Multi-Agent Reinforcement Learning
Recently, deep multi-agent reinforcement learning (MARL) has demonstrated promising performance on challenging tasks, such as those involving long-term dependencies and non-Markovian environments. Its success is partly attributed to conditioning policies on a large fixed context length. However, such large fixed context lengths may lead to limited exploration efficiency and redundant information. In this paper, we propose a novel MARL framework to obtain adaptive and effective contextual information. Specifically, we design a central agent that dynamically optimizes context length via temporal gradient analysis, enhancing exploration to facilitate convergence to global optima in MARL. Furthermore, to enhance the adaptive optimization capability of the context length, we present an efficient input representation for the central agent, which effectively filters redundant information. By leveraging a Fourier-based low-frequency truncation method, we extract global temporal trends across decentralized agents, providing an effective and efficient representation of the MARL environment. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on long-term dependency tasks, including PettingZoo, MiniGrid, Google Research Football (GRF), and StarCraft Multi-Agent Challenge v2 (SMACv2).
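As a rough illustration of Fourier-based low-frequency truncation, the sketch below transforms a scalar time series, zeroes every frequency bin above a cutoff, and transforms back, leaving only the slow trend. The abstract does not specify the paper's exact procedure, so the function name `lowpass`, the cutoff parameter `keep`, and the use of a plain discrete Fourier transform are assumptions made for this example.

```python
import cmath


def lowpass(signal, keep):
    """Keep only DFT bins whose frequency index is at most `keep`.

    signal: list of real samples; keep: highest frequency index retained
    (0 keeps only the mean). Uses a direct O(n^2) DFT for clarity.
    """
    n = len(signal)
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Bin k and bin n-k represent the same frequency, so treat them together.
    truncated = [c if min(k, n - k) <= keep else 0 for k, c in enumerate(spectrum)]
    return [
        (sum(truncated[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


# A constant trend of 1 plus a fastest-possible oscillation of amplitude 1.
trend = lowpass([2.0, 0.0, 2.0, 0.0], keep=0)
```

With `keep=0` only the mean survives, so the oscillation is removed and the constant trend remains.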
COMRES-VLM: Coordinated Multi-Robot Exploration and Search using Vision Language Models
Autonomous exploration and object search in unknown indoor environments remain challenging for multi-robot systems (MRS). Traditional approaches often rely on greedy frontier assignment strategies with limited inter-robot coordination. In this work, we present Coordinated Multi-Robot Exploration and Search using Vision Language Models (COMRES-VLM), a novel framework that leverages Vision Language Models (VLMs) for intelligent coordination of MRS tasked with efficient exploration and target object search. COMRES-VLM integrates real-time frontier cluster extraction and topological skeleton analysis with VLM reasoning over shared occupancy maps, robot states, and optional natural language priors, in order to generate globally consistent waypoint assignments. Extensive experiments in large-scale simulated indoor environments with up to six robots demonstrate that COMRES-VLM consistently outperforms state-of-the-art coordination methods, including Capacitated Vehicle Routing Problem (CVRP) and Voronoi-based planners, achieving 10.2\% faster exploration completion and 55.7\% higher object search efficiency. Notably, COMRES-VLM enables natural language-based object search capabilities, allowing human operators to provide high-level semantic guidance that traditional algorithms cannot interpret.
Systems and Control (EESS)
Opponent State Inference Under Partial Observability: An HMM-POMDP Framework for 2026 Formula 1 Energy Strategy
The 2026 Formula 1 technical regulations introduce a fundamental change to energy strategy: under a 50/50 internal combustion engine / battery power split with unlimited regeneration and a driver-controlled Override Mode (abbreviated MOM throughout), the optimal energy deployment policy depends not only on a driver's own state but on the hidden state of rival cars. This creates a Partially Observable Stochastic Game that cannot be solved by single-agent optimisation methods. We present a tractable two-layer inference and decision framework. The first layer is a 30-state Hidden Markov Model (HMM) that infers a probability distribution over each rival's ERS charge level, Override Mode status, and tyre degradation state from five publicly observable telemetry signals. The second layer is a Deep Q-Network (DQN) policy that takes the HMM belief state as input and selects between energy deployment strategies. We formally characterise the counter-harvest trap -- a deceptive strategy in which a car deliberately suppresses observable deployment signals to induce a rival into a failed attack -- and show that detecting it requires belief-state inference rather than reactive threshold rules. On synthetic races generated from the model's own assumptions, the HMM achieves 92.3% ERS inference accuracy (random baseline: 33.3%) and detects counter-harvest trap conditions with 95.7% recall. Pre-registration -- empirical validation begins Australian Grand Prix, 8 March 2026.
comment: 17 pages. Pre-registered theoretical framework; empirical calibration on 2026 race telemetry begins Australian Grand Prix, 8 March 2026. Paper 1 of 3. ResearchGate preprint: DOI 10.13140/RG.2.2.16034.08644
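The first layer described in this abstract maintains a belief over a rival's hidden state. A single step of standard HMM filtering (predict with the transition matrix, then reweight by the observation likelihood) can be sketched as follows. The three-state toy example and all numbers are assumptions for illustration; the paper's 30-state model and its five telemetry signals are not reproduced here.

```python
def forward_step(belief, transition, likelihood):
    """One HMM filtering update.

    belief: current probability of each hidden state.
    transition[i][j]: probability of moving from state i to state j.
    likelihood[j]: probability of the new observation given state j.
    """
    n = len(belief)
    # Predict: propagate the belief through the transition model.
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    # Correct: weight by how well each state explains the observation.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# Toy example: three hypothetical charge levels (low, medium, high).
prior = [1 / 3, 1 / 3, 1 / 3]
stay_put = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
posterior = forward_step(prior, stay_put, likelihood=[0.8, 0.1, 0.1])
```

The resulting belief vector is the kind of input that a downstream policy, such as the DQN mentioned in the abstract, could consume in place of a point estimate.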
AC-Informed DC Optimal Transmission Switching via Admittance Sensitivity-Augmented Constraints and Repair Costs
AC optimal transmission switching (AC-OTS) is a computationally challenging problem due to the nonconvexity and nonlinearity of AC power-flow (PF) equations coupled with a large number of binary variables. A computationally efficient alternative is the DC-OTS model, which uses the DC PF equations, but it can yield infeasible or suboptimal switching decisions when evaluated under the full AC optimal power flow (AC-OPF). To tackle this issue, we propose an AC-Informed DC Optimal Transmission Switching (AIDC-OTS) scheme that enhances the DC-OTS model with constraints based on first- and second-order admittance sensitivities and with repair/penalty costs that guide the DC-OTS towards AC-feasible topologies. The resulting model is initially a Mixed-Integer Quadratically Constrained Quadratic Program (MIQCQP), which we further reformulate into solver-friendly representations, such as a Mixed-Integer Second-Order Cone Program (MISOCP) and a Mixed-Integer Linear Program (MILP). This proposed scheme yields switching topologies that are AC-feasible, while maintaining computational tractability. We validate the proposed scheme using extensive simulations across a large set of PGlib test cases, demonstrating its effectiveness, with performance benchmarks against original DC-OTS and other OTS formulations such as LPAC-OTS and QC-OTS.
comment: 10 pages
Least-Cost Overvoltage Control in PV-Rich Distribution Networks via Unbalanced Optimal Power Flow
The increasing penetration of photovoltaic (PV) generation in low-voltage distribution networks presents operational challenges, with overvoltages being among the most critical. This study introduces a tool based on Unbalanced Optimal Power Flow (UBOPF) to assess cost-effective local inverter control strategies specifically aimed at mitigating overvoltage issues. Two approaches are examined: dynamic active power curtailment and combined active and reactive power control. These strategies are tested on a residential low-voltage network with high PV penetration, where the UBOPF model with voltage-magnitude constraints was implemented in Julia using the JuMP optimization package. The results demonstrate that both methods are effective in maintaining voltage levels within regulatory limits, with the latter leading to lower PV curtailment. The analysis highlights the need to consider these control actions as ancillary services to the grid, which should be properly compensated given their effect on generator revenues.
comment: Published in journal Sustainable Energy, Grids and Networks
Digital Twin-Based Cooling System Optimization for Data Center
Data center cooling systems consume significant auxiliary energy, yet optimization studies rarely quantify the gap between theoretically optimal and operationally deployable control strategies. This paper develops a digital twin of the liquid cooling infrastructure at the Frontier exascale supercomputer, in which a hot-temperature water system comprises three parallel subloops, each serving dedicated coolant distribution unit clusters through plate heat exchangers and variable-speed pumps. The surrogate model is built in Modelica and validated against one full calendar year of 10-minute operational data following ASHRAE Guideline 14. The model achieves a subloop coefficient of variation of the root mean square error below 2.7% and a normalized mean bias error within 2.5%. Using this validated surrogate model, a layered optimization framework evaluates three progressively constrained strategies: an analytical flow-only optimization achieves 20.4% total energy saving, unconstrained joint optimization of flow rate and supply temperature achieves 30.1% total energy saving, and ramp-constrained optimization of flow rate and supply temperature, which enforces actuator rate limits, reaches 27.8% total energy saving. The analysis reveals that the baseline system operates at 2.9 times the minimum thermally safe flow rate, and that co-optimizing supply temperature with flow rate nearly doubles the savings achievable by flow reduction alone.
comment: 43 pages, 13 figures
Extending Adaptive Cruise Control with Machine Learning Intrusion Detection Systems
An Adaptive Cruise Control (ACC) system automatically adjusts the host vehicle's speed to maintain a safe following distance from a lead vehicle. In typical implementations, a feedback controller (e.g., a Proportional-Integral-Derivative (PID) controller) computes the host vehicle's acceleration using a target speed and a spacing error, defined as the difference between the measured inter-vehicle distance and a desired safe distance. ACC is often assumed to be resilient to fault-injection attacks because a Kalman filter (KF) can smooth noisy speed measurements. However, we show--through analytical proofs and simulation results--that a KF can tolerate injected speed values only up to a bounded threshold. When injected values exceed this threshold, the filter can be driven off track, causing the ACC controller to make unsafe acceleration decisions and potentially leading to collisions. Our main contribution is to augment the PID-based controller with Intrusion Detection System (IDS) outputs, yielding Intrusion Detection Systems-Based Adaptive Cruise Control (ACC-IDS). The ACC-IDS controller is simple and implementable: a binary intrusion flag switches the control law to emergency braking. We prove that augmenting ACC with an IDS, under assumed detection-performance and latency constraints, can mitigate these attacks and help preserve ACC's collision-avoidance guarantees.
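The switching rule described in this abstract, where a binary intrusion flag overrides the feedback law with emergency braking, can be sketched as below. This is a simplified illustration rather than the paper's controller: it regulates spacing error only (the target-speed term is omitted), and the gains, the time step, and the `MAX_BRAKE` value are assumed for the example.

```python
from dataclasses import dataclass

MAX_BRAKE = -6.0  # m/s^2; assumed emergency deceleration, not from the paper


@dataclass
class PID:
    """Textbook PID controller with internal integral and derivative state."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def acc_ids_accel(pid, measured_gap, desired_gap, intrusion_flag, dt=0.1):
    """Commanded acceleration: emergency braking if the IDS raises its flag,
    otherwise PID control on the spacing error."""
    if intrusion_flag:
        return MAX_BRAKE
    spacing_error = measured_gap - desired_gap
    return pid.step(spacing_error, dt)
```

Because the override ignores the (possibly corrupted) measurements entirely, the safety of this scheme rests on the detector's accuracy and latency, which is why the abstract states its guarantees under assumed detection-performance and latency constraints.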
Kernel Methods for Stochastic Dynamical Systems with Application to Koopman Eigenfunctions: Feynman-Kac Representations and RKHS Approximation
We extend the unified kernel framework for transport equations and Koopman eigenfunctions, developed in previous work by the authors for deterministic systems, to stochastic differential equations (SDEs). In the deterministic setting, three analytically grounded constructions--Lions-type variational principles, Green's function convolution, and resolvent operators along characteristic flows--were shown to yield identical reproducing kernels. For stochastic systems, the Koopman generator includes a second-order diffusion term, transforming the first-order hyperbolic transport equation into a second-order elliptic-parabolic PDE. This fundamental change necessitates replacing the method of characteristics with probabilistic representations based on the Feynman--Kac formula. Our main contributions include: (i) extension of all three kernel constructions to stochastic systems via Feynman--Kac path-integral representations; (ii) proof of kernel equivalence under uniform ellipticity assumptions; (iii) a collocation-based computational framework incorporating second-order differential operators; (iv) error bounds separating RKHS approximation error from Monte Carlo sampling error; (v) analysis of how diffusion affects numerical conditioning; and (vi) connections to generator EDMD, diffusion maps, and kernel analog forecasting. Numerical experiments on Ornstein--Uhlenbeck processes, nonlinear SDEs with varying diffusion strength, and multi-dimensional systems validate the theoretical developments and demonstrate that moderate diffusion can improve numerical stability through elliptic regularization.
Observer-Based Active Fault/Disturbance Compensation Control for Fully Actuated Systems SC
This paper is concerned with fault/disturbance compensation control for fully actuated systems. In particular, we explore observer-based control, incorporating an active compensation mechanism. First, we propose a novel observer with enhanced design flexibility for the fully actuated system model, enabling simultaneous estimation of system states and exogenous unknown signals, such as faults or disturbances. Then, a nonlinear controller is developed with an active fault or disturbance compensation term, leveraging the fully actuated system approach. The asymptotic stability of both the state estimation error and the closed-loop control system is systematically established. Finally, the feasibility and merits of the proposed method are validated through comparative simulations and experiments.
comment: This paper was initially accepted by SCIENCE CHINA Information Sciences on 09-Oct-2025, editorially revised on 28-Feb-2026, and has been scheduled for publication in Volume 69, Issue 5 (2026). *Corresponding author: Guang-Ren Duan
Intent-Context Synergy Reinforcement Learning for Autonomous UAV Decision-Making in Air Combat
Autonomous UAV infiltration in dynamic contested environments remains a significant challenge due to the partially observable nature of threats and the conflicting objectives of mission efficiency versus survivability. Traditional Reinforcement Learning (RL) approaches often suffer from myopic decision-making and struggle to balance these trade-offs in real-time. To address these limitations, this paper proposes an Intent-Context Synergy Reinforcement Learning (ICS-RL) framework. The framework introduces two core innovations: (1) An LSTM-based Intent Prediction Module that forecasts the future trajectories of hostile units, transforming the decision paradigm from reactive avoidance to proactive planning via state augmentation; (2) A Context-Analysis Synergy Mechanism that decomposes the mission into hierarchical sub-tasks (safe cruise, stealth planning, and hostile breakthrough). We design a heterogeneous ensemble of Dueling DQN agents, each specialized in a specific tactical context. A dynamic switching controller based on Max-Advantage values seamlessly integrates these agents, allowing the UAV to adaptively select the optimal policy without hard-coded rules. Extensive simulations demonstrate that ICS-RL significantly outperforms baselines (Standard DDQN) and traditional methods (PSO, Game Theory). The proposed method achieves a mission success rate of 88\% and reduces the average exposure frequency to 0.24 per episode, validating its superiority in ensuring robust and stealthy penetration in high-dynamic scenarios.
Battery Lifetime Prediction using Data-driven Modeling Approaches
Batteries are ubiquitous today, with applications ranging from smartphones, watches, and laptops to electric cars, drones, and electric aircraft. Lithium-ion batteries are widely used in these applications due to their high energy density, rechargeability, and low lifecycle cost. Understanding the lifetime of lithium-ion batteries is essential for their effective utilization across many domains. In this study, data-driven modeling approaches are explored to predict the lifetime of lithium-ion batteries using various measurable battery parameters. A battery dataset from NASA's electric aircraft experiments was used, which included 17 predictor variables and remaining flight time as the response variable representing battery lifetime. The dataset contained more than 4,000,000 rows. However, the original dataset provided limited directly useful information about battery utilization over time; therefore, feature engineering was performed to generate more informative variables. Additionally, dimensionality reduction using principal component analysis (PCA) was applied to reduce computational cost and model complexity by selecting a smaller number of principal components as predictors for model development. Random forest and neural network models were explored for battery lifetime prediction using the engineered features. Multiple neural network configurations were evaluated, including single- and double-hidden-layer architectures with varying numbers of nodes. Mean squared error (MSE) on the test dataset was used as the performance metric for model comparison. The results indicate that data-driven modeling approaches are effective for battery lifetime prediction, with neural network models outperforming other models based on the MSE metric. Furthermore, neural networks demonstrate robustness in handling high-dimensional battery data.
Artificial Superintelligence May be Useless: Equilibria in the Economy of Multiple AI Agents
With recent development of artificial intelligence, it is more common to adopt AI agents in economic activities. This paper explores the economic actions of agents, including human agents and AI agents, in an economic game of trading products/services, and the equilibria in this economy involving multiple agents. We derive a range of equilibrium results and their corresponding conditions using a Markov chain stationary distribution based model. One distinct feature of our model is that we consider the long-term utility generated by economic activities instead of their short-term benefits. For the model consisting of two agents, we fully characterize all the possible economic equilibria and conditions. Interestingly, we show that unless each agent can at least double (not merely increase) its marginal utility by purchasing the other agent's products/services, purchasing the other agent's products/services will not happen in any economic equilibrium. We further extend our results to three and more agents, where we characterize more economic equilibria. We find that in some equilibria, the ``more powerful'' AI agents contribute zero utility to ``less capable'' agents.
comment: 20 pages
Enhancing Hallucination Detection through Noise Injection ICLR 2026
Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from multiple samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is suboptimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple, training-free approach based on perturbing an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate that our approach significantly improves inference-time hallucination detection over standard sampling across diverse datasets, model architectures, and uncertainty metrics.
comment: ICLR 2026 main conference paper
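A minimal sketch of the idea described in the abstract above, under loud assumptions: the "model" is a toy list of logits, parameter perturbation is approximated by adding Gaussian noise to those logits before each sample, and dispersion is measured as the Shannon entropy of the sampled answers. The names `noisy_answer`, `dispersion`, and `sigma` are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def noisy_answer(logits, sigma, rng):
    """Toy stand-in for one model sample (hypothetical, not the paper's model).

    Gaussian noise on the logits plays the role of perturbing parameters or
    hidden activations; the answer is then drawn from the softmax.
    """
    perturbed = [l + rng.gauss(0.0, sigma) for l in logits]
    m = max(perturbed)  # subtract the max for numerical stability
    weights = [math.exp(p - m) for p in perturbed]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def dispersion(logits, sigma, n_samples=200, seed=0):
    """Entropy of answers drawn under noise injection; higher means less certain."""
    rng = random.Random(seed)
    answers = [noisy_answer(logits, sigma, rng) for _ in range(n_samples)]
    return answer_entropy(answers)
```

Setting `sigma=0.0` recovers ordinary sampling from the toy distribution, so the same function can be used to compare the two sampling schemes on equal footing.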
HyperKKL: Enabling Non-Autonomous State Estimation through Dynamic Weight Conditioning ICLR 2026
This paper proposes HyperKKL, a novel learning approach for designing Kazantzis-Kravaris/Luenberger (KKL) observers for non-autonomous nonlinear systems. While KKL observers offer a rigorous theoretical framework by immersing nonlinear dynamics into a stable linear latent space, their practical realization relies on solving Partial Differential Equations (PDEs) that are analytically intractable. Existing learning-based approximations of the KKL observer are mostly designed for autonomous systems, failing to generalize to driven dynamics without expensive retraining or online gradient updates. HyperKKL addresses this by employing a hypernetwork architecture that encodes the exogenous input signal to instantaneously generate the parameters of the KKL observer, effectively learning a family of immersion maps parameterized by the external drive. We rigorously evaluate this approach against a curriculum learning strategy that attempts to generalize from autonomous regimes via training heuristics alone. The novel approach is illustrated on four numerical simulations of benchmark examples: the Duffing, Van der Pol, Lorenz, and Rössler systems.
comment: 18 pages, 6 figures, Accepted in ICLR 2026 AI & PDE Workshop
Sparsity-Promoting Reachability Analysis and Optimization of Constrained Zonotopes
The constrained zonotope is a polytopic set representation widely used for set-based analysis and control of dynamic systems. This paper develops methods to formulate and solve optimization problems for dynamic systems in real time using constrained zonotope reachability analysis. An alternating direction method of multipliers (ADMM) algorithm is presented that makes efficient use of the constrained zonotope structure. To increase the efficiency of the ADMM iterations, reachability calculations are presented that increase the sparsity of the matrices used to define a constrained zonotope when compared to typical methods. The developed methods are used to formulate and solve predictive control, state estimation, and safety verification problems. Numerical results show that optimization times using the proposed approach are competitive with state-of-the-art QP solvers and conventional problem formulations. A combined set-valued state estimation and moving horizon estimation algorithm is presented and experimentally demonstrated in the context of robot localization.
Integrating Conductor Health into Dynamic Line Rating and Unit Commitment under Uncertainty
Dynamic line rating (DLR) enables greater utilization of existing transmission lines by leveraging real-time weather data. However, the elevated temperature operation (ETO) of conductors under DLR is often overlooked, despite its long-term impact on conductor health. This paper addresses this issue by 1) quantifying risk-based depreciation costs associated with ETO and 2) proposing a Conductor Health-Aware Unit Commitment (CHA-UC) that internalizes these costs in operational decisions. CHA-UC incorporates a robust linear approximation of conductor temperature and integration of expected depreciation costs due to hourly ETO into the objective function. Case studies on the Texas 123-bus backbone test system using NOAA weather data demonstrate that the proposed CHA-UC model reduces the total cost by 0.74\% and renewable curtailment by 85\% compared to static line rating (SLR) and outperforms quantile regression forest-based methods, while conventional DLR operation without risk consideration resulted in higher costs due to excessive ETO. Further analysis of the commitment decisions and the line temperature statistics confirms that the CHA-UC achieves safer line flows by shifting generator commitments. Finally, we examine the emergent correlation behaviors arising between wind generation and DLR forecast errors, and show that CHA-UC adaptively manages this effect by relaxing flows for risk-hedging conditions while tightening flows for risk-amplifying ones.
Toward Safe and Energy-Efficient 5G NR V2X Communications in Rural Environments
Connected braking can reduce fatal collisions in connected and autonomous vehicles (CAVs) by using reliable, low-latency 5G New Radio (NR) links, especially NR Sidelink Vehicle-to-Everything (V2X). In rural areas, road side units are sparse and power-constrained, so energy efficiency must be considered alongside safety. This paper studies how three communication control factors, namely subcarrier spacing ($\mathrm{SCS}$), modulation and coding scheme ($\mathrm{MCS}$), and transmit power ($P_{\mathrm{t}}$), should be configured to balance safety and energy consumption in rural scenarios under light and heavy traffic. Safety is quantified by the packet receive ratio ($\mathrm{PRR}$) against the minimum communication distance $D_{\mathrm{comm}}$, defined as the distance that the vehicle travels during the transmission of the safety message. Results show that, under heavy traffic, increasing $P_{\mathrm{t}}$ and selecting a low-rate $\mathrm{MCS}$ at $\mathrm{SCS} = 30$ kHz sustains high $\mathrm{PRR}$ at $D_{\mathrm{comm}}$, albeit with higher energy cost. In light traffic, maintaining lower $P_\mathrm{t}$ with low $\mathrm{MCS}$ levels achieves a favorable reliability-energy trade-off while preserving acceptable $\mathrm{PRR}$ at $D_{\mathrm{comm}}$. These findings demonstrate the necessity of an adaptive, energy-aware strategy to guarantee both safety and energy efficiency in rural V2X systems.
comment: Accepted version
Towards Native AI in 6G Standardization: The Roadmap of Semantic Communication
Semantic communication (SemCom) has emerged as a transformative paradigm for future 6G networks, offering task-oriented and meaning-aware transmission that fundamentally redefines traditional bit-centric design. Recognized by leading standardization bodies including the Institute of Electrical and Electronics Engineers (IEEE) and the International Telecommunication Union (ITU), and actively discussed within the 3rd Generation Partnership Project (3GPP) working groups, SemCom is rapidly gaining traction as a foundational enabler for native-AI 6G. This paper presents a comprehensive overview of recent progress in SemCom from both academic and industrial perspectives, with a focus on its ongoing and upcoming standardization activities. We systematically examine advances in representative application scenarios, architectural design, semantic-traditional system compatibility, unified evaluation metrics, and validation methodologies. Furthermore, we highlight several key enabling technologies that support the practical implementation of SemCom in standard-compliant systems, including joint source-channel coding (JSCC), SemCom-based multiple access (MA) technologies such as model division MA (MDMA), and the semantic knowledge base (KB). Additionally, we present a case study for channel state information (CSI) feedback, illustrating the concrete performance gains of SemCom under 3GPP-compliant fading channels. Finally, we discuss emerging challenges and research opportunities for incorporating semantic-native mechanisms into the evolving 6G standardization landscape, and provide forward-looking insights into its development and global adoption.
Large Scale Robotic Material Handling: Learning, Planning, and Control
Bulk material handling involves the efficient and precise moving of large quantities of materials, a core operation in many industries, including cargo ship unloading, waste sorting, construction, and demolition. These repetitive, labor-intensive, and safety-critical operations are typically performed using large hydraulic material handlers equipped with underactuated grippers. In this work, we present a comprehensive framework for the autonomous execution of large-scale material handling tasks. The system integrates specialized modules for environment perception, pile attack point selection, path planning, and motion control. The main contributions of this work are two reinforcement learning-based modules: an attack point planner that selects optimal grasping locations on the material pile to maximize removal efficiency and minimize the number of scoops, and a robust trajectory following controller that addresses the precision and safety challenges associated with underactuated grippers in movement, while utilizing their free-swinging nature to release material through dynamic throwing. We validate our framework through real-world experiments on a 40 t material handler in a representative worksite, focusing on two key tasks: high-throughput bulk pile management and high-precision truck loading. Comparative evaluations against human operators demonstrate the system's effectiveness in terms of precision, repeatability, and operational safety. To the best of our knowledge, this is the first complete automation of material handling tasks on a full scale.
comment: Final version published in IEEE Transactions on Field Robotics. It includes additional experiments and comparisons with classical methods
Query-Efficient Zeroth-Order Algorithms for Nonconvex Constrained Optimization
Zeroth-order optimization (ZO) has been a powerful framework for solving black-box problems, which estimates gradients using zeroth-order data to update variables iteratively. The practical applicability of ZO critically depends on the efficiency of single-step gradient estimation and the overall query complexities. However, existing constrained ZO algorithms cannot achieve efficiency on both simultaneously. In this work, we consider a general constrained optimization model with black-box objective and constraint functions. To solve it, we propose novel algorithms that can achieve the best-known overall query complexity bound of $\mathcal{O}(d/ε^4)$ to find an $ε$-stationary solution ($d$ is the dimension of variables), while reducing the queries for estimating a single-step gradient from $\mathcal{O}(d)$ to $\mathcal{O}(1)$. Specifically, we integrate block gradient estimators with gradient descent ascent, which leads to two algorithms, ZOB-GDA and ZOB-SGDA, respectively. Instead of constructing full gradients, they estimate only partial gradients along random blocks of dimensions, where the adjustable block sizes enable high single-step efficiency without sacrificing convergence guarantees. Our theoretical results establish the finite-sample convergence of the proposed algorithms for nonconvex optimization. Finally, numerical experiments demonstrate the superior performance of our algorithms compared to existing methods.
comment: 35 pages, 4 figures
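To make the block-estimation idea in the abstract above concrete, here is a minimal sketch of a forward-difference gradient estimate restricted to a random block of coordinates. It is deliberately not ZOB-GDA or ZOB-SGDA: the constraint handling and the descent-ascent updates are omitted, and the estimator is plugged into plain unconstrained descent. The function names, the smoothing radius `mu`, and the step size are assumptions for illustration.

```python
import random


def block_gradient(f, x, block_size, mu=1e-6, rng=random):
    """Forward-difference estimate of the partial gradient on a random block.

    Uses block_size + 1 function queries; coordinates outside the sampled
    block are left at zero. With block_size = 1 the cost per step no longer
    grows with the dimension d.
    """
    d = len(x)
    block = rng.sample(range(d), block_size)
    fx = f(x)
    grad = [0.0] * d
    for i in block:
        x_step = list(x)
        x_step[i] += mu
        grad[i] = (f(x_step) - fx) / mu
    return grad


def zo_block_descent(f, x0, block_size=1, step=0.1, iters=2000, seed=0):
    """Unconstrained descent driven by the block estimator (illustration only)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        g = block_gradient(f, x, block_size, rng=rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x
```

For example, `zo_block_descent(lambda x: sum((xi - 1.0) ** 2 for xi in x), [0.0] * 4)` moves every coordinate toward 1 while querying the objective only twice per iteration.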
Bridging Perception and Planning: Towards End-to-End Planning for Signal Temporal Logic Tasks
We investigate the task and motion planning problem for Signal Temporal Logic (STL) specifications in robotics. Existing STL methods rely on pre-defined maps or mobility representations, which are ineffective in unstructured real-world environments. We propose the Structured-MoE STL Planner (S-MSP), a differentiable framework that maps synchronized multi-view camera observations and an STL specification directly to a feasible trajectory. S-MSP integrates STL constraints within a unified pipeline, trained with a composite loss that combines trajectory reconstruction and STL robustness. A structure-aware Mixture-of-Experts (MoE) model enables horizon-aware specialization by projecting sub-tasks into temporally anchored embeddings. We evaluate S-MSP using a high-fidelity simulation of factory-logistics scenarios with temporally constrained tasks. Experiments show that S-MSP outperforms single-expert baselines in STL satisfaction and trajectory feasibility. A rule-based safety filter at inference improves physical executability without compromising logical correctness, showcasing the practicality of the approach.
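For readers unfamiliar with the STL robustness term mentioned in the abstract above, the following sketch computes the standard quantitative (min/max) semantics for two simple formulas on a discrete one-dimensional signal. The function names and the example formulas are hypothetical, and nothing here reflects how S-MSP itself evaluates robustness; a differentiable pipeline would typically swap the hard min/max for smooth approximations, which is an assumption about common practice rather than something the abstract states.

```python
def rob_always_below(signal, c):
    """Robustness of G (x < c): the worst-case margin to the bound c."""
    return min(c - v for v in signal)


def rob_eventually_within(signal, target, tol, t_start, t_end):
    """Robustness of F_[t_start, t_end] (|x - target| <= tol).

    Takes the best-case margin over the (inclusive) index window.
    """
    window = signal[t_start:t_end + 1]
    return max(tol - abs(v - target) for v in window)


def rob_and(*values):
    """Robustness of a conjunction: the minimum of the sub-formula values."""
    return min(values)
```

A positive value means the formula is satisfied with that much margin, and a negative value means it is violated by that much. For the signal `[0.0, 1.0, 2.0, 3.0]`, "always below 4" has margin 1.0, while "eventually within 0.5 of 3" has margin 0.5 when the window covers the last sample.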
Robotics
Online Generation of Collision-Free Trajectories in Dynamic Environments
In this paper, we present an online method for converting an arbitrary geometric path represented by a sequence of states, generated by any planner (e.g., sampling-based planners like RRT or PRM, search-based planners like ARA*, etc.), into a corresponding kinematically feasible, jerk-limited trajectory. The method generates a sequence of quintic/quartic splines that can be discretized at a user-specified control rate, and then streamed to a low-level robot controller. Our approach enables real-time adaptation to newly captured changes in the environment. It can also be re-invoked at any time instant to generate a new trajectory from the robot's current state to a desired target state or sequence of states. We can guarantee that the trajectory will remain collision-free for a certain amount of time in dynamic environments, while allowing bounded geometric deviation from the original path. The kinematic constraints are taken into account, including limited jerk. We validate the approach in a comparative simulation study against the competing method, demonstrating favorable behavior w.r.t. smoothness, computational time, and real-time performance, particularly in scenarios with frequent changes of target states (up to 1 kHz). Experiments on a real robot demonstrate that the proposed approach can be used in real-world scenarios including human presence.
comment: Submitted to IEEE Robotics and Automation Letters (RA-L)
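The abstract above mentions quintic splines discretized at a control rate. As a hedged illustration of that building block only, the sketch below fits a single one-dimensional quintic segment to boundary position, velocity, and acceleration using the textbook closed form, then samples it at a fixed rate. It does not enforce jerk limits or collision constraints (it merely reports jerk), and all names are hypothetical rather than taken from the paper.

```python
def quintic_coeffs(p0, v0, a0, p1, v1, a1, T):
    """Coefficients c0..c5 of p(t) = sum(c_k * t**k) matching the boundary
    position, velocity and acceleration at t = 0 and t = T."""
    dp = p1 - p0
    c3 = (20 * dp - (8 * v1 + 12 * v0) * T - (3 * a0 - a1) * T**2) / (2 * T**3)
    c4 = (-30 * dp + (14 * v1 + 16 * v0) * T + (3 * a0 - 2 * a1) * T**2) / (2 * T**4)
    c5 = (12 * dp - 6 * (v1 + v0) * T - (a0 - a1) * T**2) / (2 * T**5)
    return [p0, v0, a0 / 2.0, c3, c4, c5]


def evaluate(c, t):
    """Return (position, velocity, acceleration, jerk) of the segment at time t."""
    pos = sum(ck * t**k for k, ck in enumerate(c))
    vel = sum(k * ck * t**(k - 1) for k, ck in enumerate(c) if k >= 1)
    acc = sum(k * (k - 1) * ck * t**(k - 2) for k, ck in enumerate(c) if k >= 2)
    jerk = sum(k * (k - 1) * (k - 2) * ck * t**(k - 3) for k, ck in enumerate(c) if k >= 3)
    return pos, vel, acc, jerk


def sample(c, T, rate_hz):
    """Discretize the segment at a fixed control rate, endpoints included."""
    n = int(round(T * rate_hz))
    return [evaluate(c, k / rate_hz) for k in range(n + 1)]
```

A rest-to-rest move from 0 to 1 over one second yields the familiar coefficients 10, -15, 6 on the cubic, quartic, and quintic terms. A jerk-limited planner would additionally have to choose the segment duration so that the sampled jerk stays within its bound.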
UniHM: Unified Dexterous Hand Manipulation with Vision Language Model ICLR 2026
Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is available at https://unihm.github.io/.
comment: Accepted by ICLR 2026
Keyframe-Guided Structured Rewards for Reinforcement Learning in Long-Horizon Laboratory Robotics
Long-horizon precision manipulation in laboratory automation, such as pipette tip attachment and liquid transfer, requires policies that respect strict procedural logic while operating in continuous, high-dimensional state spaces. However, existing approaches struggle with reward sparsity, multi-stage structural constraints, and noisy or imperfect demonstrations, leading to inefficient exploration and unstable convergence. We propose a Keyframe-Guided Reward Generation Framework that automatically extracts kinematics-aware keyframes from demonstrations, generates stage-wise targets via a diffusion-based predictor in latent space, and constructs a geometric progress-based reward to guide online reinforcement learning. The framework integrates multi-view visual encoding, latent similarity-based progress tracking, and human-in-the-loop reinforcement fine-tuning on a Vision-Language-Action backbone to align policy optimization with the intrinsic stepwise logic of biological protocols. Across four real-world laboratory tasks, including high-precision pipette attachment and dynamic liquid transfer, our method achieves an average success rate of 82% after 40--60 minutes of online fine-tuning. Compared with HG-DAgger (42%) and Hil-ConRFT (47%), our approach demonstrates the effectiveness of structured keyframe-guided rewards in overcoming exploration bottlenecks and providing a scalable solution for high-precision, long-horizon robotic laboratory automation.
Wild-Drive: Off-Road Scene Captioning and Path Planning via Robust Multi-modal Routing and Efficient Large Language Model
Explainability and transparent decision-making are essential for the safe deployment of autonomous driving systems. Scene captioning summarizes environmental conditions and risk factors in natural language, improving transparency, safety, and human--robot interaction. However, most existing approaches target structured urban scenarios; in off-road environments, they are vulnerable to single-modality degradations caused by rain, fog, snow, and darkness, and they lack a unified framework that jointly models structured scene captioning and path planning. To bridge this gap, we propose Wild-Drive, an efficient framework for off-road scene captioning and path planning. Wild-Drive adopts modern multimodal encoders and introduces a task-conditioned modality-routing bridge, MoRo-Former, to adaptively aggregate reliable information under degraded sensing. It then integrates an efficient large language model (LLM), together with a planning token and a gated recurrent unit (GRU) decoder, to generate structured captions and predict future trajectories. We also build the OR-C2P Benchmark, which covers structured off-road scene captioning and path planning under diverse sensor corruption conditions. Experiments on the OR-C2P dataset and a self-collected dataset show that Wild-Drive outperforms prior LLM-based methods and remains more stable under degraded sensing. The code and benchmark will be publicly available at https://github.com/wangzihanggg/Wild-Drive.
Optimal Solutions for the Moving Target Vehicle Routing Problem via Branch-and-Price with Relaxed Continuity ICAPS 2026
The Moving Target Vehicle Routing Problem (MT-VRP) seeks trajectories for several agents that intercept a set of moving targets, subject to speed, time window, and capacity constraints. We introduce an exact algorithm, Branch-and-Price with Relaxed Continuity (BPRC), for the MT-VRP. The main challenge in a branch-and-price approach for the MT-VRP is the pricing subproblem, which is complicated by moving targets and time-dependent travel costs between targets. Our key contribution is a new labeling algorithm that solves this subproblem by means of a novel dominance criterion tailored for problems with moving targets. Numerical results on instances with up to 25 targets show that our algorithm finds optimal solutions more than an order of magnitude faster than a baseline based on previous work, showing particular strength in scenarios with limited agent capacities.
comment: Accepted to ICAPS 2026
Validation of Space Robotics in Underwater Environments via Disturbance Robustness Equivalency
We present an experimental validation framework for space robotics that leverages underwater environments to approximate microgravity dynamics. While neutral buoyancy conditions make underwater robotics an excellent platform for space robotics validation, there are still dynamical and environmental differences that need to be overcome. Given a high-level space mission specification, expressed in terms of a Signal Temporal Logic specification, we overcome these differences via the notion of maximal disturbance robustness of the mission. We formulate the motion planning problem such that the original space mission and the validation mission achieve the same disturbance robustness degree. The validation platform then executes its mission plan using a near-identical control strategy to the space mission where the closed-loop controller considers the spacecraft dynamics. Evaluating our validation framework relies on estimating disturbances during execution and comparing them to the disturbance robustness degree, providing practical evidence of operation in the space environment. Our evaluation features a dual-experiment setup: an underwater robot operating under near-neutral buoyancy conditions to validate the planning and control strategy of either an experimental planar spacecraft platform or a CubeSat in a high-fidelity space dynamics simulator.
comment: 8 pages, 5 figures, 1 table
TGM-VLA: Task-Guided Mixup for Sampling-Efficient and Robust Robotic Manipulation
The performance of robotic imitation learning is fundamentally limited by data quality and training strategies. Prevalent sampling strategies on RLBench suffer from severe keyframe redundancy and imbalanced temporal distribution, leading to inefficient memory usage and unstable optimization. Moreover, reprojecting point clouds onto multi-view images with a black background--while more efficient than voxel-based methods--often causes dark objects to be indistinguishable and hard to manipulate. In this work, we propose a novel holistic framework that significantly improves both model performance and training efficiency. First, we redesign and optimize the keyframe sampling strategy, reducing memory consumption by 80% and accelerating training speed by 5x. Second, we augment the model with a color inversion projection branch--a simple yet effective module that resolves the ambiguity of dark objects. Finally, we propose a task-guided mixup technique that dynamically fuses point clouds and action heatmaps according to task instructions, greatly improving robustness to distractors and performance in multi-goal scenarios. Extensive experiments demonstrate that our method achieves state-of-the-art performance with a 90.5% success rate on RLBench and 68.8% on the COLOSSEUM benchmark under challenging interference conditions. Our code and checkpoints are available at https://github.com/PuFanqi23/TGM-VLA.
comment: 8 pages, 7 figures
I-Perceive: A Foundation Model for Active Perception with Language Instructions
Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. While critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, based on image-based scene contexts. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model, I-Perceive bridges semantic and geometric understanding, thus enabling effective reasoning for active perception. We train I-Perceive on a diverse dataset comprising real-world scene-scanning data and simulation data, both processed via an automated and scalable data generation pipeline. Experiments demonstrate that I-Perceive significantly outperforms state-of-the-art VLMs in both prediction accuracy and instruction following of generated camera views, and exhibits strong zero-shot generalization to novel scenes and tasks.
AI-IO: An Aerodynamics-Inspired Real-Time Inertial Odometry for Quadrotors ICRA 2026
Inertial Odometry (IO) has gained attention in quadrotor applications because it relies solely on inertial measurement units (IMUs), which are lightweight, low-cost, and robust across diverse environments. However, most existing learning-based inertial odometry systems for quadrotors either use only IMU data or include additional dynamics-related inputs such as thrust, but still lack a principled formulation of the underlying physical model to be learned. This lack of interpretability hampers the model's ability to generalize and often limits its accuracy. In this work, we approach the inertial odometry learning problem from a different perspective. Inspired by the aerodynamics model and IMU measurement model, we identify the key physical quantity required for inertial odometry--rotor speed measurements--and design a transformer-based inertial odometry model. By incorporating rotor speed measurements, the proposed model improves velocity prediction accuracy by 36.9%. Furthermore, the transformer architecture more effectively exploits temporal dependencies for denoising and aerodynamics modeling, yielding an additional 22.4% accuracy gain over previous results. To support evaluation, we also provide a real-world quadrotor flight dataset capturing IMU measurements and rotor speed for high-speed motion. Finally, combined with an uncertainty-aware extended Kalman filter (EKF), our framework is validated across multiple datasets and real-time systems, demonstrating superior accuracy, generalization, and real-time performance. We share the code and data to promote further research (https://github.com/SJTU-ViSYS-team/AI-IO).
comment: 8 pages, 8 figures, 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models
Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method -- varying instruction semantics while keeping the tabletop layout fixed -- revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick-and-place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data augmentation can partially close the language gap -- success rate improves from 0% to 90% with single-task training, and 0% to 28% with multi-task training. However, as semantic diversity of extended tasks increases, model learning capacity proves severely insufficient; even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions -- precisely the long-term value of LangGap.
comment: 7 pages, 3 figures. Code and benchmark will be available at https://github.com/YC11Hou/langgap
Planning Method for Skill-Based Control of Robots Using a PLC as Skill Trigger ICME '25
Skill-based programming of robots provides a flexible approach for automation. Existing solutions neglect the optimization of motion sequences, leading to inefficiencies in execution. This work introduces a planning method that enhances skill-based robot programming by integrating motion sequence optimization. This optimization leads to a new MoveContinuousSkill. The software for executing the MoveContinuousSkill is implemented on a Programmable Logic Controller and applied across multiple robotic systems. Experimental results demonstrate a significant improvement in execution time through optimized motion sequences.
comment: 6 pages, 3 figures, 2 tables, submitted to the 19th CIRP Conference on Intelligent Computation in Manufacturing Engineering - CIRP ICME '25, 16-18 July 2025, Ischia (Naples), Italy, has been officially accepted for publication in Procedia CIRP, ISSN: 2212-8271, where the Elsevier's copyright policy applies, and is currently in print
Optimal-Horizon Social Robot Navigation in Heterogeneous Crowds
Navigating social robots in dense, dynamic crowds is challenging due to environmental uncertainty and complex human-robot interactions. While Model Predictive Control (MPC) offers strong real-time performance, its reliance on a fixed prediction horizon limits adaptability to changing environments and social dynamics. Furthermore, most MPC approaches treat pedestrians as homogeneous obstacles, ignoring social heterogeneity and cooperative or adversarial interactions, which often causes the Frozen Robot Problem in partially observable real-world environments. In this paper, we identify the planning horizon as a socially conditioned decision variable rather than a fixed design choice. Building on this insight, we propose an optimal-horizon social navigation framework that optimizes MPC foresight online according to inferred social context. A spatio-temporal Transformer infers pedestrian cooperation attributes from local trajectory observations, which serve as social priors for a reinforcement learning policy that optimally selects the prediction horizon under a task-driven objective. The resulting horizon-aware MPC incorporates socially conditioned safety constraints to balance navigation efficiency and interaction safety. Extensive simulations and real-world robot experiments demonstrate that optimal foresight selection is critical for robust social navigation in partially observable crowds. Compared to state-of-the-art baselines, the proposed approach achieves a 6.8\% improvement in success rate, reduces collisions by 50\%, and shortens navigation time by 19\%, with a low timeout rate of 0.8\%, validating the necessity of socially optimal planning horizons for efficient and safe robot navigation in crowded environments. Code and videos are available at Under Review.
comment: 7 pages, 5 figures
Zero-Shot Robotic Manipulation via 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation
Existing end-to-end approaches to robotic manipulation often lack generalization to unseen objects or tasks due to limited data and poor interpretability. While recent Multimodal Large Language Models (MLLMs) demonstrate strong commonsense reasoning, they struggle with the geometric and spatial understanding required for pose prediction. In this paper, we propose RobMRAG, a 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation (MRAG) framework for zero-shot robotic manipulation. Specifically, we construct a multi-source manipulation knowledge base containing object contact frames, task completion frames, and pose parameters. During inference, a Hierarchical Multimodal Retrieval module first employs a three-priority hybrid retrieval strategy to find task-relevant object prototypes, then selects the geometrically closest reference example based on pixel-level similarity and Instance Matching Distance (IMD). We further introduce a 3D-Aware Pose Refinement module based on 3D Gaussian Splatting into the MRAG framework, which aligns the pose of the reference object to the target object in 3D space. The aligned results are reprojected onto the image plane and used as input to the MLLM to enhance the generation of the final pose parameters. Extensive experiments show that on a test set containing 30 categories of household objects, our method improves the success rate by 7.76% compared to the best-performing zero-shot baseline under the same setting, and by 6.54% compared to the state-of-the-art supervised baseline. Our results validate that RobMRAG effectively bridges the gap between high-level semantic reasoning and low-level geometric execution, enabling robotic systems that generalize to unseen objects while remaining inherently interpretable.
comment: 9 pages, 5 figures
Test-Driven Agentic Framework for Reliable Robot Controller
In this work, we present a test-driven, agentic framework for synthesizing a deployable low-level robot controller for navigation tasks. Given a 2D map with an image of an ultrasonic sensor-based robot, or a 3D robotic simulation environment, our framework iteratively refines the generated controller code using diagnostic feedback from structured test suites to achieve task success. We propose a dual-tier repair strategy to refine the generated code that alternates between prompt-level refinement and direct code editing. We evaluate the approach across 2D navigation tasks and 3D navigation in the Webots simulator. Experimental results show that test-driven synthesis substantially improves controller reliability and robustness over one-shot controller generation, especially when the initial prompt is underspecified. The source code and demonstration videos are available at: https://shivanshutripath.github.io/robotic_controller.github.io.
HydroShear: Hydroelastic Shear Simulation for Tactile Sim-to-Real Reinforcement Learning
In this paper, we address the problem of tactile sim-to-real policy transfer for contact-rich tasks. Existing methods primarily focus on vision-based sensors and emphasize image rendering quality while providing overly simplistic models of force and shear. Consequently, these models exhibit a large sim-to-real gap for many dexterous tasks. Here, we present HydroShear, a non-holonomic hydroelastic tactile simulator that advances the state-of-the-art by modeling: a) stick-slip transitions, b) path-dependent force and shear build up, and c) full SE(3) object-sensor interactions. HydroShear extends hydroelastic contact models using Signed Distance Functions (SDFs) to track the displacements of the on-surface points of an indenter during physical interaction with the sensor membrane. Our approach generates physics-based, computationally efficient force fields from arbitrary watertight geometries while remaining agnostic to the underlying physics engine. In experiments with GelSight Minis, HydroShear more faithfully reproduces real tactile shear compared to existing methods. This fidelity enables zero-shot sim-to-real transfer of reinforcement learning policies across four tasks: peg insertion, bin packing, book shelving for insertion, and drawer pulling for fine gripper control under slip. Our method achieves a 93% average success rate, outperforming policies trained on tactile images (34%) and alternative shear simulation methods (58%-61%).
comment: Project page: https://hydroshear.github.io
TMR-VLA: Vision-Language-Action Model for Magnetic Motion Control of Tri-leg Silicone-based Soft Robot ICRA 2025
In in-vivo environments, magnetically actuated soft robots offer advantages such as wireless operation and precise control, showing promise for painless detection and therapeutic procedures. We developed a tri-leg magnetically driven soft robot (TMR) whose multi-legged design enables more flexible gaits and diverse motion patterns. For this reconfigurable silicone soft robot, navigation can be decomposed into sequential motions such as squatting, rotating, lifting a leg, and walking. Its motion and behavior depend on its bending shapes. To bridge motion-type descriptions and low-level voltage control, we introduce TMR-VLA, an end-to-end multi-modal system for the tri-leg magnetic soft robot that performs hybrid motion types, a step toward navigation by adapting the robot's shape to language-constrained motion types. TMR-VLA builds on the embodied endoluminal localization ability of EndoVLA and takes sequential frames and natural language commands as input. Low-level voltage outputs are generated from the current observation state and the motion-type description. The results show that TMR-VLA can predict how the voltage applied to the TMR will change the dynamics of the silicone soft robot. TMR-VLA reached a 74% average success rate.
comment: ICRA 2025
Decentralized Multi-Robot Obstacle Detection and Tracking in a Maritime Scenario
Autonomous aerial-surface robot teams offer a scalable solution for maritime monitoring, but deployment remains difficult due to water-induced visual artifacts and bandwidth-limited coordination. This paper presents a decentralized multi-robot framework to detect and track floating containers using multiple UAVs cooperating with an autonomous surface vessel. Each UAV runs a YOLOv8 detector augmented with stereo disparity and maintains per-target EKF tracks with uncertainty-aware data association. Robots exchange compact track summaries that are fused conservatively using Covariance Intersection, preserving estimator consistency under unknown cross-correlations. An information-driven allocator assigns targets and selects UAV hover viewpoints by trading off expected uncertainty reduction against travel effort and safety separation. Implemented in ROS, the proposed system is validated in simulations and compared with representative tracking and fusion baselines, showing improved identity continuity and localization accuracy with modest communication overhead.
comment: 8 pages, 10 figures
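The Covariance Intersection fusion named in the abstract above can be illustrated with a minimal sketch. It assumes 2D position estimates, 2x2 covariances, and a grid search that picks the weight minimizing the trace of the fused covariance. The helper names, the grid search, and the trace criterion are illustrative assumptions, not details taken from the paper.

```python
def inv2(m):
    """Invert a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def blend(w, m_a, m_b):
    """Return w * m_a + (1 - w) * m_b for 2x2 matrices."""
    return tuple(
        tuple(w * m_a[i][j] + (1.0 - w) * m_b[i][j] for j in range(2))
        for i in range(2)
    )


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return tuple(m[i][0] * v[0] + m[i][1] * v[1] for i in range(2))


def covariance_intersection(x_a, P_a, x_b, P_b, steps=99):
    """Fuse two estimates whose cross-correlation is unknown.

    The fused information matrix is a convex combination of the two
    input information matrices. The weight w is chosen by grid search
    (an assumption of this sketch) to minimize trace(P).
    """
    info_a, info_b = inv2(P_a), inv2(P_b)
    y_a, y_b = matvec(info_a, x_a), matvec(info_b, x_b)
    best = None
    for k in range(1, steps + 1):
        w = k / (steps + 1)  # candidate weights strictly inside (0, 1)
        P = inv2(blend(w, info_a, info_b))
        trace = P[0][0] + P[1][1]
        if best is None or trace < best[0]:
            y = tuple(w * y_a[i] + (1.0 - w) * y_b[i] for i in range(2))
            best = (trace, matvec(P, y), P, w)
    _, x, P, w = best
    return x, P, w


# Two tracks of the same target with complementary uncertainty.
x, P, w = covariance_intersection(
    (0.0, 0.0), ((1.0, 0.0), (0.0, 4.0)),
    (1.0, 1.0), ((4.0, 0.0), (0.0, 1.0)),
)
```

Unlike a Kalman-style update, the fused covariance here is never smaller than what a convex blend of the information matrices allows, which is what keeps the estimate conservative when two robots may have shared information earlier.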
Developing Fundamental Diagrams for Urban Air Mobility Traffic Based on Physical Experiments
Urban Air Mobility (UAM) is an emerging application of unmanned aerial vehicles that promises to reduce travel time and alleviate congestion in urban transportation systems. As drone density increases, UAM traffic is expected to experience congestion similar to that in ground traffic. However, the fundamental characteristics of UAM traffic, particularly under real-world operating conditions, remain largely unexplored. This study proposes a general framework for constructing the fundamental diagram (FD) of UAM traffic by integrating theoretical analysis with physical experiments. To the best of our knowledge, this is the first study to derive UAM FDs using real-world physical experiment data. On the theoretical side, we design two drone control laws for collision avoidance and develop simulation-based traffic generation methods to produce diverse UAM traffic scenarios. Based on Edie's definition, traffic flow theory is then applied with a near-stationary traffic condition filtering method to construct the FD. To account for real-world disturbances and modeling uncertainties, we further conduct physical experiments on a reduced-scale testbed using Bitcraze Crazyflie drones. Both simulation and physical experiment trajectory data are collected and organized into the UAMTra2Flow dataset, which is analyzed using the proposed framework. Preliminary results indicate that classical FD structures for ground transportation, especially the Underwood model, are applicable to UAM systems. Notably, FD curves obtained from physical experiments exhibit deviations from simulation-based results, highlighting the importance of experimental validation. Finally, results from the reduced-scale testbed are scaled to realistic operating conditions to provide practical insights for future UAM traffic systems. The dataset and code for this paper are publicly available at https://github.com/CATS-Lab/UAM-FD.
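As a rough illustration of the two ingredients named in the abstract above, the sketch below computes Edie's generalized flow, density, and speed over a space-time region, and evaluates the Underwood speed-density relation v(k) = v_f * exp(-k / k_c). It uses the classical one-dimensional corridor form. How the paper extends the region to airspace, and all function and parameter names, are assumptions of this sketch.

```python
import math


def edie_flow_density(distances, times, region_length, region_duration):
    """Edie's generalized definitions over a space-time region.

    distances: distance each vehicle travels inside the region
    times: time each vehicle spends inside the region
    """
    area = region_length * region_duration
    flow = sum(distances) / area     # vehicles per unit time
    density = sum(times) / area      # vehicles per unit length
    speed = flow / density if density > 0 else 0.0
    return flow, density, speed


def underwood_speed(density, free_speed, critical_density):
    """Underwood model: speed decays exponentially with density."""
    return free_speed * math.exp(-density / critical_density)


def underwood_flow(density, free_speed, critical_density):
    """Flow is density times speed; it peaks at the critical density."""
    return density * underwood_speed(density, free_speed, critical_density)


# Two drones crossing a 10 m corridor segment observed for 6 s.
flow, density, speed = edie_flow_density([10.0, 20.0], [2.0, 4.0], 10.0, 6.0)

# Capacity implied by the Underwood model is v_f * k_c / e.
capacity = underwood_flow(0.5, 8.0, 0.5)
```

Under this model the maximum flow occurs at k = k_c, so fitting v_f and k_c to measured (density, speed) pairs gives a capacity estimate directly.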
Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning ICLR 2026
Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.
comment: Accepted by ICLR 2026
TwinRL-VLA: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and insufficient real-world interaction. While online reinforcement learning (RL) has shown promise in improving general foundation models, applying RL to VLA manipulation in real-world settings is still hindered by low exploration efficiency and a restricted exploration space. Through systematic real-world experiments, we observe that the effective exploration space of online RL is closely tied to the data distribution of supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative RL framework designed to scale and guide exploration for VLA models. First, a high-fidelity digital twin is efficiently reconstructed from smartphone-captured scenes, enabling realistic bidirectional transfer between real and simulated environments. During the SFT warm-up stage, we introduce an exploration space expansion strategy using digital twins to broaden the support of the data trajectory distribution. Building on this enhanced initialization, we propose a sim-to-real guided exploration strategy to further accelerate online RL. Specifically, TwinRL performs efficient and parallel online RL in the digital twin prior to deployment, effectively bridging the gap between offline and online training stages. Subsequently, we exploit efficient digital twin sampling to identify failure-prone yet informative configurations, which are used to guide targeted human-in-the-loop rollouts on the real robot. In our experiments, TwinRL approaches 100% success in both in-distribution regions covered by real-world demonstrations and out-of-distribution regions, delivering at least a 30% speedup over prior real-world RL methods and requiring only about 20 minutes on average across four tasks.
MorphArtGrasp: Morphology-Aware Cross-Embodiment Dexterous Hand Articulation Generation for Grasping
Dexterous grasping with multi-fingered hands remains challenging due to high-dimensional articulations and the cost of optimization-based pipelines. Existing end-to-end methods require training on large-scale datasets for specific hands, limiting their ability to generalize across different embodiments. We propose MorphArtGrasp, an eigengrasp-based, end-to-end framework for cross-embodiment grasp generation. From a hand's morphology description, we derive a morphology embedding and an eigengrasp set. Conditioned on these, together with the object point cloud and wrist pose, an amplitude predictor regresses articulation coefficients in a low-dimensional space, which are decoded into full joint articulations. Articulation learning is supervised with a Kinematic-Aware Articulation Loss (KAL) that emphasizes fingertip-relevant motions and injects morphology-specific structure. In simulation on unseen objects across three dexterous hands, MorphArtGrasp attains a 91.9% average grasp success rate with less than 0.4 seconds inference per grasp. With few-shot adaptation to an unseen hand, it achieves 85.6% success on unseen objects in simulation, and real-world experiments on this few-shot-generalized hand achieve an 87% success rate. The code and additional materials are available on our project website https://connor-zh.github.io/MorphArtGrasp.
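The decoding step described in the abstract above, from low-dimensional coefficients to full joint articulations, can be sketched with the classical linear eigengrasp form: joint angles are a mean pose plus an amplitude-weighted sum of basis vectors. The mean pose, the clipping to joint limits, and every name below are assumptions for illustration. The abstract does not specify how MorphArtGrasp derives or decodes its eigengrasps.

```python
def decode_articulation(amplitudes, eigengrasps, mean_pose, lower, upper):
    """Map low-dimensional amplitudes to full joint angles.

    amplitudes: one coefficient per eigengrasp
    eigengrasps: list of basis vectors, each with one entry per joint
    mean_pose, lower, upper: per-joint mean angle and joint limits
    """
    joints = list(mean_pose)
    for amplitude, basis in zip(amplitudes, eigengrasps):
        for j, component in enumerate(basis):
            joints[j] += amplitude * component
    # Clip to joint limits (an assumption of this sketch).
    return [min(max(q, lo), hi) for q, lo, hi in zip(joints, lower, upper)]


# Hypothetical 3-joint hand with two eigengrasps.
joints = decode_articulation(
    amplitudes=[0.2, 0.4],
    eigengrasps=[[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]],
    mean_pose=[0.5, 0.5, 0.5],
    lower=[0.0, 0.0, 0.0],
    upper=[1.0, 1.0, 1.0],
)
```

Because the predictor only has to output as many numbers as there are eigengrasps, the same regression head can in principle serve hands with different joint counts, with the basis carrying the hand-specific structure.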
Embodied intelligent industrial robotics: Framework and techniques
The combination of embodied intelligence and robots has great prospects and is becoming increasingly common. In order to work more efficiently, accurately, reliably, and safely in industrial scenarios, robots should have at least general knowledge, working-environment knowledge, and operating-object knowledge. These requirements pose significant challenges to existing embodied intelligent robotics (EIR) techniques. Thus, this paper first briefly reviews the history of industrial robotics and analyzes the limitations of mainstream EIR frameworks. Then, a new knowledge-driven technical framework of embodied intelligent industrial robotics (EIIR) is proposed for various industrial environments. It has five modules: a world model, a high-level task planner, a low-level skill controller, a simulator, and a physical system. The development of techniques related to each module is also thoroughly reviewed, and recent progress in adapting them to industrial applications is discussed. A case study of a real-world assembly system is given to demonstrate the applicability and potential of the newly proposed EIIR framework. Finally, the key challenges that EIIR encounters in industrial scenarios are summarized and future research directions are suggested. The authors believe that EIIR technology is shaping the next generation of industrial robotics and that EIIR-based industrial systems supply a new technological paradigm for intelligent manufacturing. It is expected that this review can serve as a valuable reference for scholars and engineers who are interested in industrial embodied intelligence. Together, scholars can use this research to drive the rapid advancement and application of EIIR techniques. The authors will continue to track and contribute new studies on the project page https://github.com/jackyzengl/EIIR
comment: 71 pages, 13 figures. The associated project can be found at https://github.com/jackyzengl/EIIR
Beyond Frame-wise Tracking: A Trajectory-based Paradigm for Efficient Point Cloud Tracking ICRA 2026
LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous systems. Existing methods typically follow either a frame-wise motion estimation paradigm or a sequence-based paradigm. Two-frame methods are efficient but lack long-term temporal context, making them vulnerable in sparse or occluded scenes, while sequence-based methods that process multiple point clouds gain robustness at a significant computational cost. To resolve this dilemma, we propose a novel trajectory-based paradigm and its instantiation, TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame tracker by implicitly learning motion continuity from historical bounding box trajectories alone, without requiring additional, costly point cloud inputs. It first generates a fast, explicit motion proposal and then uses an implicit motion modeling module to predict the future trajectory, which in turn refines and corrects the initial proposal. Extensive experiments on the large-scale NuScenes benchmark show that TrajTrack achieves new state-of-the-art performance, improving tracking precision by 3.02% over a strong baseline while running at 55 FPS. We also demonstrate the strong generalizability of TrajTrack across different base trackers. Code is available at https://github.com/FiBonaCci225/TrajTrack.
comment: Accepted to ICRA 2026
High-Performance Dual-Arm Task and Motion Planning for Tabletop Rearrangement ICRA 2026
We propose Synchronous Dual-Arm Rearrangement Planner (SDAR), a task and motion planning (TAMP) framework for tabletop rearrangement, where two robot arms equipped with 2-finger grippers must work together in close proximity to rearrange objects whose start and goal configurations are strongly entangled. To tackle such challenges, SDAR tightly knits together its dependency-driven task planner (SDAR-T) and synchronous dual-arm motion planner (SDAR-M) to intelligently sift through a large number of possible task and motion plans. Specifically, SDAR-T applies a simple yet effective strategy to decompose the global object dependency graph induced by the rearrangement task, producing better dual-arm task plans than solutions derived from optimal task plans for a single arm. Leveraging state-of-the-art GPU SIMD-based motion planning tools, SDAR-M employs a layered motion planning strategy to sift through many task plans for the best synchronous dual-arm motion plan while ensuring a high success rate. Comprehensive evaluation demonstrates that SDAR delivers a 100% success rate in solving complex, non-monotone, long-horizon tabletop rearrangement tasks with solution quality far exceeding the previous state-of-the-art. Experiments on two UR-5e arms further confirm that SDAR directly and reliably transfers to robot hardware. Source code and supplementary materials are available at https://github.com/arc-l/dual-arm.
comment: ICRA 2026 Submission
Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Training resource-constrained autonomous agents on multiple tasks simultaneously is crucial for adapting to diverse real-world environments. Recent works employ reinforcement learning (RL) approaches, but they still suffer from sub-optimal multi-task performance due to task interference. State-of-the-art works employ Spiking Neural Networks (SNNs) to improve RL-based multi-task learning and enable low-power/energy operations through network enhancements and spike-driven data stream processing. However, they rely on fixed task-switching intervals during training, which limits their performance and scalability. To address this, we propose SwitchMT, a novel methodology that employs adaptive task-switching for effective, scalable, and simultaneous multi-task learning. SwitchMT employs the following key ideas: (1) leveraging a Deep Spiking Q-Network with active dendrites and a dueling structure that utilizes task-specific context signals to create specialized sub-networks; and (2) devising an adaptive task-switching policy that leverages both rewards and internal dynamics of the network parameters. Experimental results demonstrate that SwitchMT achieves competitive scores in multiple Atari games (i.e., Pong: -8.8, Breakout: 5.6, and Enduro: 355.2) and longer game episodes as compared to the state-of-the-art. These results also highlight the effectiveness of the SwitchMT methodology in addressing task interference without increasing the network complexity, enabling intelligent autonomous agents with scalable multi-task learning capabilities.
comment: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC), July 26-29, 2026 in Long Beach, CA, USA
SLAP: Shortcut Learning for Abstract Planning ICLR
Long-horizon decision-making with sparse rewards and continuous states and actions remains a fundamental challenge in AI and robotics. Task and motion planning (TAMP) is a model-based framework that addresses this challenge by planning hierarchically with abstract actions (options). These options are manually defined, limiting the agent to behaviors that we as human engineers know how to program (pick, place, move). In this work, we propose Shortcut Learning for Abstract Planning (SLAP), a method that leverages existing TAMP options to automatically discover new ones. Our key idea is to use model-free reinforcement learning (RL) to learn shortcuts in the abstract planning graph induced by the existing options in TAMP. Without any additional assumptions or inputs, shortcut learning leads to shorter solutions than pure planning, and higher task success rates than flat and hierarchical RL. Qualitatively, SLAP discovers dynamic physical improvisations (e.g., slap, wiggle, wipe) that differ significantly from the manually-defined ones. In experiments in four simulated robotic environments, we show that SLAP solves and generalizes to a wide range of tasks, reducing overall plan lengths by over 50% and consistently outperforming planning and RL baselines.
comment: Published at the International Conference on Learning Representations (ICLR) 2026. Code available at https://github.com/isabelliu0/SLAP
Robust Differentiable Collision Detection for General Objects
Collision detection is a core component of robotics applications such as simulation, control, and planning. Traditional algorithms like GJK+EPA compute witness points (i.e., the closest or deepest-penetration pairs between two objects) but are inherently non-differentiable, preventing gradient flow and limiting gradient-based optimization in contact-rich tasks such as grasping and manipulation. Recent work introduced efficient first-order randomized smoothing to make witness points differentiable; however, their direction-based formulation is restricted to convex objects and lacks robustness for complex geometries. In this work, we propose a robust and efficient differentiable collision detection framework that supports both convex and concave objects across diverse scales and configurations. Our method introduces distance-based first-order randomized smoothing, adaptive sampling, and equivalent gradient transport for robust and informative gradient computation. Experiments on complex meshes from DexGraspNet and Objaverse show significant improvements over existing baselines. Finally, we demonstrate a direct application of our method for dexterous grasp synthesis to refine the grasp quality. The code is available at https://github.com/JYChen18/DiffCollision.
Design and Control of a Compact Series Elastic Actuator Module for Robots in MRI Scanners
Robotic assistance has broadened the capabilities of magnetic resonance imaging (MRI)-guided medical interventions, yet force-controlled actuators tailored for MRI environments remain limited. In this study, we present a novel MRI-compatible rotary series elastic actuator (SEA) module that employs velocity-sourced ultrasonic motors for force-controlled operation within MRI scanners. Unlike prior MRI-compatible SEA designs, our module uses a transmission force sensing SEA architecture, with four off-the-shelf compression springs placed between the gearbox and motor housings. To enable precise torque control, we develop a controller based on a disturbance observer, specifically designed for velocity-sourced motors. This controller improves torque regulation, even under varying external impedance, enhancing the actuator's suitability for MRI-guided medical interventions. Experimental validation confirms effective torque control in both 3 Tesla MRI and non-MRI settings, achieving a 5% settling time of 0.05 seconds and steady-state error within 2.5% of the actuator's maximum output torque. Notably, the controller maintains consistent performance across both low and high impedance conditions.
ExtremControl: Low-Latency Humanoid Teleoperation with Direct Extremity Control
Building a low-latency humanoid teleoperation system is essential for collecting diverse reactive and dynamic demonstrations. However, existing approaches rely on heavily pre-processed human-to-humanoid motion retargeting and position-only PD control, resulting in substantial latency that severely limits responsiveness and prevents tasks requiring rapid feedback and fast reactions. To address this problem, we propose ExtremControl, a low latency whole-body control framework that: (1) operates directly on SE(3) poses of selected rigid links, primarily humanoid extremities, to avoid full-body retargeting; (2) utilizes a Cartesian-space mapping to directly convert human motion to humanoid link targets; and (3) incorporates velocity feedforward control at low level to support highly responsive behavior under rapidly changing control interfaces. We further provide a unified theoretical formulation of ExtremControl and systematically validate its effectiveness through experiments in both simulation and real-world environments. Building on ExtremControl, we implement a low-latency humanoid teleoperation system that supports both optical motion capture and VR-based motion tracking, achieving end-to-end latency as low as 50ms and enabling highly responsive behaviors such as ping-pong ball balancing, juggling, and real-time return, thereby substantially surpassing the 200ms latency limit observed in prior work.
comment: Project website: https://extremcontrol.github.io/
Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition ICRA 2026
Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA's mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: http://xjh19971.github.io/QAA.
comment: 8 pages, 4 figures, accepted at ICRA 2026
Multiagent Systems
NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code CVPR 2026
The proliferation of neural radiance field (NeRF) research requires significant efforts to reimplement papers before building upon them. We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. NERFIFY achieves domain-specific executability through six key innovations: (1) Context-free grammar (CFG): LLM synthesis is constrained by Nerfstudio formalized as a CFG, ensuring generated code satisfies architectural invariants. (2) Graph-of-Thought code synthesis: Specialized multi-file-agents generate repositories in topological dependency order, validating contracts and errors at each node. (3) Compositional citation recovery: Agents automatically retrieve and integrate components (samplers, encoders, proposal networks) from citation graphs of references. (4) Visual feedback: Artifacts are diagnosed through PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching to iteratively improve quality. (5) Knowledge enhancement: Beyond reproduction, methods can be improved with novel optimizations. (6) Benchmarking: An evaluation framework is designed for NeRF paper-to-code synthesis across 30 diverse papers. On papers without public implementations, NERFIFY achieves visual quality matching expert human code (+/-0.5 dB PSNR, +/-0.2 SSIM) while reducing implementation time from weeks to minutes. NERFIFY demonstrates that a domain-aware design enables code translation for complex vision papers, potentiating accelerated and democratized reproducible research. Code, data and implementations will be publicly released.
comment: Accepted to CVPR 2026. Project page: https://seemandhar.github.io/NERFIFY/
MO-MIX: Multi-Objective Multi-Agent Cooperative Decision-Making With Deep Reinforcement Learning
Deep reinforcement learning (RL) has been applied extensively to solve complex decision-making problems. In many real-world scenarios, tasks have several conflicting objectives and may require multiple agents to cooperate, giving rise to multi-objective multi-agent decision-making problems. However, only a few works have addressed this intersection. Existing approaches are limited to separate fields and can only handle multi-agent decision-making with a single objective, or multi-objective decision-making with a single agent. In this paper, we propose MO-MIX to solve the multi-objective multi-agent reinforcement learning (MOMARL) problem. Our approach is based on the centralized training with decentralized execution (CTDE) framework. A weight vector representing preference over the objectives is fed into the decentralized agent network as a condition for local action-value function estimation, while a mixing network with parallel architecture is used to estimate the joint action-value function. In addition, an exploration guide approach is applied to improve the uniformity of the final non-dominated solutions. Experiments demonstrate that the proposed method can effectively solve the multi-objective multi-agent cooperative decision-making problem and generate an approximation of the Pareto set. Our approach not only significantly outperforms the baseline method in all four kinds of evaluation metrics, but also incurs lower computational cost.
comment: 15 pages, 10 figures, published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety
Multi-agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight.
Developing Fundamental Diagrams for Urban Air Mobility Traffic Based on Physical Experiments
Urban Air Mobility (UAM) is an emerging application of unmanned aerial vehicles that promises to reduce travel time and alleviate congestion in urban transportation systems. As drone density increases, UAM traffic is expected to experience congestion similar to that in ground traffic. However, the fundamental characteristics of UAM traffic, particularly under real-world operating conditions, remain largely unexplored. This study proposes a general framework for constructing the fundamental diagram (FD) of UAM traffic by integrating theoretical analysis with physical experiments. To the best of our knowledge, this is the first study to derive UAM FDs using real-world physical experiment data. On the theoretical side, we design two drone control laws for collision avoidance and develop simulation-based traffic generation methods to produce diverse UAM traffic scenarios. Based on Edie's definition, traffic flow theory is then applied with a near-stationary traffic condition filtering method to construct the FD. To account for real-world disturbances and modeling uncertainties, we further conduct physical experiments on a reduced-scale testbed using Bitcraze Crazyflie drones. Both simulation and physical experiment trajectory data are collected and organized into the UAMTra2Flow dataset, which is analyzed using the proposed framework. Preliminary results indicate that classical FD structures for ground transportation, especially the Underwood model, are applicable to UAM systems. Notably, FD curves obtained from physical experiments exhibit deviations from simulation-based results, highlighting the importance of experimental validation. Finally, results from the reduced-scale testbed are scaled to realistic operating conditions to provide practical insights for future UAM traffic systems. The dataset and code for this paper are publicly available at https://github.com/CATS-Lab/UAM-FD.
InnoGym: Benchmarking the Innovation Potential of AI Agents ICLR 2026
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
comment: ICLR 2026
LightMem: Lightweight and Efficient Memory-Augmented Generation ICLR 2026
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information by topic. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.
comment: ICLR 2026
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity: agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty: MAS are deployed without understanding whether there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.
comment: Preprint; Work in Progress
Systems and Control (EESS)
Quantitative Monitoring of Signal First-Order Logic
Runtime monitoring checks, during execution, whether a partial signal produced by a hybrid system satisfies its specification. Signal First-Order Logic (SFO) offers expressive real-time specifications over such signals, but currently comes only with Boolean semantics and has no tool support. We provide the first robustness-based quantitative semantics for SFO, enabling the expression and evaluation of rich real-time properties beyond the scope of existing formalisms such as Signal Temporal Logic. To enable online monitoring, we identify a past-time fragment of SFO and give a pastification procedure that transforms bounded-response SFO formulas into equisatisfiable formulas in this fragment. We then develop an efficient runtime monitoring algorithm for this past-time fragment and evaluate its performance on a set of benchmarks, demonstrating the practicality and effectiveness of our approach. To the best of our knowledge, this is the first publicly available prototype for online quantitative monitoring of full SFO.
comment: Full version of the FM 2026 paper
A Closed-loop Framework to Discriminate Models Using Optimal Control
Predicting the response of an observed system to a known input is a fruitful first step to accurately control the system's dynamics. Despite the recent advances in fully data-driven algorithms, the most interpretable way to reach this goal is through mechanistic mathematical modeling. Here, we leverage optimal control and propose a closed-loop iterative method to choose among a set of candidate models the one that most accurately predicts an observed system. We assume that one has control over an input of the observed system and access to measurements of its response. Our approach is to identify the input control that maximally discriminates the response of the candidate models, allowing us to determine which model is best by comparing such responses with the observed data. We demonstrate our proposed framework in numerical simulations before applying it during an electrophysiology experiment, successfully discriminating between different models for photocurrents produced via opsin dynamics.
comment: 13 pages, 9 figures
Precision Switching Schedule for Efficient Control Implementations
Modern cyber-physical systems, such as automotive control, rely on feedback controllers that regulate the system towards a desired setpoint. In practice, however, the controller must also be scheduled efficiently on resource-constrained processors, where the choice of numerical precision for controller implementation directly affects both control quality and computational cost. This trade-off is critical: higher precision improves control performance but increases runtime, while lower precision executes faster in the processor but may degrade overall system performance. In this work, we propose the first approach for a precision switching schedule, where the controller switches between different floating-point precisions to balance control performance and enhance computational efficiency. We formulate this problem as a multi-objective optimization, expressed as a Mixed-Integer Quadratic Program (MIQP) with sound linearizations and error bounds that capture roundoff effects from different precision implementations. Our method efficiently computes a switching schedule that ensures the system output remains within a specified reference band. Through experimental evaluation on standard benchmark control systems, we demonstrate that switching between 32-bit and 16-bit floating-point implementations offers an average runtime reduction of 26.5% compared to 32-bit execution and a 27.6% improvement in control performance over 16-bit execution, while maintaining near-optimal overall performance.
Depth-adapted adaptive optics for three-photon microscopy
Three-photon (3-P) fluorescence microscopy enables deep in vivo imaging with subcellular resolution, but its performance is fundamentally constrained by the maximum permissible laser power required to avoid tissue heating and photodamage. Under these power-limited conditions, fluorescence signal generation, image contrast, and achievable imaging depth are strongly affected by the illumination beam profile and aberration correction strategy. In this paper, we showed that using a fixed illumination beam size was suboptimal across different imaging depths. We further showed that conventional Zernike-based adaptive optics (AO) correction degrades under reduced Gaussian illumination beam sizes due to loss of modal orthogonality. This degradation results in slow convergence, unintended focal and field-of-view shifts, and excessive wavefront deformations. To overcome these limitations, we introduced a depth-adapted AO framework in which both the illumination beam profile and the aberration correction basis were dynamically matched to the imaging conditions. By combining depth-optimised beam underfilling with a bespoke set of illumination-matched aberration modes, we achieved faster and more stable AO convergence, enhanced fluorescence signal and image quality during deep in vivo multi-channel neuroimaging. Together, these results established a practical and robust AO-enabled three-photon microscopy strategy that maximised imaging performance under realistic power constraints.
Time Stepped Cyber Physical Simulation of DoS, DoD, and FDI Attacks on the IEEE 14 Bus System
Reliable grid operation depends on accurate and timely telemetry, making modern power systems vulnerable to communication-layer cyberattacks. This paper evaluates how Denial of Service (DoS), Denial of Data (DoD), and False Data Injection (FDI) attacks disrupt the IEEE 14-bus system using a MATLAB-only, time-stepped simulation framework built on MATPOWER. The framework emulates a 24-hour operating cycle with sinusoidal load variation, introduces attack-specific manipulation of load and voltage data, and performs full AC power flow solves with reactive limit enforcement (PV-PQ switching). At each timestep, the system logs true and measured voltages, generator P/Q output, system losses, and voltage limit violations to capture transient cyber-physical effects. Results show that DoD causes the largest physical distortions and reactive power stress, DoS masks natural variability and degrades situational awareness, and FDI creates significant discrepancies between true and perceived voltages. The study provides a compact, reproducible benchmark for analyzing cyber-induced instability and informing future defense strategies.
comment: Accepted to IEEE SoutheastCon 2026
Enhanced Hydrogen Electrolyzer with Integrated Energy Storage to Provide Grid-Forming Services for Off-Grid ReP2H Application
This article proposes an energy storage-enhanced hydrogen electrolyzer (ESEHE) to provide grid-forming (GFM) services for off-grid renewable power to hydrogen (ReP2H) systems. Unlike conventional ReP2H systems that use a centralized energy storage (ES) plant, the proposed topology directly connects batteries to the DC buses of electrolysis rectifiers. A tailored virtual synchronous machine (VSM) control framework enables the electrolyzer to autonomously provide real and reactive power support. A coordinated frequency-splitting energy extraction strategy is designed to exploit both the battery and the electrolysis stack's electrical double-layer (EDL) effect on different timescales, maximizing active power support while mitigating battery and stack degradation. An adaptive equalization control strategy is further developed to balance the battery state of charge (SOC) among multiple ESEHEs operating in parallel, which optimizes energy distribution and extends battery life. Real-time simulations on StarSim validate the proposed topology and control strategies. Techno-economic analysis shows that, compared with conventional off-grid ReP2H systems based on a centralized ES plant, the ESEHE improves overall energy efficiency by 0.23% and reduces the initial total converter investment cost by roughly 6%, mainly due to the elimination of bidirectional AC/DC conversion and its associated losses in the centralized ES plant.
Integrated Guidance and Control for Path-Following with Bounded Inputs
Precise motion control of underactuated surface vessels is a crucial task in various maritime applications. In this work, we develop a nonlinear motion control strategy for surface vessels inspired by the pursuit guidance philosophy. Any sufficiently smooth path can be seen as a continuum of virtual targets moving along a specified path, which the pursuer is trying to catch. Contrary to the traditional path-following methods, this work develops an integrated guidance and control approach capable of following any smooth path (unlike the ones composed of a finite number of straight lines and circles). The approach relies on steering the vehicle such that its velocity vector aligns with the line-of-sight (the line joining the moving virtual target and the surface vessel), resulting in a tail-chase scenario. This leads to a path-following behavior. This integrated approach also overcomes the disadvantages inherent in the traditional two-loop-based approaches. Additionally, the proposed work takes into account the asymmetric actuator constraints in the design, which makes the design close to realistic scenarios. Furthermore, the control law has been derived within a nonlinear framework using sliding mode, and thus remains applicable for a wider envelope. The stability of the proposed control strategy is formally proven. Numerical simulations for various specified paths validate the controller's accurate path-following performance.
comment: 30 pages, 9 figures
Test-Driven Agentic Framework for Reliable Robot Controller
In this work, we present a test-driven, agentic framework for synthesizing a deployable low-level robot controller for navigation tasks. Given a 2D map with an image of an ultrasonic sensor-based robot, or a 3D robotic simulation environment, our framework iteratively refines the generated controller code using diagnostic feedback from structured test suites to achieve task success. We propose a dual-tier repair strategy to refine the generated code that alternates between prompt-level refinement and direct code editing. We evaluate the approach across 2D navigation tasks and 3D navigation in the Webots simulator. Experimental results show that test-driven synthesis substantially improves controller reliability and robustness over one-shot controller generation, especially when the initial prompt is underspecified. The source code and demonstration videos are available at: https://shivanshutripath.github.io/robotic_controller.github.io.
Curtail Renewables to Enhance Flexibility: A Regulated Forecast-based Dispatch Approach
This paper considers the flexibility degradation problem caused by excessive flexible ramping product (FRP) requirements with high variable energy resource (VER) penetration. Based on the rolling-window co-optimization model of energy and FRP, the theoretical analysis in this paper reveals a unit dispatch transfer effect, in which high FRP requirements under forecast-based dispatch (FBD) constrain real-time flexibility and distort economic efficiency. To alleviate this effect, a regulated forecast-based dispatch (RFBD) approach is proposed, which moderately caps VER outputs and enhances system flexibility. Simulation results demonstrate that the proposed approach effectively lowers FRP requirements and reduces operating cost compared with FBD.
TMR-VLA: Vision-Language-Action Model for Magnetic Motion Control of Tri-leg Silicone-based Soft Robot ICRA 2025
In in-vivo environments, magnetically actuated soft robots offer advantages such as wireless operation and precise control, showing promising potential for painless detection and therapeutic procedures. We developed a trileg magnetically driven soft robot (TMR) whose multi-legged design enables more flexible gaits and diverse motion patterns. For reconfigurable soft robots made of silicone, navigation can be decomposed into sequential motions such as squatting, rotating, lifting a leg, and walking. The robot's motion and behavior depend on its bending shapes. To bridge motion-type descriptions and specific low-level voltage control, we introduced TMR-VLA, an end-to-end multi-modal system for a trileg magnetic soft robot capable of performing hybrid motion types, which is promising for developing navigation ability by adapting the robot's shape to language-constrained motion types. TMR-VLA deploys the embodied endoluminal localization ability of EndoVLA and fuses sequential frames and natural language commands as input. Low-level voltage output is generated based on the current observation state and the specific motion type description. The results show that TMR-VLA can predict how the voltage applied to the TMR will change the dynamics of a silicone-made soft robot. TMR-VLA reached a 74% average success rate.
comment: ICRA 2025
Grid Integration of AI Data Centers: A Critical Review of Energy Storage Solutions
Artificial intelligence (AI) is driving a rapid expansion of data centers (DCs). These facilities consume large amounts of electricity and introduce new challenges for power systems. AI workloads cause rapid power changes and high peak demand. These behaviors are different from traditional data centers (TDCs) and can affect grid stability and reliability. This paper reviews how energy storage systems (ESSs) can help integrate AI DCs with the electric grid. We examine storage solutions at multiple levels, including grid-scale batteries, UPS systems, rack-level storage, and chip-level buffering. Each layer operates at a different time scale and serves a different purpose. Grid-interactive UPS (GiUPS) systems can respond quickly to disturbances and assist with frequency regulation or voltage ride-through. Large battery energy storage systems (BESSs) can smooth power demand, support on-site renewable generation, and provide grid services. Rack-level and server-level storage help manage fast power fluctuations close to computing hardware. We also discuss other technologies such as fuel cells (FCs) and thermal energy storage (TE) that can support co-generation and reduce emissions. In addition, second-life battery energy storage systems (SLBESSs) are reviewed as a lower-cost option for large installations, whether supporting UPS batteries or serving as backup generation. The paper compares the benefits, challenges, and coordination requirements of these solutions. Overall, the study provides a structured view of how energy storage can improve reliability, flexibility, and sustainability when connecting future AI data centers to the power grid.
comment: 21 pages, 8 figures, 3 tables
Adaptive Channel Estimation and Hybrid Beamforming for RIS aided Vehicular Communication
Reconfigurable intelligent surface (RIS) constitutes a disruptive technology for enhancing vehicular communication performance through reconfigurable propagation environments. In this paper, we propose an adaptive channel estimation framework and hybrid beamforming optimization strategy for RIS-aided vehicular multiple-input multiple-output (MIMO) systems operating in high-mobility scenarios. To address severe Doppler effects and rapid channel variations, we design a velocity-aware pilot scheme that progressively estimates cascaded channels across two timescales, leveraging tensor decomposition and adaptive grouping of passive elements. This framework dynamically balances channel estimation accuracy and spectral efficiency, significantly reducing training overhead. Furthermore, we develop a low-complexity hybrid beamforming algorithm for both narrowband single vehicle user equipment (VUE) and broadband multi-VUE systems. For single-VUE scenarios, we derive closed-form active beamforming solutions and optimize passive beamforming via alternating optimization. For multi-VUE broadband systems, we jointly optimize subcarrier allocation, power distribution, and beamforming to maximize system throughput while mitigating inter-carrier interference (ICI) caused by Doppler spread, subject to quality-of-service (QoS) constraints and RIS hardware limitations. Our simulation results demonstrate that the proposed methods achieve substantial performance gains in channel estimation efficiency, beamforming robustness, and system throughput compared to conventional schemes, particularly under high mobility conditions.
Hereditary Geometric Meta-RL: Nonlocal Generalization via Task Symmetries
Meta-Reinforcement Learning (Meta-RL) commonly generalizes via smoothness in the task encoding. While this enables local generalization around each training task, it requires dense coverage of the task space and leaves richer task space structure untapped. In response, we develop a geometric perspective that endows the task space with a "hereditary geometry" induced by the inherent symmetries of the underlying system. Concretely, the agent reuses a policy learned at train time by transforming states and actions through actions of a Lie group. This converts Meta-RL into symmetry discovery rather than smooth extrapolation, enabling the agent to generalize to wider regions of the task space. We show that when the task space is inherited from the symmetries of the underlying system, the task space embeds into a subgroup of those symmetries whose actions are linearizable, connected, and compact--properties that enable efficient learning and inference at test time. To learn these structures, we develop a differential symmetry discovery method. This collapses functional invariance constraints and thereby improves numerical stability and sample efficiency over functional approaches. Empirically, on a two-dimensional navigation task, our method efficiently recovers the ground-truth symmetry and generalizes across the entire task space, while a common baseline generalizes only near training tasks.
comment: Accepted to 2026 American Control Conference
Uncertainty-Aware Grid Planning in the Real World: A Method Enabling Large-Scale, Two-Stage Adaptive Robust Optimization for Capacity Expansion Planning
Capacity expansion models are frequently used to inform multi-billion dollar grid infrastructure decisions, a context in which there is significant uncertainty surrounding the future need for and performance of such infrastructure. However, despite much academic literature on the topic, virtually no grid planning processes use capacity expansion models that endogenously consider uncertainty, an oversight which frequently leads to short-sighted infrastructure decisions. This is partially due to a technology transfer gap, but it is also due to a lack of methods that work at large scale. In this paper we introduce a method for endogenizing uncertainty into capacity expansion models, a variant of adaptive robust optimization, that addresses this gap. We apply the method to a real-world capacity expansion planning problem, that of the State of California, and compare its performance to that of traditional adaptive robust optimization. We find that both the traditional method and our method identify increased transmission investment as a key lever for increasing robustness and adaptability, while helping to avoid downside risks that current deterministic planning processes may be exposing ratepayers to. Our method performs similarly to the traditional method in terms of outcomes, while significantly reducing computational complexity, making it scalable to real-world planning problems.
comment: Preprint submitted to INFORMS Journal on Optimization
Behavioral Generative Agents for Energy Operations
Problem definition: Accurately modeling consumer behavior in energy operations is challenging due to uncertainty, behavioral heterogeneity, and limited empirical data, particularly in low-frequency, high-impact events. While generative AI trained on large-scale human data offers new opportunities to study decision behavior, its role in operational applications remains unclear. We examine how generative agents can support customer behavior discovery in energy operations, complementing rather than replacing human-based experiments. Methodology/results: We introduce a novel approach leveraging generative agents (artificial agents powered by large language models) to simulate sequential customer decisions under dynamic electricity prices and outage risks. We find that these agents behave more optimally and rationally in simpler market scenarios, while their performance becomes more variable and suboptimal as task complexity rises. Furthermore, the agents exhibit heterogeneous customer preferences, consistently maintaining distinct, persona-driven reasoning patterns in both operational decisions and textual reasoning. Comparisons with dynamic programming and greedy policy benchmarks show alignment between specific personas and distinct heuristic decision policies. In low-frequency, high-impact events such as blackouts, agents prioritize energy reliability over cost or profit, demonstrating their ability to uncover behavioral patterns beyond the rigidity of traditional mathematical models. Managerial Implications: Our findings suggest that behavioral generative agents can serve as scalable and flexible tools for studying consumer behavior in energy operations. By enabling controlled experiments across heterogeneous customer types and rare events, these agents can enhance the design of energy management systems and support more informed analysis of energy policies and incentive programs.
Release Date Optimization in MRP Using Clearing Functions
This paper integrates a clearing function (CF)-based release planning approach into Material Requirements Planning (MRP) to address its limitations in modeling capacity constraints and dynamic lead times. The proposed optimization model replaces MRP's backward scheduling step while preserving its overall structure. Performance is evaluated through simulation experiments on two flow shop systems that explore a range of demand uncertainties and utilization levels. Computational results show that the proposed approach is capable of yielding significant improvements over the conventional backward scheduling approach, due to its ability to compute planned lead times for individual production orders as opposed to BOM items.
Traffic-Aware Grid Planning for Dynamic Wireless Electric Vehicle Charging
Dynamic Wireless Electric Vehicle Charging (DWC) on electrified roadways is an emerging technology that can significantly reduce battery sizes, eliminate charging downtime, and alleviate range anxiety, especially for long-haul transportation and fleet operations of electric vehicles (EVs). However, these systems introduce new challenges for power system planning due to their short-duration, high-power demands, which can strain the grid if not properly managed. As the energy demands from DWC depend on vehicle speed, density, dwell time in charging zones, and load profiles along road segments, there is a need for integrated planning of such systems, jointly considering both traffic behavior and EV energy consumption. In this paper, we propose a traffic-aware grid planning framework for DWC. We leverage a macroscopic Cell Transmission Model of traffic flow to estimate real-time, spatiotemporal EV charging demand from DWC corridors. The demand model is then integrated into an AC Optimal Power Flow based formulation to optimally size a microgrid that supports DWC under varying traffic conditions while minimizing the cost of operation. Our framework explicitly models how spatiotemporal traffic patterns affect the utilization of grid resources to obtain system designs that achieve lower costs and are easier to operationalize than planning models that rely on worst-case traffic data. We demonstrate the framework on data from a 14-mile segment of the I-210W highway in California, USA, evaluating multiple traffic scenarios such as free-flow, severe congestion, accidents of varying severity, and natural disasters such as forest fires. Our results demonstrate that traffic-aware grid planning significantly reduces infrastructure costs compared to worst-case-scenario modeling, while ensuring reliability of service in terms of meeting charging demands under diverse traffic conditions.
comment: There are certain major changes to the formulation expressed in the paper, which needs to be properly addressed before I can resubmit it. Will share the updated version as soon as those changes are done
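To make concrete how a Cell Transmission Model can feed a charging-demand estimate, as the abstract above describes, here is a minimal sketch of one textbook CTM update step followed by a per-cell power estimate. It is not the paper's formulation (which the authors' comment says is under revision): the single-lane corridor without ramps, the free-outflow downstream boundary, the assumption that every cell starts at or below jam occupancy, and the names `ctm_step`, `charging_demand_kw`, `ev_share`, and `kw_per_ev` are all hypothetical choices for illustration.

```python
def ctm_step(n, inflow, q_max, n_jam, w_over_v):
    """One Cell Transmission Model update.

    n        : vehicles currently in each cell (upstream to downstream)
    inflow   : vehicles wanting to enter the first cell this step
    q_max    : maximum vehicles that can cross a cell boundary per step
    n_jam    : maximum vehicles a cell can hold (assumes n[i] <= n_jam)
    w_over_v : ratio of backward wave speed to free-flow speed
    """
    cells = len(n)
    # y[i] is the flow entering cell i; y[cells] is the flow leaving the corridor.
    y = [0.0] * (cells + 1)
    y[0] = min(inflow, q_max, w_over_v * (n_jam - n[0]))
    for i in range(1, cells):
        # Sending flow of the upstream cell vs. receiving capacity of this cell.
        y[i] = min(n[i - 1], q_max, w_over_v * (n_jam - n[i]))
    y[cells] = min(n[-1], q_max)  # free outflow at the downstream end (assumed)
    return [n[i] + y[i] - y[i + 1] for i in range(cells)]

def charging_demand_kw(n, ev_share, kw_per_ev):
    """Per-cell charging power if a fixed share of vehicles draws kw_per_ev each."""
    return [count * ev_share * kw_per_ev for count in n]

# Ten vehicles queued in the first of three cells, no new arrivals.
state = ctm_step([10.0, 0.0, 0.0], inflow=0.0, q_max=4.0, n_jam=20.0, w_over_v=0.5)
demand = charging_demand_kw(state, ev_share=0.5, kw_per_ev=50.0)
```

Iterating `ctm_step` over a day of inflows would give a spatiotemporal demand profile of the kind that could serve as load input to a power flow model. A real study would also need speed-dependent energy transfer and dwell time, which this sketch leaves out.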
Robotics
SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named SafeGen-LLM. SafeGen-LLM can not only enhance the safety satisfaction of task plans but also generalize well to novel safety properties in various domains. We first construct a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints. Then, we introduce a two-stage post-training framework: Supervised Fine-Tuning (SFT) on a constraint-compliant planning dataset to learn planning syntax and semantics, and Group Relative Policy Optimization (GRPO) guided by fine-grained reward machines derived from formal verification to enforce safety alignment and by curriculum learning to better handle complex tasks. Extensive experiments show that SafeGen-LLM achieves strong safety generalization and outperforms frontier proprietary baselines across multi-domain planning tasks and multiple input formats (e.g., PDDLs and natural language).
comment: 12 pages, 6 figures
Evaluating Accuracy of Vine Robot Shape Sensing with Distributed Inertial Measurement Units
Soft, tip-extending vine robots are well suited for navigating tight, debris-filled environments, making them ideal for urban search and rescue. Sensing the full shape of a vine robot's body is helpful both for localizing information from other sensors placed along the robot body and for determining the robot's configuration within the space being explored. Prior approaches have localized vine robot tips using a single inertial measurement unit (IMU) combined with force sensing or length estimation, while one method demonstrated full-body shape sensing using distributed IMUs on a passively steered robot in controlled maze environments. However, the accuracy of distributed IMU-based shape sensing under active steering, varying robot lengths, and different sensor spacings has not been systematically quantified. In this work, we experimentally evaluate the accuracy of vine robot shape sensing using distributed IMUs along the robot body. We quantify IMU drift, measuring an average orientation drift rate of 1.33 degrees/min across 15 sensors. For passive steering, mean tip position error was 11% of robot length. For active steering, mean tip position error increased to 16%. During growth experiments across lengths from 30-175 cm, mean tip error was 8%, with a positive trend with increasing length. We also analyze the influence of sensor spacing and observe that intermediate spacings can minimize error for single-curvature shapes. These results demonstrate the feasibility of distributed IMU-based shape sensing for vine robots while highlighting key limitations and opportunities for improved modeling and algorithmic integration for field deployment.
How IMU Drift Influences Multi-Radar Inertial Odometry for Ground Robots in Subterranean Terrains ICRA
Reliable radar inertial odometry (RIO) requires mitigating IMU bias drift, a challenge that intensifies in subterranean environments due to extreme temperatures and gravity-induced accelerations. Cost-effective IMUs such as the Pixhawk, when paired with FMCW TI IWR6843AOP EVM radars, suffer from drift-induced degradation compounded by sparse, noisy, and flickering radar returns, making fusion less stable than LiDAR-based odometry. Yet, LiDAR fails under smoke, dust, and aerosols, whereas FMCW radars remain compact, lightweight, cost-effective, and robust in these situations. To address these challenges, we propose a two-stage multi-radar inertial odometry (MRIO) framework that incorporates an IMU bias estimator for resilient localization and mapping in GPS-denied subterranean environments affected by smoke. Radar-based ego-velocity estimation is formulated through a least-squares approach and incorporated into an EKF for online IMU bias correction; the corrected IMU accelerations are fused with heterogeneous measurements from multiple radars and an IMU to refine odometry. The proposed framework further supports radar-only mapping by exploiting the robot's estimated translational and rotational displacements. In subterranean field trials, MRIO delivers robust localization and mapping, outperforming EKF-RIO. It maintains accuracy across cost-efficient FMCW radar setups and different IMUs, showing resilience with Pixhawk and higher-grade units such as VectorNav. The implementation will be provided as an open-source resource to the community (code available at https://github.com/LTU-RAI/MRIO).
comment: Accepted in IEEE International Conference on Robotics and Automation (ICRA), 2026
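The least-squares ego-velocity step named in the abstract above can be illustrated with a minimal sketch. It is not the MRIO implementation: it assumes a static scene, a single radar, the common Doppler convention in which a static target at unit direction d yields a radial velocity of -d·v, and no outlier rejection. The function name `estimate_ego_velocity` and its arguments are hypothetical.

```python
import numpy as np


def estimate_ego_velocity(points, doppler):
    """Least-squares sensor velocity from one radar scan (illustrative only).

    points  : (N, 3) target positions in the radar frame.
    doppler : (N,) measured radial velocities, assumed to obey
              doppler_i = -d_i . v for static targets (assumed sign convention).
    """
    points = np.asarray(points, dtype=float)
    doppler = np.asarray(doppler, dtype=float)
    # Unit line-of-sight direction to every target.
    directions = points / np.linalg.norm(points, axis=1, keepdims=True)
    # Solve (-D) v = doppler in the least-squares sense.
    velocity, *_ = np.linalg.lstsq(-directions, doppler, rcond=None)
    return velocity
```

In practice such an estimate would typically be wrapped in a robust scheme (for example RANSAC) to reject moving targets and flickering returns before it is passed to a filter; that part is omitted here.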
Humanoid Robots as First Assistants in Endoscopic Surgery
Humanoid robots have become a focal point of technological ambition, with claims of surgical capability within years in mainstream discourse. These projections are aspirational yet lack empirical grounding. To date, no humanoid has assisted a surgeon through an actual procedure, let alone performed one. The work described here breaks this new ground. Here we report a proof of concept in which a teleoperated Unitree G1 provided endoscopic visualization while an attending otolaryngologist performed a cadaveric sphenoidectomy. The procedure was completed successfully, with stable visualization maintained throughout. Teleoperation allowed assessment of whether the humanoid form factor could meet the physical demands of surgical assistance in terms of sustenance and precision; the cognitive demands were satisfied -- for now -- by the operator. Post-procedure analysis identified engineering targets for clinical translation, alongside near-term opportunities such as autonomous diagnostic scoping. This work establishes form-factor feasibility for humanoid surgical assistance while identifying challenges for continued development.
Robust Skills, Brittle Grounding: Diagnosing Restricted Generalization in Vision-Language Action Policies via Multi-Object Picking
Vision-language action (VLA) policies often report strong manipulation benchmark performance with relatively few demonstrations, but it remains unclear whether this reflects robust language-to-object grounding or reliance on object--location correlations that do not transfer beyond the training distribution. We present a controlled multi-object picking study that progressively increases object placement variability up to full workspace randomization and evaluates held-out object--location pairings that break familiar associations without increasing spatial difficulty. Across these stress tests and data scaling, we find that for representative VLA policies, including SmolVLA and $π_{0.5}$, execution of the manipulation primitive remains substantially more reliable than instruction-conditioned task success in harder regimes, suggesting that manipulation skill acquisition is decoupled from instruction following. We recommend augmenting manipulation benchmarks with task ladders and decomposed metrics that separately measure primitive execution and instruction-conditioned success to better diagnose instruction-grounded generalization.
Planning from Observation and Interaction
Observational learning requires an agent to learn to perform a task by referencing only observations of the performed task. This work investigates the equivalent setting in real-world robot learning where access to hand-designed rewards and demonstrator actions is not assumed. To address this data-constrained setting, this work presents a planning-based Inverse Reinforcement Learning (IRL) algorithm for world modeling from observation and interaction alone. Experiments conducted entirely in the real world demonstrate that this paradigm is effective for learning image-based manipulation tasks from scratch in under an hour, without assuming prior knowledge, pre-training, or data of any kind beyond task observations. Moreover, this work demonstrates that the learned world model representation is capable of online transfer learning in the real world from scratch. In comparison to existing approaches, including IRL, RL, and Behavior Cloning (BC), which have more restrictive assumptions, the proposed approach demonstrates significantly greater sample efficiency and success rates, enabling a practical path forward for online world modeling and planning from observation and interaction. Videos and more at: https://uwrobotlearning.github.io/mpail2/.
Geometry-based pneumatic actuators for soft robotics
Soft pneumatic actuators enable safe human-machine interaction with lightweight and powerful applied parts. On the other hand, they suffer from design limitations regarding complex actuation patterns, including minimum bending radii, multi-state capabilities, and structural stability. We present geometry-based pneumatic actuators (GPAs), a design and implementation approach that introduces constraint layers with configurable CNC heat-sealed chambers. The approach achieves predictable deformation, near-zero bending radii, and multi-state actuation, and enables customizable and repeatable complex actuated geometries. Mathematical modeling reveals predictable linear angle transformations and validates nonlinear torque-angle relationships across diverse configurations. We demonstrate the versatility of the GPA approach through three applications: a 49 g wrist exoskeleton reducing muscle activity by up to 51%, a 30.8 g haptic interface delivering 8 N force feedback with fast response, and a 208 g bipedal robot achieving multi-gait locomotion. GPAs establish a configurable platform for next-generation wearable robotics, haptic systems, and soft locomotion devices.
Curriculum Reinforcement Learning for Quadrotor Racing with Random Obstacles
Autonomous drone racing has attracted increasing interest as a research topic for exploring the limits of agile flight. However, existing studies primarily focus on obstacle-free racetracks, while the perception and dynamic challenges introduced by obstacles remain underexplored, often resulting in low success rates and limited robustness in real-world flight. To this end, we propose a novel vision-based curriculum reinforcement learning framework for training a robust controller capable of addressing unseen obstacles in drone racing. We combine multi-stage curriculum learning, domain randomization, and a multi-scene updating strategy to address the conflicting challenges of obstacle avoidance and gate traversal. Our end-to-end control policy is implemented as a single network, allowing high-speed flight of quadrotors in environments with variable obstacles. Both hardware-in-the-loop and real-world experiments demonstrate that our method achieves faster lap times and higher success rates than existing approaches, effectively advancing drone racing in obstacle-rich environments. The video and code are available at: https://github.com/SJTU-ViSYS-team/CRL-Drone-Racing.
Autonomous Inspection of Power Line Insulators with UAV on an Unmapped Transmission Tower
This paper introduces an online inspection algorithm that enables an autonomous UAV to fly around a transmission tower and obtain detailed inspection images without a prior map of the tower. Our algorithm relies on camera-LiDAR sensor fusion for online detection and localization of insulators. In particular, the algorithm is based on insulator detection using a convolutional neural network, projection of LiDAR points onto the image, and filtering them using the bounding boxes. The detection pipeline is coupled with several proposed insulator localization methods based on DBSCAN, RANSAC, and PCA algorithms. The performance of the proposed online inspection algorithm and camera-LiDAR sensor fusion pipeline is demonstrated through simulation and real-world flights. In simulation, we showed that our single-flight inspection strategy can save up to 24% of total inspection time, compared to the two-flight strategy of scanning the tower and afterwards visiting the inspection waypoints in the optimal way. In a real-world experiment, the best performing proposed method achieves a mean horizontal and vertical localization error for the insulator of 0.16 ± 0.08 m and 0.16 ± 0.11 m, respectively. Compared to the most relevant approach, the proposed method achieves more than an order of magnitude lower variance in horizontal insulator localization error.
comment: 8 pages, 9 figures
Learning Robust Control Policies for Inverted Pose on Miniature Blimp Robots
The ability to achieve and maintain inverted poses is essential for unlocking the full agility of miniature blimp robots (MBRs). However, developing reliable control methods for MBRs remains challenging due to their complex and underactuated dynamics. To address this challenge, we propose a novel framework that enables robust control policy learning for inverted pose on MBRs. The proposed framework operates through three core stages: First, a high-fidelity three-dimensional (3D) simulation environment was constructed, which was calibrated against real-world MBR motion data to ensure accurate replication of inverted-state dynamics. Second, a robust policy for MBR inverted control was trained within the simulation environment via a domain randomization strategy and a modified Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Third, a mapping layer was designed to bridge the sim-to-real gap for the learned policy deployment. Comprehensive evaluations in the simulation environment demonstrate that the learned policy achieves a higher success rate compared to the energy-shaping controller. Furthermore, experimental results confirm that the learned policy with a mapping layer enables an MBR to achieve and maintain a fully upside-down pose in real-world settings.
Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graphs to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion to address coarse-grained instructions and long-horizon reasoning in the VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (i.e., LLaVa, GPT4), we convert unstructured video streams into structured semantic-action-effect events that serve as explicit episodic memory. Second, we introduce STE-VLN, which integrates the above graph into VLN models via a Coarse-to-Fine Hierarchical Retrieval mechanism. This allows agents to retrieve causal event sequences and dynamically fuse them with egocentric visual observations. Experiments on REVERIE, R2R, and R2R-CE benchmarks demonstrate the efficiency of our event-centric strategy, outperforming state-of-the-art approaches across diverse action spaces. Our data and code are available on the project website https://sites.google.com/view/y-event-kg/.
Learning to Build: Autonomous Robotic Assembly of Stable Structures Without Predefined Plans
This paper presents a novel autonomous robotic assembly framework for constructing stable structures without relying on predefined architectural blueprints. Instead of following fixed plans, construction tasks are defined through targets and obstacles, allowing the system to adapt more flexibly to environmental uncertainty and variations during the building process. A reinforcement learning (RL) policy, trained using deep Q-learning with successor features, serves as the decision-making component. As a proof of concept, we evaluate the approach on a benchmark of 15 2D robotic assembly tasks of discrete block construction. Experiments using a real-world closed-loop robotic setup demonstrate the feasibility of the method and its ability to handle construction noise. The results suggest that our framework offers a promising direction for more adaptable and robust robotic construction in real-world environments.
Teleoperated Omni-directional Dual Arm Mobile Manipulation Robotic System with Shared Control for Retail Store
The swiftly expanding retail sector is increasingly adopting autonomous mobile robots empowered by artificial intelligence and machine learning algorithms to gain an edge in the competitive market. However, these autonomous robots encounter challenges in adapting to the dynamic nature of retail products, often struggling to operate autonomously in novel situations. In this study, we introduce an omni-directional dual-arm mobile robot specifically tailored for use in retail environments. Additionally, we propose a tele-operation method that enables shared control between the robot and a human operator. This approach utilizes a Virtual Reality (VR) motion capture system to capture the operator's commands, which are then transmitted to the robot located remotely in a retail setting. Furthermore, the robot is equipped with heterogeneous grippers on both manipulators, facilitating the handling of a wide range of items. We validate the efficacy of the proposed system through testing in a mockup of a retail environment, demonstrating its ability to manipulate various commonly encountered retail items using both single and dual-arm coordinated manipulation techniques.
comment: This work has been accepted for publication in the Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2024). © IEEE. The final version is available via IEEE Xplore
ABPolicy: Asynchronous B-Spline Flow Policy for Real-Time and Smooth Robotic Manipulation
Robotic manipulation requires policies that are smooth and responsive to evolving observations. However, synchronous inference in the raw action space introduces several challenges, including intra-chunk jitter, inter-chunk discontinuities, and stop-and-go execution. These issues undermine a policy's smoothness and its responsiveness to environmental changes. We propose ABPolicy, an asynchronous flow-matching policy that operates in a B-spline control-point action space. First, the B-spline representation ensures intra-chunk smoothness. Second, we introduce bidirectional action prediction coupled with refitting optimization to enforce inter-chunk continuity. Finally, by leveraging asynchronous inference, ABPolicy delivers real-time, continuous updates. We evaluate ABPolicy across seven tasks encompassing both static settings and dynamic settings with moving objects. Empirical results indicate that ABPolicy reduces trajectory jerk, leading to smoother motion and improved performance. Project website: https://teee000.github.io/ABPolicy/.
TSC: Topology-Conditioned Stackelberg Coordination for Multi-Agent Reinforcement Learning in Interactive Driving
Safe and efficient autonomous driving in dense traffic is fundamentally a decentralized multi-agent coordination problem, where interactions at conflict points such as merging and weaving must be resolved reliably under partial observability. With only local and incomplete cues, interaction patterns can change rapidly, often causing unstable behaviors such as oscillatory yielding or unsafe commitments. Existing multi-agent reinforcement learning (MARL) approaches either adopt synchronous decision-making, which exacerbates non-stationarity, or depend on centralized sequencing mechanisms that scale poorly as traffic density increases. To address these limitations, we propose Topology-conditioned Stackelberg Coordination (TSC), a learning framework for decentralized interactive driving under communication-free execution, which extracts a time-varying directed priority graph from braid-inspired weaving relations between trajectories, thereby defining local leader-follower dependencies without constructing a global order of play. Conditioned on this graph, TSC endogenously factorizes dense interactions into graph-local Stackelberg subgames and, under centralized training and decentralized execution (CTDE), learns a sequential coordination policy that anticipates leaders via action prediction and trains followers through action-conditioned value learning to approximate local best responses, improving training stability and safety in dense traffic. Experiments across four dense traffic scenarios show that TSC achieves superior performance over representative MARL baselines across key metrics, most notably reducing collisions while maintaining competitive traffic efficiency and control smoothness.
comment: 12 pages, 8 figures
AoE: Always-on Egocentric Human Video Collection for Embodied AI
Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. In fact, humans themselves are ideal physically embodied agents. Therefore, obtaining egocentric real-world interaction data from globally distributed "human agents" offers advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which aims to simplify hardware dependencies by leveraging humans themselves and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile APP that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed Ego video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.
Altitude-Aware Visual Place Recognition in Top-Down View
To address the challenge of aerial visual place recognition (VPR) under significant altitude variations, this study proposes an altitude-adaptive VPR approach that integrates ground feature density analysis with image classification techniques. The proposed method estimates airborne platforms' relative altitude by analyzing the density of ground features in images, then applies relative altitude-based cropping to generate canonical query images, which are subsequently used in a classification-based VPR strategy for localization. Extensive experiments across diverse terrains and altitude conditions demonstrate that the proposed approach achieves high accuracy and robustness in both altitude estimation and VPR under significant altitude changes. Compared to conventional methods relying on barometric altimeters or Time-of-Flight (ToF) sensors, this solution requires no additional hardware and offers a plug-and-play solution for downstream applications, making it suitable for small- and medium-sized airborne platforms operating in diverse environments, including rural and urban areas. Under significant altitude variations, incorporating our relative altitude estimation module into the VPR retrieval pipeline boosts average R@1 and R@5 by 29.85% and 60.20%, respectively, compared with applying VPR retrieval alone. Furthermore, compared to traditional Monocular Metric Depth Estimation (MMDE) methods, the proposed method reduces the mean error by 202.1 m, yielding average additional improvements of 31.4% in R@1 and 44% in R@5. These results demonstrate that our method establishes a robust, vision-only framework for three-dimensional visual place recognition, offering a practical and scalable solution for accurate airborne platform localization under large altitude variations and limited sensor availability.
Hybrid Offline-Online Reinforcement Learning for Sensorless, High-Precision Force Regulation in Surgical Robotic Grasping
Precise grasp force regulation in tendon-driven surgical instruments is fundamentally limited by nonlinear coupling between motor dynamics, transmission compliance, friction, and distal mechanics. Existing solutions typically rely on distal force sensing or analytical compensation, increasing hardware complexity or degrading performance under dynamic motion. We present a sensorless control framework that combines physics-consistent modeling and hybrid reinforcement learning to achieve high-precision distal force regulation in a proximally actuated surgical end-effector. We develop a first-principles digital twin of the da Vinci Xi grasping mechanism that captures coupled electrical, transmission, and jaw dynamics within a unified differential-algebraic formulation. To safely learn control policies in this stiff and highly nonlinear system, we introduce a three-stage pipeline: (i) a receding-horizon CMA-ES oracle that generates dynamically feasible expert trajectories, (ii) fully offline policy learning via Implicit Q-Learning to ensure stable initialization without unsafe exploration, and (iii) online refinement using TD3 for adaptation to on-policy dynamics. The resulting policy directly maps proximal measurements to motor voltages and requires no distal sensing. In simulation, the controller maintains grasp force within 1% of the desired reference during multi-harmonic jaw motion. Hardware experiments demonstrate average force errors below 4% across diverse trajectories, validating sim-to-real transfer. The learned policy contains approximately 71k parameters and executes at kHz rates, enabling real-time deployment. These results demonstrate that high-fidelity modeling combined with structured offline-online RL can recover precise distal force behavior without additional sensing, offering a scalable and mechanically compatible solution for surgical robotic manipulation.
OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control
High-fidelity motion tracking serves as the ultimate litmus test for generalizable, human-level motor skills. However, current policies often hit a "generality barrier": as motion libraries scale in diversity, tracking fidelity inevitably collapses - especially for real-world deployment of high-dynamic motions. We identify this failure as the result of two compounding factors: the learning bottleneck in scaling multi-motion optimization and the physical executability constraints that arise in real-world actuation. To overcome these challenges, we introduce OmniXtreme, a scalable framework that decouples general motor skill learning from sim-to-real physical skill refinement. Our approach uses a flow-matching policy with high-capacity architectures to scale representation capacity without interference-intensive multi-motion RL optimization, followed by an actuation-aware refinement phase that ensures robust performance on physical hardware. Extensive experiments demonstrate that OmniXtreme maintains high-fidelity tracking across diverse, high-difficulty datasets. On real robots, the unified policy successfully executes multiple extreme motions, effectively breaking the long-standing fidelity-scalability trade-off in high-dynamic humanoid control.
OmniTrack: General Motion Tracking via Physics-Consistent Reference
Learning motion tracking from rich human motion data is a foundational task for achieving general control in humanoid robots, enabling them to perform diverse behaviors. However, discrepancies in morphology and dynamics between humans and robots, combined with data noise, introduce physically infeasible artifacts in reference motions, such as floating and penetration. During both training and execution, these artifacts create a conflict between following inaccurate reference motions and maintaining the robot's stability, hindering the development of a generalizable motion tracking policy. To address these challenges, we introduce OmniTrack, a general tracking framework that explicitly decouples physical feasibility from general motion tracking. In the first stage, a privileged generalist policy generates physically plausible motions that strictly adhere to the robot's dynamics via trajectory rollout in simulation. In the second stage, the general control policy is trained to track these physically feasible motions, ensuring stable and coherent control transfer to the real robot. Experiments show that OmniTrack improves tracking accuracy and demonstrates strong generalization to unseen motions. In real-world tests, OmniTrack achieves hour-long, consistent, and stable tracking, including complex acrobatic motions such as flips and cartwheels. Additionally, we show that OmniTrack supports human-style stable and dynamic online teleoperation, highlighting its robustness and adaptability to varying user inputs.
comment: website: https://omnitrack-humanoid.github.io/
Acceleration-Based Control of Fixed-Wing UAVs for Guidance Applications
Acceleration-commanded guidance laws (e.g., proportional navigation) are attractive for high-level decision making, but their direct deployment on fixed-wing UAVs is challenging because accelerations are not directly actuated and must be realized through attitude and thrust under flight-envelope constraints. This paper presents an acceleration-level outer-loop control framework that converts commanded tangential and normal accelerations into executable body-rate and normalized thrust commands compatible with mainstream autopilots (e.g., PX4/APM). For the normal channel, we derive an engineering mapping from the desired normal acceleration to roll- and pitch-rate commands that regulate the direction and magnitude of the lift vector under small-angle assumptions. For the tangential channel, we introduce an energy-based formulation inspired by total energy control and identify an empirical thrust-energy acceleration relationship directly from flight data, avoiding explicit propulsion modeling or thrust bench calibration. We further discuss priority handling between normal and tangential accelerations under saturation and non-level maneuvers. Extensive real-flight experiments on a VTOL fixed-wing platform demonstrate accurate acceleration tracking and enable practical implementation of proportional navigation using only body-rate and normalized thrust interfaces.
StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation
Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct mappings from 2D visual inputs to action sequences, without explicitly modeling the underlying 3D spatial structure or temporal world dynamics. Such representations may limit spatial reasoning and long-horizon decision-making in dynamic environments. To address this limitation, we propose StemVLA, a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D spatiotemporal representations into action prediction. First, instead of relying solely on observed images, StemVLA forecasts structured 3D future spatial-geometric world knowledge, enabling the model to anticipate upcoming scene geometry and object configurations. Second, to capture temporal consistency and motion dynamics, we feed historical image frames into a pretrained video-geometry transformer backbone to extract implicit 3D world representations, and further aggregate them across time using a temporal attention module, termed VideoFormer [20], forming a unified 4D historical spatiotemporal representation. By jointly modeling 2D observations, predicted 3D future structure, and aggregated 4D temporal dynamics, StemVLA enables more comprehensive world understanding for robot manipulation. Extensive experiments in simulation demonstrate that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], achieving an average sequence length of XXX.
comment: Preprint
SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision
In dynamic UAV decision-making, complex and variable hazardous factors pose severe challenges to the generalization capability of algorithms. Despite offering semantic understanding and scene generalization, Large Language Models (LLMs) lack domain-specific UAV control knowledge and formal safety assurances, restricting their direct applicability. To bridge this gap, this paper proposes a training-free two-layer decision architecture based on LLMs, integrating high-level safety planning with low-level precise control. The framework introduces three key contributions: 1) A fuzzy Control Barrier Function verification mechanism for semantically-augmented actions, providing provable safety certification for LLM outputs. 2) A star-hierarchical graph-based retrieval-augmented generation system, enabling efficient, elastic, and interpretable scene adaptation. 3) Systematic experimental validation in pursuit-evasion scenarios with unknown obstacles and emergent threats, demonstrating that our SAGE-LLM maintains performance while significantly enhancing safety and generalization without online training. The proposed framework demonstrates strong extensibility, suggesting its potential for generalization to broader embodied intelligence systems and safety-critical control domains.
A Reliable Indoor Navigation System for Humans Using AR-based Technique
Reliable navigation systems are not available indoors, such as in campuses and small areas. Users must depend on confusing, time-consuming static signage or floor maps. In this paper, an AR-based technique has been applied to campus and small-site navigation, where Vuforia Area Target is used for environment modeling. AI navigation's NavMesh component is used for navigation purposes, and the A* algorithm is used within this component for shortest path calculation. Compared to Dijkstra's algorithm, it can reach a solution about two to three times faster for smaller search spaces. In many cases, Dijkstra's algorithm has difficulty performing well in high-complexity environments where memory usage grows and processing times increase. Compared to older approaches such as GPS, real-time processing and AR overlays can be combined to provide intuitive directions for users while dynamically updating the path in response to environmental changes. Experimental results indicate significantly improved navigation accuracy, better user experience, and greater efficiency compared to traditional methods. These results show that AR technology integrated with existing pathfinding algorithms is feasible and scalable, making it a user-friendly solution for indoor navigation. Although highly effective in limited and defined indoor spaces, further optimization of NavMesh is required for large or highly dynamic environments.
comment: 6 pages, 6 figures, 2 tables, Presented at 7th International Conference on Advances in Science and Technology (ICAST 2024-25)
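The A*-versus-Dijkstra comparison in the abstract above can be illustrated on a toy grid. The following is a minimal sketch, not the NavMesh implementation used in the paper: the grid, the function name `grid_search`, and the Manhattan heuristic are all assumptions chosen for illustration. Setting the heuristic to zero turns the same routine into Dijkstra's algorithm, so the two can be compared by the number of nodes they expand.

```python
import heapq

def grid_search(grid, start, goal, use_heuristic):
    """Shortest path on a 4-connected grid; returns (path_cost, nodes_expanded).

    grid[r][c] == 0 is free, 1 is blocked. With use_heuristic=False the
    heuristic is zero, which reduces A* to Dijkstra's algorithm.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        if not use_heuristic:
            return 0
        # Manhattan distance: admissible and consistent on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}
    frontier = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    closed = set()
    expanded = 0
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell in closed:
            continue  # stale queue entry
        closed.add(cell)
        expanded += 1
        if cell == goal:
            return g, expanded
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return None, expanded

# Hypothetical floor plan: 0 = walkable, 1 = wall.
GRID = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
astar_cost, astar_expanded = grid_search(GRID, (0, 0), (4, 5), use_heuristic=True)
dijkstra_cost, dijkstra_expanded = grid_search(GRID, (0, 0), (4, 5), use_heuristic=False)
```

Both searches return the same path cost. With a consistent heuristic, A* never expands more nodes than Dijkstra's algorithm, though how much it saves depends on the map.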
Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
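One way to picture late fusion based on the log-likelihood ratio is sketched below. This is an illustrative assumption, not the paper's formulation: each modality is assumed to output per-class probabilities, the per-class LLR is taken as log(p / (1 - p)), and the fused score is the sum over modalities. The names `llr` and `fuse_llr` and the example probabilities are hypothetical.

```python
import math

def llr(p, eps=1e-6):
    """Log-likelihood ratio log(p / (1 - p)) of a class probability, clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fuse_llr(modality_probs):
    """Late fusion: sum per-class LLRs across modalities.

    modality_probs maps a modality name to a list of class probabilities.
    Returns (predicted_class, fused_scores, contributions), where
    contributions[name][k] is the LLR that modality `name` adds to class k.
    """
    contributions = {
        name: [llr(p) for p in probs] for name, probs in modality_probs.items()
    }
    n_classes = len(next(iter(contributions.values())))
    fused = [sum(c[k] for c in contributions.values()) for k in range(n_classes)]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return predicted, fused, contributions

# Hypothetical classifier outputs for three gesture classes.
predicted, fused, contributions = fuse_llr({
    "imu": [0.5, 0.3, 0.2],
    "capacitive": [0.1, 0.8, 0.1],
})
```

Because the fused score is a plain sum, each modality's share of a decision can be read directly from `contributions`. This additive structure is what makes such a scheme interpretable.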
Physics-Embedded Neural ODEs for Learning Antagonistic Pneumatic Artificial Muscle Dynamics
Pneumatic artificial muscles (PAMs) enable compliant actuation for soft wearable, assistive, and interactive robots. When arranged antagonistically, PAMs can provide variable impedance through co-contraction but exhibit coupled, nonlinear, and hysteretic dynamics that challenge modeling and control. This paper presents a hybrid neural ordinary differential equation (Neural ODE) framework that embeds physical structure into a learned model of antagonistic PAM dynamics. The formulation combines parametric joint mechanics and pneumatic state dynamics with a neural network force component that captures antagonistic coupling and rate-dependent hysteresis. The forward model predicts joint motion and chamber pressures with a mean R$^2$ of 0.88 across 225 co-contraction conditions. An inverse formulation, derived from the learned dynamics, computes pressure commands offline for desired motion and stiffness profiles, tracked in closed loop during execution. Experimental validation demonstrates reliable stiffness control across 126-176 N/mm and consistent impedance behavior across operating velocities, in contrast to a static model, which shows degraded stiffness consistency at higher velocities.
SpikingTac: A Miniaturized Neuromorphic Visuotactile Sensor for High-Precision Dynamic Tactile Imprint Tracking
High-speed event-driven tactile sensors are essential for achieving human-like dynamic manipulation, yet their integration is often limited by the bulkiness of standard event cameras. This paper presents SpikingTac, a miniaturized, highly integrated neuromorphic tactile sensor featuring a custom standalone event camera module, achieved with a total material cost of less than $150. We construct a global dynamic state map coupled with an unsupervised denoising network to enable precise tracking at a 1000 Hz perception rate and 350 Hz tracking frequency. Addressing the viscoelastic hysteresis of silicone elastomers, we propose a hysteresis-aware incremental update law with a spatial gain damping mechanism. Experimental results demonstrate exceptional zero-point stability, achieving a 100% return-to-origin success rate with a minimal mean bias of 0.8039 pixels, even under extreme torsional deformations. In dynamic tasks, SpikingTac limits the obstacle-avoidance overshoot to 6.2 mm, representing a 5-fold performance improvement over conventional frame-based sensors. Furthermore, the sensor achieves sub-millimeter geometric accuracy, with Root Mean Square Error (RMSE) of 0.0952 mm in localization and 0.0452 mm in radius measurement.
FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation
Force/torque feedback can substantially improve Vision-Language-Action (VLA) models on contact-rich manipulation, but most existing approaches fuse all modalities at a single operating frequency. This design ignores the mismatched sampling rates of real robot sensors, forcing downsampling of the high-frequency contact cues needed for reactive correction. Combined with common VLM-action-expert (AE) pipelines that execute action chunks largely open loop between expensive VLM updates, unified-frequency fusion often yields delayed responses to impacts, stick-slip, and force spikes. We propose FAVLA, a force-adaptive fast-slow VLA that decouples slow perception planning from fast contact-aware control. FAVLA runs a slow VLM at a fixed low frequency to encode modalities to produce latent representations and to predict near-future force variation. A fast AE then executes at a variable high frequency, conditioning on the latest force sequence data to generate reactive actions. We further introduce a force adapter that injects high-frequency force features into multiple AE layers, and adaptively schedules the AE's execution frequency based on the VLM's predicted force variation. Extensive experiments on contact-rich tasks demonstrate that FAVLA significantly outperforms baselines, achieving superior reactivity and success rates, especially with a smaller contact force during manipulation.
MicroPush: A Simulator and Benchmark for Contact-Rich Cell Pushing and Assembly with a Magnetic Rolling Microrobot
Magnetic rolling microrobots enable gentle manipulation in confined microfluidic environments, yet autonomy for contact-rich behaviors such as cell pushing and multi-target assembly remains difficult to develop and evaluate reproducibly. We present MicroPush, an open-source simulator and benchmark suite for magnetic rolling microrobots in cluttered 2D scenes. MicroPush combines an overdamped interaction model with contact-aware stick--slip effects, lightweight near-field damping, optional Poiseuille background flow, and a calibrated mapping from actuation frequency to free-space rolling speed. On top of the simulator core, we provide a modular planning--control stack with a two-phase strategy for contact establishment and goal-directed pushing, together with a deterministic benchmark protocol with fixed tasks, staged execution, and unified CSV logging for single-object transport and hexagonal assembly. We report success, time, and tracking metrics, and an actuation-variation measure $E_{Δω}$. Results show that controller stability dominates performance under flow disturbances, while planner choice can influence command smoothness over long-horizon sequences via waypoint progression. MicroPush enables reproducible comparison and ablation of planning, control, and learning methods for microscale contact-rich micromanipulation.
comment: 13 pages, 8 figures
KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. While it is possible to store and reuse the KV caches, the efficiency benefits are greatly undermined by frequent KV cache updates. In this paper, we propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. KEEP features three key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation through mixed-granularity memory groups; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among different memory groups and reconstructs memory interactions iteratively; (3) a Layer-balanced Memory Loading scheme that eliminates unbalanced KV cache loading and cross-attention computation across different layers. Extensive experimental results demonstrate that KEEP achieves a 2.68x speedup with negligible accuracy loss compared with text-based memory methods on the ALFRED dataset. Compared with the KV re-computation method CacheBlend (EuroSys'25), KEEP shows a 4.13% success rate improvement and a 1.90x time-to-first-token (TTFT) reduction. Our code is available at https://github.com/PKU-SEC-Lab/KEEP_Embodied_Memory.
comment: DAC 2026
VCA: Vision-Click-Action Framework for Precise Manipulation of Segmented Objects in Target Ambiguous Environments
The reliance on language in Vision-Language-Action (VLA) models introduces ambiguity, cognitive overhead, and difficulties in precise object identification and sequential task execution, particularly in environments with multiple visually similar objects. To address these limitations, we propose Vision-Click-Action (VCA), a framework that replaces verbose textual commands with direct, click-based visual interaction using pretrained segmentation models. By allowing operators to specify target objects clearly through visual selection in the robot's 2D camera view, VCA reduces interpretation errors, lowers cognitive load, and provides a practical and scalable alternative to language-driven interfaces for real-world robotic manipulation. Experimental results validate that the proposed VCA framework achieves effective instance-level manipulation of specified target objects. Experiment videos are available at https://robrosinc.github.io/vca/.
comment: Submitted to UR 2026
Tilt-X: Enabling Compliant Aerial Manipulation through a Tiltable-Extensible Continuum Manipulator ICRA
Aerial manipulators extend the reach and manipulation capabilities of uncrewed multirotor aerial vehicles for inspection, agriculture, sampling, and delivery. Continuum arm aerial manipulation systems offer lightweight, dexterous, and compliant interaction opportunities. Existing designs allow manipulation only below the UAV, which restricts their deployability in multiple directions and through clutter. They are also sensitive to propeller downwash. Addressing these limitations, we present Tilt-X, a continuum arm aerial manipulator that integrates a tilting mechanism, a telescopic stage, and a cable-driven continuum section. We present its design and kinematic model and validate it through flight demonstrations. Tilt-X enables a volumetric workspace with up to 75 mm extension and planar orientations between 0$^\circ$ and 90$^\circ$. Experiments comparing end effector pose with and without downwash quantitatively measure its accuracy, providing critical evidence to guide the design and control of reliable aerial manipulators. Results show stabilisation of end effector pose as the manipulator extends out of the propeller influence zone.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Acoustic Sensing for Universal Jamming Grippers ICRA 2026
Universal jamming grippers excel at grasping unknown objects due to their compliant bodies. Traditional tactile sensors can compromise this compliance, reducing grasping performance. We present acoustic sensing as a form of morphological sensing, where the gripper's soft body itself becomes the sensor. A speaker and microphone are placed inside the gripper cavity, away from the deformable membrane, fully preserving compliance. Sound propagates through the gripper and object, encoding object properties, which are then reconstructed via machine learning. Our sensor achieves high spatial resolution in sensing object size (2.6 mm error) and orientation (0.6 deg error), remains robust to external noise levels of 80 dBA, and discriminates object materials (up to 100% accuracy) and 16 everyday objects (85.6% accuracy). We validate the sensor in a realistic tactile object sorting task, achieving 53 minutes of uninterrupted grasping and sensing, confirming the preserved grasping performance. Finally, we demonstrate that disentangled acoustic representations can be learned, improving robustness to irrelevant acoustic variations.
comment: Accepted at ICRA 2026, supplementary material under https://rbo.gitlab-pages.tu-berlin.de/papers/acoustic-jamming-icra26/
Layered Safety: Enhancing Autonomous Collision Avoidance via Multistage CBF Safety Filters
This paper presents a general end-to-end framework for constructing robust and reliable layered safety filters that can be leveraged to perform dynamic collision avoidance over a broad range of applications using only local perception data. Given a robot-centric point cloud, we begin by constructing an occupancy map which is used to synthesize a Poisson safety function (PSF). The resultant PSF is employed as a control barrier function (CBF) within two distinct safety filtering stages. In the first stage, we propose a predictive safety filter to compute optimal safe trajectories based on nominal potentially-unsafe commands. The resultant short-term plans are constrained to satisfy the CBF condition along a finite prediction horizon. In the second stage, instantaneous velocity commands are further refined by a real-time CBF-based safety filter and tracked by the full-order low-level robot controller. Assuming accurate tracking of velocity commands, we obtain formal guarantees of safety for the full-order system. We validate the optimality and robustness of our multistage architecture, in comparison to traditional single-stage safety filters, via a detailed Pareto analysis. We further demonstrate the effectiveness and generality of our collision avoidance methodology on multiple legged robot platforms across a variety of real-world dynamic scenarios.
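The second-stage, instantaneous CBF safety filter described above can be sketched for the simplest possible case. The assumptions here are loud ones: a single-integrator robot (x_dot = u), one circular obstacle, and the hand-written barrier h(x) = ||x - c||^2 - r^2 in place of the Poisson safety function synthesized from an occupancy map in the paper. The function name `cbf_filter` and all parameters are hypothetical. With a single constraint, the usual quadratic program has a closed-form solution, so no solver is needed.

```python
def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that dh/dt + alpha * h >= 0 for x_dot = u.

    h(x) = ||x - center||^2 - radius^2 is positive outside the obstacle.
    The single-constraint QP
        min ||u - u_nom||^2   s.t.   grad_h . u + alpha * h >= 0
    is solved in closed form by projecting u_nom onto the constraint boundary.
    Assumes x != center so that the gradient is non-zero.
    """
    dx = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in dx) - radius ** 2
    grad = [2.0 * d for d in dx]
    slack = sum(g * u for g, u in zip(grad, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal command already satisfies the CBF condition
    norm_sq = sum(g * g for g in grad)
    scale = -slack / norm_sq
    return [u + scale * g for u, g in zip(u_nom, grad)]

# Robot at (-2, 0) commanded straight at a unit-radius obstacle at the origin.
u_safe = cbf_filter(x=[-2.0, 0.0], u_nom=[2.0, 0.0], center=[0.0, 0.0], radius=1.0)
# Commands that move away from the obstacle pass through unchanged.
u_free = cbf_filter(x=[-2.0, 0.0], u_nom=[-1.0, 0.0], center=[0.0, 0.0], radius=1.0)
```

In this example the filter slows the approach speed from 2.0 to 0.75, which is exactly the value at which the CBF condition holds with equality.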
Geometric Look-Angle Shaping Strategy for Enclosed Inspection
This paper introduces inspection through GLASS, a Geometric Look-Angle Shaping Strategy for enclosed regions using unmanned aerial vehicles. In doing so, the vehicle's guidance command is constructed through a bounded, geometry-consistent shaping of the look angle relative to a desired standoff path. By embedding a smooth, hyperbolic-tangent-type shaping function within a polar geometric framework, GLASS ensures global existence of the guidance dynamics and avoids the far-field limitations inherent to conventional formulations. Lyapunov stability analysis establishes asymptotic convergence to a prescribed inspection standoff under explicit curvature feasibility conditions, along with analytical settling-time characteristics. The proposed strategy incorporates maximum turn-rate constraints without inducing singularities throughout the workspace. High-fidelity six-degree-of-freedom quadrotor simulations demonstrate the effectiveness of GLASS in representative enclosed inspection scenarios, highlighting a practically viable guidance framework for autonomous enclosed inspection missions.
comment: Preprint submitted to ICUAS 2026
Off-Road Navigation via Implicit Neural Representation of Terrain Traversability
Autonomous off-road navigation requires robots to estimate terrain traversability from onboard sensors and plan motion accordingly. Conventional approaches typically rely on sampling-based planners such as MPPI to generate short-term control actions that aim to minimize traversal time and risk measures derived from the traversability estimates. These planners can react quickly but optimize only over a short look-ahead window, limiting their ability to reason about the full path geometry, which is important for navigating in challenging off-road environments. Moreover, they lack the ability to adjust speed based on the terrain-induced vibrations, which is important for smooth navigation on challenging terrains. In this paper, we introduce TRAIL (Traversability with an Implicit Learned Representation), an off-road navigation framework that leverages an implicit neural representation to model terrain properties as a continuous field that can be queried at arbitrary locations. This representation yields spatial gradients that enable integration with a novel gradient-based trajectory optimization method that adapts the path geometry and speed profile based on terrain traversability.
comment: Full version: 10 pages
HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in long-horizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction within VLA models to reason, but still fail to offer a unified human-like reasoning framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To enable HALO learning at scale, we introduce an automated pipeline to synthesize EM-CoT training data along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing baseline policy pi_0 by 34.1% on RoboTwin benchmark; (2) all proposed components of the training recipe and EM-CoT design help improve task success rate; and (3) HALO exhibits strong generalization capabilities under aggressive unseen environmental randomization with our proposed EM-CoT reasoning.
System Design of the Ultra Mobility Vehicle: A Driving, Balancing, and Jumping Bicycle Robot
Trials cyclists and mountain bike riders can hop, jump, balance, and drive on one or both wheels. This versatility allows them to achieve speed and energy-efficiency on smooth terrain and agility over rough terrain. Inspired by these athletes, we present the design and control of a robotic platform, Ultra Mobility Vehicle (UMV), which combines a bicycle and a reaction mass to move dynamically with minimal actuated degrees of freedom. We employ a simulation-driven design optimization process to synthesize a spatial linkage topology with a focus on vertical jump height and momentum-based balancing on a single wheel contact. Using a constrained Reinforcement Learning (RL) framework, we demonstrate zero-shot transfer of diverse athletic behaviors, including track-stands, jumps, wheelies, rear wheel hopping, and front flips. This 23.5 kg robot is capable of high speeds (8 m/s) and jumping on and over large obstacles (1 m tall, or 130% of the robot's nominal height).
comment: 19 Pages, 11 figures, 3 movies, 2 tables
Agile legged locomotion in reconfigurable modular robots
Legged machines are becoming increasingly agile and adaptive but they have so far lacked the morphological diversity of legged animals, which have been rearranged and reshaped to fill millions of niches. Unlike their biological counterparts, legged machines have largely converged over the past decade to canonical quadrupedal and bipedal architectures that cannot be easily reconfigured to meet new tasks or recover from injury. Here we introduce autonomous modular legs: agile yet minimal, single-degree-of-freedom jointed links that can learn complex dynamic behaviors and may be freely attached to form multilegged machines at the meter scale. This enables rapid repair, redesign, and recombination of highly-dynamic modular agents that move quickly and acrobatically (non-quasistatically) through unstructured environments. Because each module is itself a complete agent, the bodies that contain them can sustain deep structural damage that would completely disable other legged robots. We also show how to encode the vast space of possible body configurations into a compact latent design space that can be efficiently explored, revealing a wide diversity of novel legged forms.
Mixed formulation and structure-preserving discretization of Cosserat rod dynamics in a port-Hamiltonian framework
An energy-based modeling framework for the nonlinear dynamics of spatial Cosserat rods undergoing large displacements and rotations is proposed. The mixed formulation features independent displacement, velocity and stress variables and is further objective and locking-free. Finite rotations are represented using a director formulation that avoids singularities and yields a constant mass matrix. This results in an infinite-dimensional nonlinear port-Hamiltonian (PH) system governed by partial differential-algebraic equations with a quadratic energy functional. Using a time-differentiated compliance form of the stress-strain relations allows for the imposition of kinematic constraints, such as inextensibility or shear-rigidity. A structure-preserving finite element discretization leads to a finite-dimensional system with PH structure, thus facilitating the design of an energy-momentum consistent integration scheme. Dissipative material behavior (via the generalized-Maxwell model) and non-standard actuation approaches (via pneumatic chambers or tendons) integrate naturally into the framework. As illustrated by selected numerical examples, the present framework establishes a new approach to energy-momentum consistent formulations in computational mechanics involving finite rotations.
comment: 39 pages, 16 figures
Apple: Toward General Active Perception via Reinforcement Learning ICLR 2026
Active perception is a fundamental skill that enables us humans to deal with uncertainty in our inherently partially observable environment. For senses such as touch, where the information is sparse and local, active perception becomes crucial. In recent years, active perception has emerged as an important research domain in robotics. However, current methods are often bound to specific tasks or make strong assumptions, which limit their generality. To address this gap, this work introduces APPLE (Active Perception Policy Learning) - a novel framework that leverages reinforcement learning (RL) to address a range of different active perception problems. APPLE jointly trains a transformer-based perception module and decision-making policy with a unified optimization objective, learning how to actively gather information. By design, APPLE is not limited to a specific task and can, in principle, be applied to a wide range of active perception problems. We evaluate two variants of APPLE across different tasks, including tactile exploration problems from the Tactile MNIST benchmark. Experiments demonstrate the efficacy of APPLE, achieving high accuracies on both regression and classification tasks. These findings underscore the potential of APPLE as a versatile and general framework for advancing active perception in robotics. Project page: https://timschneider42.github.io/apple
comment: 27 pages; 21 figures; accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)
Model Predictive Control with Reference Learning for Soft Robotic Intracranial Pressure Waveform Modulation
This paper introduces a learning-based control framework for a soft robotic actuator system designed to modulate intracranial pressure (ICP) waveforms, which is essential for studying cerebrospinal fluid dynamics and pathological processes underlying neurological disorders. A two-layer framework is proposed to safely achieve a desired ICP waveform modulation. First, a model predictive controller (MPC) with a disturbance observer is used for offset-free tracking of the system's motor position reference trajectory under safety constraints. Second, to address the unknown nonlinear dependence of ICP on the motor position, we employ a Bayesian optimization (BO) algorithm for online learning of a motor position reference trajectory that yields the desired ICP modulation. The framework is experimentally validated using a test bench with a brain phantom that replicates realistic ICP dynamics in vitro. Compared to a previously employed proportional-integral-derivative controller, the MPC reduces mean and maximum motor position reference tracking errors by 83% and 73%, respectively. In less than 20 iterations, the BO algorithm learns a motor position reference trajectory that yields an ICP waveform with the desired mean and amplitude.
Parallel Continuous-Time Relative Localization with Augmented Clamped Non-Uniform B-Splines
Accurate relative localization is critical for multi-robot cooperation. In robot swarms, measurements from different robots arrive asynchronously and with clock time-offsets. Although Continuous-Time (CT) formulations have proved effective for handling asynchronous measurements in single-robot SLAM and calibration, extending CT methods to multi-robot settings faces great challenges in achieving high-accuracy, low-latency, and high-frequency performance. In particular, existing CT methods suffer from the inherent query-time delay of unclamped B-splines and high computational cost. This paper proposes CT-RIO, a novel Continuous-Time Relative-Inertial Odometry framework. We employ Clamped Non-Uniform B-splines (C-NUBS) to represent robot states for the first time, eliminating the query-time delay. We further augment C-NUBS with closed-form extension and shrinkage operations that preserve the spline shape, making it suitable for online estimation and enabling flexible knot management. This flexibility leads to a knot-keyknot strategy, which supports spline extension at high frequency while retaining sparse keyknots for adaptive relative-motion modeling. We then formulate a sliding-window relative localization problem that operates purely on relative kinematics and inter-robot constraints. To meet the demanding computation required at swarm scale, we decompose the tightly-coupled optimization into robot-wise sub-problems and solve them in parallel using incremental asynchronous block coordinate descent. Extensive experiments show that CT-RIO converges from time-offsets as large as 263 ms to sub-millisecond within 3 s, and achieves RMSEs of 0.046 m and 1.8°. It consistently outperforms state-of-the-art methods, with improvements of up to 60% under high-speed motion.
comment: 26 pages, 23 figures, submitted to IEEE Transactions on Robotics
Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues
The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa
comment: This paper stems from a split of our earlier work "When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning." While "The Temporal Trap" replaces the original and focuses on temporal entanglement, this companion study examines policy robustness and task-relevant visual cue selection. arXiv admin note: text overlap with arXiv:2502.03270
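The idea of a trainable pooling mechanism that attends to task-relevant patches can be sketched as single-query softmax attention over patch tokens. This is a simplified illustration under stated assumptions, not the AFA architecture itself: it omits any key or value projections, uses a fixed query where a trained policy would learn one, and the name `attentive_pool` and the example tokens are hypothetical.

```python
import math

def attentive_pool(tokens, query):
    """Pool patch tokens with softmax attention from a single query vector.

    tokens: list of equal-length feature vectors (one per image patch).
    query:  vector of the same length; in a trained policy it would be learned.
    Returns (pooled_feature, attention_weights).
    """
    d = len(query)
    # Scaled dot-product score between the query and each patch token.
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    # Numerically stable softmax over patches.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of tokens replaces uniform (mean) pooling.
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights

# One "task-relevant" patch and two "distractor" patches (hypothetical features).
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pooled, weights = attentive_pool(tokens, query=[4.0, 0.0])
```

Mean pooling would weight all three patches equally. Here the query aligns with the first token, so most of the weight concentrates on it and the distractors contribute little to the pooled feature.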
Distributed Lloyd-Based algorithm for uncertainty-aware multi-robot under-canopy flocking
In this letter, we present a distributed algorithm for flocking in complex environments that operates at constant altitude, without explicit communication or a priori information about the environment, using only on-board sensing and computation capabilities. We provide sufficient conditions to guarantee collision avoidance with obstacles and other robots without exceeding a desired maximum distance from a predefined set of neighbors (flocking or proximity maintenance constraint) during the mission. The proposed approach can operate in crowded scenarios and explicitly accounts for tracking errors and on-board sensing errors. The algorithm was verified through simulations with varying numbers of UAVs and through numerous real-world experiments in a dense forest involving up to four UAVs.
Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control
Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.
comment: 15 main pages, 8 supplementary material pages
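A performance-aware curriculum driven by an exponential moving average can be sketched as follows. The abstract does not give the update rule, so everything here is an assumption made for illustration: the function name `update_curriculum`, the threshold `target_return`, the fixed `step`, and the rule of raising the perturbation probability while the smoothed return stays at or above the threshold.

```python
def update_curriculum(ema, prob, episode_return, target_return,
                      beta=0.9, step=0.05, max_prob=1.0):
    """One curriculum step: update the EMA of returns, then adjust the perturbation probability.

    Hypothetical rule: raise the probability while the smoothed return stays at
    or above target_return, and lower it otherwise so the policy can recover.
    """
    ema = beta * ema + (1.0 - beta) * episode_return
    if ema >= target_return:
        prob = min(max_prob, prob + step)
    else:
        prob = max(0.0, prob - step)
    return ema, prob

# Policy is coping well: the perturbation probability is increased.
ema_up, prob_up = update_curriculum(ema=100.0, prob=0.2,
                                    episode_return=100.0, target_return=90.0)
# A collapsed episode drags the EMA below the target: the probability is reduced.
ema_down, prob_down = update_curriculum(ema=100.0, prob=0.2,
                                        episode_return=0.0, target_return=95.0)
```

The smoothing factor `beta` controls how quickly the curriculum reacts. A larger value makes the schedule less sensitive to single bad episodes.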
Motion-aware Event Suppression for Event Cameras
In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward AAAI 2026
Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic. First, to ensure reliable policy improvement, the actor is trained with an asymmetric update rule, learning exclusively from successful trajectories. Second, to enable effective value learning despite sparse rewards, the critic's update is stabilized using intra-chunk $n$-step returns and further enriched by a self-supervised module providing intrinsic rewards at anchor points aligned with each action chunk. We conducted extensive experiments on 25 tasks from the BiGym and RLBench benchmarks. Results show that by using only a few demonstrations and a simple model architecture, AC3 achieves superior success rates on most tasks, validating its effective design.
comment: 14 pages, 13 figures, Accepted by AAAI 2026 (oral)
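As a rough illustration of the n-step returns mentioned in the AC3 abstract above, the sketch below computes a generic discounted n-step return over the rewards collected inside one action chunk, bootstrapped with a value estimate for the state reached at the end of the chunk. The function name and the default discount are assumptions; AC3's actual estimator and its intrinsic-reward module are not reproduced here.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted return over one chunk: sum_k gamma^k r_k + gamma^n V(s_end)."""
    g = bootstrap_value
    # Fold the rewards in from the last step of the chunk back to the first.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With a sparse reward, most entries of `rewards` are zero, so the target is dominated by the bootstrapped value unless the chunk itself reaches the goal.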
SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation
Embodied navigation that adheres to social norms remains an open research challenge. Our SocialNav is a foundational model for socially-aware navigation with a hierarchical "brain-action" architecture, capable of understanding high-level social norms and generating low-level, socially compliant trajectories. To enable such dual capabilities, we construct the SocNav Dataset, a large-scale collection of 7 million samples, comprising (1) a Cognitive Activation Dataset providing social reasoning signals such as chain-of-thought explanations and social traversability prediction, and (2) an Expert Trajectories Pyramid aggregating diverse navigation demonstrations from internet videos, simulated environments, and real-world robots. A multi-stage training pipeline is proposed to gradually inject and refine navigation intelligence: we first inject general navigation skills and social norms understanding into the model via imitation learning, and then refine such skills through a deliberately designed Socially-Aware Flow Exploration GRPO (SAFE-GRPO), the first flow-based reinforcement learning framework for embodied navigation that explicitly rewards socially compliant behaviors. SocialNav achieves +38% success rate and +46% social compliance rate compared to the state-of-the-art method, demonstrating strong gains in both navigation performance and social compliance. Our project page: https://amap-eai.github.io/SocialNav/
DAGS-SLAM: Dynamic-Aware 3DGS SLAM via Spatiotemporal Motion Probability and Uncertainty-Aware Scheduling
Mobile robots and IoT devices demand real-time localization and dense reconstruction under tight compute and energy budgets. While 3D Gaussian Splatting (3DGS) enables efficient dense SLAM, dynamic objects and occlusions still degrade tracking and mapping. Existing dynamic 3DGS-SLAM often relies on heavy optical flow and per-frame segmentation, which is costly for mobile deployment and brittle under challenging illumination. We present DAGS-SLAM, a dynamic-aware 3DGS-SLAM system that maintains a spatiotemporal motion probability (MP) state per Gaussian and triggers semantics on demand via an uncertainty-aware scheduler. DAGS-SLAM fuses lightweight YOLO instance priors with geometric cues to estimate and temporally update MP, propagates MP to the front-end for dynamic-aware correspondence selection, and suppresses dynamic artifacts in the back-end via MP-guided optimization. Experiments on public dynamic RGB-D benchmarks show improved reconstruction and robust tracking while sustaining real-time throughput on a commodity GPU, demonstrating a practical speed-accuracy tradeoff with reduced semantic invocations toward mobile deployment.
CLEAR-IR: Clarity-Enhanced Active Reconstruction of Infrared Imagery
This paper presents a novel approach for enabling robust robotic perception in dark environments using infrared (IR) streams. IR streams are less susceptible to noise than RGB in low-light conditions. However, they are dominated by active emitter patterns that hinder high-level tasks such as object detection, tracking and localisation. To address this, a Deep Multi-scale Aware Overcomplete (DeepMAO) inspired architecture is proposed that reconstructs clean IR images from emitter-populated input, improving both image quality and downstream robotic performance. This approach outperforms existing enhancement techniques and enables reliable operation of vision-driven robotic systems across illumination conditions from well-lit to extreme low-light scenes. The results show that the method can mimic the RGB styling of the scene and can be applied to robotics tasks trained on RGB images, opening the possibility of performing these tasks in extreme low light without on-board lighting.
comment: 8 pages, 6 figures, 2 tables
SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios
Autonomous agents operating in the real world must interact continuously with existing physical and semantic infrastructure, track delayed consequences, and verify outcomes over time. Everyday environments are rich in tangible control interfaces (TCIs), e.g., light switches, appliance panels, and embedded GUIs, posing core challenges for lifelong embodied agents, including partial observability, causal reasoning across time, and failure-aware verification under real-world constraints. Yet, current benchmarks rarely consider such long-horizon interaction and causality requirements. We introduce SWITCH (Semantic World Interface Tasks for Control & Handling), an embodied, task-driven benchmark created through iterative releases to probe these gaps. Its first iteration, SWITCH-Basic, evaluates five complementary abilities (task-aware VQA, semantic UI grounding, action generation, state transition prediction, and result verification) under ego-centric RGB video input and device diversity across 351 tasks spanning 98 real devices/appliances. Results from commercial and open LMMs reveal systematic failures, highlighting critical gaps for lifelong agent deployment. SWITCH provides data, code, and held-out splits to enable reproducible non-contaminated evaluation and community contributions toward more challenging future iterations of the benchmark and the creation of relevant training data. Benchmark resources are available at: https://github.com/BAAI-Agents/SWITCH.
RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence
While data-driven imitation learning has revolutionized robotic manipulation, current approaches remain constrained by the scarcity of large-scale, diverse real-world demonstrations. Consequently, the ability of existing models to generalize across long-horizon bimanual tasks and mobile manipulation in unstructured environments remains limited. To bridge this gap, we present RoboMIND 2.0, a comprehensive real-world dataset comprising over 310K dual-arm manipulation trajectories collected across six distinct robot embodiments and 739 complex tasks. Crucially, to support research in contact-rich and spatially extended tasks, the dataset incorporates 12K tactile-enhanced episodes and 20K mobile manipulation trajectories. Complementing this physical data, we construct high-fidelity digital twins of our real-world environments, releasing an additional 20K-trajectory simulated dataset to facilitate robust sim-to-real transfer. To fully exploit the potential of RoboMIND 2.0, we propose the MIND-2 system, a hierarchical dual-system framework optimized via offline reinforcement learning. MIND-2 integrates a high-level semantic planner (MIND-2-VLM) to decompose abstract natural language instructions into grounded subgoals, coupled with a low-level Vision-Language-Action executor (MIND-2-VLA), which generates precise, proprioception-aware motor actions.
Less is more -- the Dispatcher/ Executor principle for multi-task Reinforcement Learning
Humans instinctively know how to neglect details when it comes to solving complex decision-making problems in environments with unforeseeable variations. This abstraction process seems to be a vital property for most biological systems and helps to 'abstract away' unnecessary details and boost generalisation. In this work we introduce the dispatcher/executor principle for the design of multi-task Reinforcement Learning controllers. It suggests partitioning the controller into two entities, one that understands the task (the dispatcher) and one that computes the controls for the specific device (the executor), and connecting the two by a strongly regularizing communication channel. The core rationale behind this position paper is that changes in structure and design principles can improve generalisation properties and drastically improve data-efficiency. It is in some sense a 'yes, and ...' response to the current trend of using large neural networks trained on vast amounts of data and betting on emerging generalisation properties. While we agree on the power of scaling, in the sense of Sutton's 'bitter lesson', we give some evidence that considering structure and adding design principles can be a valuable and critical component, in particular when data is not abundant but a precious resource.
comment: Videos showing the results can be found at https://sites.google.com/view/dispatcher-executor
Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving
In this work, we reconceptualize autonomous driving as a generalized language problem and formulate the trajectory planning task as next waypoint prediction. We introduce Max-V1, a novel framework for one-stage end-to-end autonomous driving, named in tribute to the renowned Dutch racing driver Max Verstappen. Our framework presents a single-pass generation paradigm that aligns with the inherent sequentiality of driving. This approach leverages the generative capacity of the Vision-Language Model (VLM) to enable end-to-end trajectory prediction directly from front-view camera input. The efficacy of this method is underpinned by a principled supervision strategy derived from statistical modeling. This provides a well-defined learning objective, which makes the framework highly amenable to mastering complex driving policies through imitation learning from large-scale expert demonstrations. Empirically, our method achieves state-of-the-art performance on the nuScenes dataset, delivering an overall improvement of over 30% compared to prior baselines. Furthermore, it exhibits superior generalization performance on cross-domain datasets acquired from diverse vehicles, demonstrating notable potential for cross-vehicle robustness and adaptability. With these empirical strengths, this work introduces a model that enables fundamental driving behaviors, laying the foundation for the development of more capable self-driving agents. Code will be available upon publication.
Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. In physical robot trials with 18 unique human participants, MICoBot significantly improves task success and user experience over a pure LLM baseline and standard agent allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/.
comment: Project website at https://robin-lab.cs.utexas.edu/MicoBot/
Beyond Ground: Map-Free LiDAR Relocalization for UAVs
Localization is a fundamental capability in unmanned aerial vehicle (UAV) systems. Map-free LiDAR relocalization offers an effective solution for achieving high-precision positioning in environments with weak or unavailable GNSS signals. However, existing LiDAR relocalization methods are primarily tailored to autonomous driving, exhibiting significantly degraded accuracy in UAV scenarios. In this paper, we propose MAILS, a novel map-free LiDAR relocalization framework for UAVs. A Locality-Preserving Sliding Window Attention module is first introduced to extract locally discriminative geometric features from sparse point clouds. To handle substantial yaw rotations and altitude variations encountered during UAV flight, we then design a coordinate-independent feature initialization module and a locally invariant positional encoding mechanism, which together significantly enhance the robustness of feature extraction. Furthermore, existing LiDAR-based relocalization datasets fail to capture real-world UAV flight characteristics, such as irregular trajectories and varying altitudes. To address this gap, we construct a large-scale LiDAR localization dataset for UAVs, which comprises four scenes and various flight trajectories, designed to evaluate UAV relocalization performance under realistic conditions. Extensive experiments demonstrate that our method achieves satisfactory localization precision and consistently outperforms existing techniques by a significant margin. Our code and dataset will be released soon.
comment: 18 pages, 16 figures
BEV-VLM: Trajectory Planning via Unified BEV Abstraction
This paper introduces BEV-VLM, a novel approach for trajectory planning in autonomous driving that leverages Vision-Language Models (VLMs) with Bird's-Eye View (BEV) feature maps as visual input. Unlike conventional trajectory planning approaches that rely solely on raw visual data (e.g., camera images), our method utilizes a highly compressed and informative BEV representation generated by fusing camera and LiDAR data, with subsequent alignment to High-Definition (HD) maps. This unified BEV-HD map format provides a geometrically consistent and semantically rich scene description, which enables VLMs to perform accurate and robust trajectory planning. Experimental results on the nuScenes dataset demonstrate that, compared with state-of-the-art vision-only methods, our approach achieves a 53.1% improvement in planning accuracy and realizes complete collision avoidance in evaluation scenarios. Our work highlights that VLMs can effectively interpret processed visual representations such as BEV features, expanding their applicability beyond raw image inputs for the task of trajectory planning.
Embodiment-Aware Generalist Specialist Distillation for Unified Humanoid Whole-Body Control
Humanoid Whole-Body Controllers trained with reinforcement learning (RL) have recently achieved remarkable performance, yet many target a single robot embodiment. Variations in dynamics, degrees of freedom (DoFs), and kinematic topology still hinder a single policy from commanding diverse humanoids. Moreover, obtaining a generalist policy that not only transfers across embodiments but also supports richer behaviors (beyond simple walking, to squatting and leaning) remains especially challenging. In this work, we tackle these obstacles by introducing EAGLE, an iterative generalist-specialist distillation framework that produces a single unified policy that controls multiple heterogeneous humanoids without per-robot reward tuning. During each cycle, embodiment-specific specialists are forked from the current generalist and refined on their respective robots, and new skills are distilled back into the generalist by training on the pooled embodiment set. Repeating this loop until performance converges produces a robust Whole-Body Controller validated on robots such as Unitree H1, G1, and Fourier N1. We conducted experiments on five different robots in simulation and four in real-world settings. In quantitative evaluations, EAGLE achieves high tracking accuracy and robustness compared to other methods, marking a step toward scalable, fleet-level humanoid control. See more details at https://eagle-wbc.github.io/
Human Autonomy and Sense of Agency in Human-Robot Interaction: A Systematic Literature Review
Human autonomy and sense of agency are increasingly recognised as critical for user well-being, motivation, and the ethical deployment of robots in human-robot interaction (HRI). Given the rapid development of artificial intelligence, robot capabilities and their potential to function as colleagues and companions are growing. This systematic literature review synthesises 22 empirical studies selected from an initial pool of 728 articles published between 2011 and 2024. Articles were retrieved from major scientific databases and identified based on empirical focus and conceptual relevance, namely, how to preserve and promote human autonomy and sense of agency in HRI. Derived through thematic synthesis, five clusters of potentially influential factors are revealed: robot adaptiveness, communication style, anthropomorphism, presence of a robot, and individual differences. Measured through psychometric scales or the intentional binding paradigm, perceptions of autonomy and agency varied across industrial, educational, healthcare, care, and hospitality settings. The review underscores the theoretical differences between the two concepts, yet their use in HRI remains entangled. Despite increasing interest, the current body of empirical evidence remains limited and fragmented, underscoring the necessity for standardised definitions, more robust operationalisations, and further exploratory and qualitative research. By identifying existing gaps and highlighting emerging trends, this review contributes to the development of human-centered, autonomy-supportive robot design strategies that uphold ethical and psychological principles, ultimately supporting well-being in human-robot interaction.
Automating the Refinement of Reinforcement Learning Specifications
Logical specifications have been shown to help reinforcement learning algorithms achieve complex tasks. However, when a task is under-specified, agents might fail to learn useful policies. In this work, we explore the possibility of improving coarse-grained logical specifications via an exploration-guided strategy. We propose AutoSpec, a framework that searches for a logical specification refinement whose satisfaction implies satisfaction of the original specification, but which provides additional guidance, thereby making it easier for reinforcement learning algorithms to learn useful policies. AutoSpec is applicable to reinforcement learning tasks specified via the SpectRL specification logic. We exploit the compositional nature of specifications written in SpectRL, and design four refinement procedures that modify the abstract graph of the specification by either refining its existing edge specifications or by introducing new edge specifications. We prove that all four procedures maintain specification soundness, i.e. any trajectory satisfying the refined specification also satisfies the original. We then show how AutoSpec can be integrated with existing reinforcement learning algorithms for learning policies from logical specifications. Our experiments demonstrate that using the refined logical specifications produced by AutoSpec yields promising improvements in the complexity of control tasks that can be solved.
comment: Fourteenth International Conference on Learning Representations 2026 https://ambadkar.com/autospec
CO^3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving
Unsupervised contrastive learning for indoor-scene point clouds has achieved great success. However, unsupervised learning on point clouds in outdoor scenes remains challenging because previous methods need to reconstruct the whole scene and capture partial views for the contrastive objective. This is infeasible in outdoor scenes with moving objects, obstacles, and sensors. In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representations for outdoor-scene point clouds in an unsupervised manner. CO^3 has several merits compared to existing methods. (1) It utilizes LiDAR point clouds from the vehicle side and the infrastructure side to build views that differ sufficiently while maintaining common semantic information for contrastive learning, and that are more appropriate than views built by previous methods. (2) Alongside the contrastive objective, shape context prediction is proposed as a pre-training goal; it brings more task-relevant information to unsupervised 3D point cloud representation learning, which is beneficial when transferring the learned representation to downstream detection tasks. (3) Compared to previous methods, the representation learned by CO^3 can be transferred to different outdoor-scene datasets collected by different types of LiDAR sensors. (4) CO^3 improves current state-of-the-art methods on both the Once and KITTI datasets by up to 2.58 mAP. We believe CO^3 will facilitate understanding of LiDAR point clouds in outdoor scenes.
DropVLA: An Action-Level Backdoor Attack on Vision--Language--Action Models
Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
comment: 8 pages, 6 tables, 3 figures. Under review
Generalized Momenta-Based Koopman Formalism for Robust Control of Euler-Lagrangian Systems
This paper presents a novel Koopman operator formulation for Euler-Lagrangian dynamics that employs an implicit generalized momentum-based state space representation, which decouples a known linear actuation channel from state-dependent dynamics and makes the system more amenable to linear Koopman modeling. By leveraging this structural separation, the proposed formulation only requires learning the unactuated dynamics rather than the complete actuation-dependent system, thereby significantly reducing the number of learnable parameters, improving data efficiency, and lowering overall model complexity. In contrast, conventional explicit formulations inherently couple inputs with the state-dependent terms in a nonlinear manner, making them more suitable for bilinear Koopman models, which are more computationally expensive to train and deploy. Notably, the proposed scheme enables the formulation of linear models that achieve superior prediction performance compared to conventional bilinear models while remaining substantially more efficient. To realize this framework, we present two neural network architectures that construct Koopman embeddings from actuated or unactuated data, enabling flexible and efficient modeling across different tasks. Robustness is ensured through the integration of a linear Generalized Extended State Observer (GESO), which explicitly estimates disturbances and compensates for them in real time. The combined momentum-based Koopman and GESO framework is validated through comprehensive trajectory tracking simulations and experiments on robotic manipulators, demonstrating superior accuracy, robustness, and learning efficiency relative to state-of-the-art alternatives.
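The decoupling claimed in the abstract above can be made concrete with the textbook generalized-momentum form of rigid-body dynamics. The derivation below is a standard result, stated under the assumption that the Coriolis matrix C is built from Christoffel symbols so that the time derivative of M equals C plus its transpose; the symbols M, C, g, and tau are the usual ones and are not taken from the paper, whose exact state representation may differ.

```latex
\begin{aligned}
  &M(q)\,\ddot q + C(q,\dot q)\,\dot q + g(q) = \tau
      && \text{(explicit form: } \ddot q = M(q)^{-1}(\tau - \dots)\text{)} \\
  &p := M(q)\,\dot q
      && \text{(generalized momentum)} \\
  &\dot p = \dot M(q)\,\dot q + M(q)\,\ddot q
          = \bigl(C + C^{\top}\bigr)\dot q + \tau - C\,\dot q - g(q) \\
  &\dot p = C(q,\dot q)^{\top}\dot q - g(q) + \tau
      && \text{(input enters through the identity matrix)}
\end{aligned}
```

In the explicit form the input is multiplied by the state-dependent inverse inertia matrix, whereas in the momentum form it appears additively with a constant gain, which is the property that makes a linear-in-input model plausible.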
DECO: Decoupled Multimodal Diffusion Transformer for Bimanual Dexterous Manipulation with a Plugin Tactile Adapter
Bimanual dexterous manipulation relies on integrating multimodal inputs to perform complex real-world tasks. To address the challenges of effectively combining these modalities, we propose DECO, a decoupled multimodal diffusion transformer that disentangles vision, proprioception, and tactile signals through specialized conditioning pathways, enabling structured and controllable integration of multimodal inputs, with a lightweight adapter for parameter-efficient injection of additional signals. Alongside DECO, we release DECO-50 dataset for bimanual dexterous manipulation with tactile sensing, consisting of 50 hours of data and over 5M frames, collected via teleoperation on real dual-arm robots. We train DECO on DECO-50 and conduct extensive real-world evaluation with over 2,000 robot rollouts. Experimental results show that DECO achieves the best performance across all tasks, with a 72.25% average success rate and a 21% improvement over the baseline. Moreover, the tactile adapter brings an additional 10.25% average success rate across all tasks and a 20% gain on complex contact-rich tasks while tuning less than 10% of the model parameters.
comment: 17 pages, 8 figures. Project Page: https://baai-humanoid.github.io/DECO-webpage/
Point Bridge: 3D Representations for Cross Domain Policy Learning
Robot foundation models are beginning to deliver on the promise of generalist robotic agents, yet progress remains constrained by the scarcity of large-scale real-world manipulation datasets. Simulation and synthetic data generation offer a scalable alternative, but their usefulness is limited by the visual domain gap between simulation and reality. In this work, we present Point Bridge, a framework that leverages unified, domain-agnostic point-based representations to unlock synthetic datasets for zero-shot sim-to-real policy transfer, without explicit visual or object-level alignment. Point Bridge combines automated point-based representation extraction via Vision-Language Models (VLMs), transformer-based policy learning, and efficient inference-time pipelines to train capable real-world manipulation agents using only synthetic data. With additional co-training on small sets of real demonstrations, Point Bridge further improves performance, substantially outperforming prior vision-based sim-and-real co-training methods. It achieves up to 44% gains in zero-shot sim-to-real transfer and up to 66% with limited real data across both single-task and multitask settings. Videos of the robot are best viewed at: https://pointbridge3d.github.io/
IntentCUA: Learning Intent-level Representations for Skill Abstraction and Multi-Agent Planning in Computer-Use Agents AAMAS 2026
Computer-use agents operate over long horizons under noisy perception, multi-window contexts, and evolving environment states. Existing approaches, from RL-based planners to trajectory retrieval, often drift from user intent and repeatedly solve routine subproblems, leading to error accumulation and inefficiency. We present IntentCUA, a multi-agent computer-use framework designed to stabilize long-horizon execution through intent-aligned plan memory. A Planner, Plan-Optimizer, and Critic coordinate over shared memory that abstracts raw interaction traces into multi-view intent representations and reusable skills. At runtime, intent prototypes retrieve subgroup-aligned skills and inject them into partial plans, reducing redundant re-planning and mitigating error propagation across desktop applications. In end-to-end evaluations, IntentCUA achieved a 74.83% task success rate with a Step Efficiency Ratio of 0.91, outperforming RL-based and trajectory-centric baselines. Ablations show that multi-view intent abstraction and shared plan memory jointly improve execution stability, with the cooperative multi-agent loop providing the largest gains on long-horizon tasks. These results highlight that system-level intent abstraction and memory-grounded coordination are key to reliable and efficient desktop automation in large, dynamic environments.
comment: 12 pages, 9 figures, AAMAS 2026
Development of a Deep Learning-Driven Control Framework for Exoskeleton Robots
The purpose of this study is to develop a computationally efficient deep learning based control framework for high degree of freedom exoskeleton robots to address the real time computational limitations associated with conventional model based control. A parallel structured deep neural network was designed for a seven degree of freedom human lower extremity exoskeleton robot. The network consists of four layers with 49 densely connected neurons and was trained using physics based data generated from the analytical dynamic model. During real time implementation, the trained neural network predicts joint torque commands required for trajectory tracking, while a proportional derivative controller compensates for residual prediction errors. Stability of the proposed control scheme was analytically established, and robustness to parameter variations was evaluated using analysis of variance. Comparative simulations were conducted against computed torque, model reference computed torque, sliding mode, adaptive, and linear quadratic controllers under identical robot dynamics. Results demonstrate accurate trajectory tracking with torque profiles comparable to conventional nonlinear controllers while reducing computational burden. These findings suggest that the proposed deep learning based hybrid controller offers an efficient and robust alternative for controlling multi degree of freedom exoskeleton robots.
Flow-Enabled Generalization to Human Demonstrations in Few-Shot Imitation Learning ICRA 2026
Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large amounts of demonstrations, creating significant collection costs. Prior work has investigated using flow as an intermediate representation to enable the use of human videos as a substitute, thereby reducing the amount of required robot demonstrations. However, most prior work has focused on the flow, either on the object or on specific points of the robot/hand, which cannot describe the motion of interaction. Meanwhile, relying on flow to achieve generalization to scenarios observed only in human videos remains limited, as flow alone cannot capture precise motion details. Furthermore, conditioning on scene observation to produce precise actions may cause the flow-conditioned policy to overfit to training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which includes a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts any point trajectories. FCrP follows the general flow motion and adjusts the action based on observations for precision tasks. Our method outperforms SOTA baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.
comment: Accepted to ICRA 2026
Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios
The sudden appearance of occluded pedestrians presents a critical safety challenge in autonomous driving. Conventional rule-based or purely data-driven approaches struggle with the inherent high uncertainty of these long-tail scenarios. To tackle this challenge, we propose a novel framework grounded in Active Inference, which endows the agent with a human-like, belief-driven mechanism. Our framework leverages a Rao-Blackwellized Particle Filter (RBPF) to efficiently estimate the pedestrian's hybrid state. To emulate human-like cognitive processes under uncertainty, we introduce a Conditional Belief Reset mechanism and a Hypothesis Injection technique to explicitly model beliefs about the pedestrian's multiple latent intentions. Planning is achieved via a Cross-Entropy Method (CEM) enhanced Model Predictive Path Integral (MPPI) controller, which synergizes the efficient, iterative search of CEM with the inherent robustness of MPPI. Simulation experiments demonstrate that our approach significantly reduces the collision rate compared to reactive, rule-based, and reinforcement learning (RL) baselines, while also exhibiting explainable and human-like driving behavior that reflects the agent's internal belief state.
comment: 14 pages, 6 figures, Proceedings of the 2026 ACM/IEEE International Conference on Human-Robot Interaction (HRI'26)
LEMON-Mapping: Loop-Enhanced Large-Scale Multi-Session Point Cloud Merging and Optimization for Globally Consistent Mapping
Multi-robot collaboration is becoming increasingly critical and presents significant challenges in modern robotics, especially for building a globally consistent, accurate map. Traditional multi-robot pose graph optimization (PGO) methods ensure basic global consistency but ignore the geometric structure of the map, and only use loop closures as constraints between pose nodes, leading to divergence and blurring in overlapping regions. To address this issue, we propose LEMON-Mapping, a loop-enhanced framework for large-scale, multi-session point cloud fusion and optimization. We re-examine the role of loops for multi-robot mapping and introduce three key innovations. First, we develop a robust loop processing mechanism that rejects outliers and a loop recall strategy to recover mistakenly removed but valid loops. Second, we introduce spatial bundle adjustment for multi-robot maps, reducing divergence and eliminating blurring in overlaps. Third, we design a PGO-based approach that leverages refined bundle adjustment constraints to propagate local accuracy to the entire map. We validate LEMON-Mapping on several public datasets and a self-collected dataset. The experimental results show superior mapping accuracy and global consistency of our framework compared to traditional merging methods. Scalability experiments also demonstrate its strong capability to handle scenarios involving numerous robots.
DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while simultaneously reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available on https://github.com/PKU-SEC-Lab/DYSL_VLA.
comment: DAC 2026
$\rm{A}^{\rm{SAR}}$: $\varepsilon$-Optimal Graph Search for Minimum Expected-Detection-Time Paths with Path Budget Constraints for Search and Rescue (SAR) ICRA
Searches are conducted to find missing persons and/or objects given uncertain information, imperfect observers and large search areas in Search and Rescue (SAR). In many scenarios, such as Maritime SAR, expected survival times are short and optimal search could increase the likelihood of success. This optimization problem is complex for nontrivial problems given its probabilistic nature. Stochastic optimization methods search large problems by nondeterministically sampling the space to reduce the effective size of the problem. This has been used in SAR planning to search otherwise intractably large problems but the stochastic nature provides no formal guarantees on the quality of solutions found in finite time. This paper instead presents $\rm{A}^{\rm{SAR}}$, an $\varepsilon$-optimal search algorithm for SAR planning. It calculates a heuristic to bound the search space and uses graph-search methods to find solutions that are formally guaranteed to be within a user-specified factor, $\varepsilon$, of the optimal solution. It finds better solutions faster than existing optimization approaches in operational simulations. It is also demonstrated with a real-world field trial on Lake Ontario, Canada, where it was used to locate a drifting manikin in only 150s.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2026, 8 pages, 4 figures, 2 tables. The corresponding video can be found at https://www.youtube.com/watch?v=R73-YKWY78M
Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging
Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations--not only losing their prior abilities to solve a wide variety of generalist tasks but also failing to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of a finetuned model with that of the pretrained model. We show, across extensive simulated and real-world experiments, that such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and finetuned model on out-of-distribution variations of the new task. Moreover, we show that model merging performance scales with the amount of pretraining data, and enables continual acquisition of new skills in a lifelong learning setting, without sacrificing previously learned generalist abilities.
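The merging strategy named in this abstract, interpolating the weights of a finetuned model with those of the pretrained model, can be sketched in a few lines. This is a minimal sketch under stated assumptions: parameters are represented as plain dictionaries of float lists rather than any particular framework's tensors, and the function name and the mixing coefficient `alpha` are hypothetical; the abstract does not specify how the coefficient is chosen.

```python
def interpolate_weights(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned, per parameter.

    Both arguments map parameter names to equally sized lists of floats.
    alpha = 0 recovers the pretrained model, alpha = 1 the finetuned one.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("models must share the same parameter names")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    merged = {}
    for name, base_values in pretrained.items():
        tuned_values = finetuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * base + alpha * tuned
            for base, tuned in zip(base_values, tuned_values)
        ]
    return merged


# Toy example with a single two-element parameter.
merged_example = interpolate_weights(
    {"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5
)
```

In practice the same elementwise rule would be applied to every tensor in the two checkpoints, which requires that both share an identical architecture.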
Multiagent Systems
Sharing is caring: data sharing in multi-agent supply chains
Modern supply networks are complex interconnected systems. Multi-agent models are increasingly explored to optimise their performance. Most research assumes agents will have full observability of the system by having a single policy represent the agents, which seems unrealistic as this requires companies to share their data. The alternative is to develop a Hidden-Markov Process with separate policies, making the problem challenging to solve. In this paper, we propose a multi-agent system where the factory agent can share information downstream, increasing the observability of the environment. It can choose to share no information, lie, tell the truth or combine these in a mixed strategy. The results show that data sharing can boost the performance, especially when combined with a cooperative reward shaping. In the high demand scenario there is limited ability to change the strategy and therefore no data sharing approach benefits both agents. However, lying benefits the factory enough for an overall system improvement, although only by a relatively small amount compared to the overall reward. In the low demand scenario, the most successful data sharing is telling the truth which benefits all actors significantly.
A Novel Hierarchical Multi-Agent System for Payments Using LLMs PAKDD 2026
Large language model (LLM) agents, such as OpenAI's Operator and Claude's Computer Use, can automate workflows but are unable to handle payment tasks. Existing agentic solutions have gained significant attention; however, even the latest approaches face challenges in implementing end-to-end agentic payment workflows. To address this gap, this research proposes the Hierarchical Multi-Agent System for Payments (HMASP), which provides an end-to-end agentic method for completing payment workflows. The proposed HMASP leverages either open-weight or proprietary LLMs and employs a modular architecture consisting of the Conversational Payment Agent (CPA - first agent level), Supervisor agents (second agent level), Routing agents (third agent level), and the Process summary agent (fourth agent level). The CPA serves as the central entry point, handling all external requests and coordinating subsequent tasks across hierarchical levels. HMASP incorporates architectural patterns that enable modular task execution across agents and levels for payment operations, including shared state variables, decoupled message states, and structured handoff protocols that facilitate coordination across agents and workflows. Experimental results demonstrate the feasibility of the proposed HMASP. To our knowledge, HMASP is the first LLM-based multi-agent system to implement end-to-end agentic payment workflows. This work lays a foundation for extending agentic capabilities into the payment domain.
comment: 12 pages, 1 figure, 3 tables. Accepted at PAKDD 2026
Mixed Choice in Asynchronous Multiparty Session Types
We present a multiparty session type (MST) framework with asynchronous mixed choice (MC). We propose a core construct for MC that allows transient inconsistencies in protocol state between distributed participants, but ensures all participants can always eventually reach a mutually consistent state. We prove the correctness of our system by establishing a progress property and an operational correspondence between global types and distributed local type projections. Based on our theory, we implement a practical toolchain for specifying and validating asynchronous MST protocols featuring MC, and programming compliant gen_statem processes in Erlang/OTP. We test our framework by using our toolchain to specify and reimplement part of the amqp_client of the RabbitMQ broker for Erlang.
Dynamics of Learning under User Choice: Overspecialization and Peer-Model Probing
In many economically relevant contexts where machine learning is deployed, multiple platforms obtain data from the same pool of users, each of whom selects the platform that best serves them. Prior work in this setting focuses exclusively on the "local" losses of learners on the distribution of data that they observe. We find that there exist instances where learners who use existing algorithms almost surely converge to models with arbitrarily poor global performance, even when models with low full-population loss exist. This happens through a feedback-induced mechanism, which we call the overspecialization trap: as learners optimize for users who already prefer them, they become less attractive to users outside this base, which further restricts the data they observe. Inspired by the recent use of knowledge distillation in modern ML, we propose an algorithm that allows learners to "probe" the predictions of peer models, enabling them to learn about users who do not select them. Our analysis characterizes when probing succeeds: this procedure converges almost surely to a stationary point with bounded full-population risk when probing sources are sufficiently informative, e.g., a known market leader or a majority of peers with good global performance. We verify our findings with semi-synthetic experiments on the MovieLens, Census, and Amazon Sentiment datasets.
Conservative Equilibrium Discovery in Offline Game-Theoretic Multiagent Reinforcement Learning
Offline learning of strategies takes data efficiency to its extreme by restricting algorithms to a fixed dataset of state-action trajectories. We consider the problem in a mixed-motive multiagent setting, where the goal is to solve a game under the offline learning constraint. We first frame this problem in terms of selecting among candidate equilibria. Since datasets may inform only a small fraction of game dynamics, it is generally infeasible in offline game-solving to even verify that a proposed solution is a true equilibrium. Therefore, we consider the relative probability of low regret (i.e., closeness to equilibrium) across candidates based on the information available. Specifically, we extend Policy Space Response Oracles (PSRO), an online game-solving approach, by quantifying game dynamics uncertainty and modifying the RL objective to skew towards solutions more likely to have low regret in the true game. We further propose a novel meta-strategy solver, tailored for the offline setting, to guide strategy exploration in PSRO. Our incorporation of Conservatism principles from Offline reinforcement learning approaches for strategy Exploration gives our approach its name: COffeE-PSRO. Experiments demonstrate COffeE-PSRO's ability to extract lower-regret solutions than state-of-the-art offline approaches and reveal relationships between algorithmic components, empirical game fidelity, and overall performance.
EmCoop: A Framework and Benchmark for Embodied Cooperation Among LLM Agents
Real-world scenarios increasingly require multiple embodied agents to collaborate in dynamic environments under embodied constraints, as many tasks exceed the capabilities of any single agent. Recent advances in large language models (LLMs) enable high-level cognitive coordination through reasoning, planning, and natural language communication. However, fine-grained analyses of how such collaboration emerges, unfolds, and contributes to task success in embodied multi-agent systems are difficult to conduct with existing benchmarks. In this paper, we introduce EmCoop, a benchmark framework for studying cooperation in LLM-based embodied multi-agent systems. Our framework separates a high-level cognitive layer from a low-level embodied interaction layer, allowing us to characterize agent cooperation through their interleaved dynamics over time. Given a cooperation-constrained embodied task, we propose generalizable, process-level metrics that diagnose collaboration quality and failure modes, beyond final task success. We instantiate our framework in two embodied environments that scale to arbitrary numbers of agents and support diverse communication topologies, and use these instantiations to demonstrate how EmCoop enables systematic analysis of cooperation dynamics across team sizes and task settings. The project web page can be found at: https://happyeureka.github.io/emcoop.
DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems utilize predefined workflows or agent roles in order to reduce complexity, ideally these agents would be truly autonomous, able to achieve emergent collaboration even as the number of collaborating agents increases. Yet in practice, such unstructured interactions can lead to redundant work and cascading failures that are difficult to interpret or correct. In this work, we study multi-agent systems composed of general-purpose LLM agents that operate without predefined roles, control flow, or communication constraints, relying instead on emergent collaboration to solve problems. We introduce the Dynamic Interaction Graph (DIG), which captures emergent collaboration as a time-evolving causal network of agent activations and interactions. DIG makes emergent collaboration observable and explainable for the first time, enabling real-time identification, explanation, and correction of collaboration-induced error patterns directly from agents' collaboration paths. Thus, DIG fills a critical gap in understanding how general LLM agents solve problems together in truly agentic multi-agent systems. The project webpage can be found at: https://happyeureka.github.io/dig.
Integrating LLM in Agent-Based Social Simulation: Opportunities and Challenges
This position paper examines the use of Large Language Models (LLMs) in social simulation, analyzing their potential and limitations from a computational social science perspective. We first review recent findings on LLMs' ability to replicate key aspects of human cognition, including Theory of Mind reasoning and social inference, while identifying persistent limitations such as cognitive biases, lack of grounded understanding, and behavioral inconsistencies. We then survey emerging applications of LLMs in multi-agent simulation frameworks, examining system architectures, scalability, and validation strategies. Projects such as Generative Agents (Smallville) and AgentSociety are analyzed with respect to their empirical grounding and methodological design. Particular attention is given to the challenges of behavioral fidelity, calibration, and reproducibility in large-scale LLM-driven simulations. Finally, we distinguish between contexts where LLM-based agents provide operational value, such as interactive simulations and serious games, and contexts where their use raises epistemic concerns, particularly in explanatory or predictive modeling. We argue that hybrid approaches integrating LLMs into established agent-based modeling platforms such as GAMA and NetLogo may offer a promising compromise between expressive flexibility and analytical transparency. Building on this analysis, we outline a conceptual research direction termed Hybrid Constitutional Architectures, which proposes a stratified integration of classical agent-based models (ABMs), small language models (SLMs), and LLMs within established platforms such as GAMA and NetLogo.
Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis ICLR 2026
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a flow matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for this work, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
comment: 24 pages, 3 figures. Published as a conference paper at ICLR 2026
ParamMem: Augmenting Language Agents with Parametric Reflective Memory
Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on a stronger external model, highlighting the potential of ParamMem as an effective component for enhancing language agents.
comment: 20 pages
Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Effective robotic systems for long-horizon human-robot collaboration must adapt to a wide range of human partners, whose physical behavior, willingness to assist, and understanding of the robot's capabilities may change over time. This demands a tightly coupled communication loop that grants both agents the flexibility to propose, accept, or decline requests as they coordinate toward completing the task effectively. We apply a Mixed-Initiative dialog paradigm to Collaborative human-roBot teaming and propose MICoBot, a system that handles the common scenario where both agents, using natural language, take initiative in formulating, accepting, or rejecting proposals on who can best complete different steps of a task. To handle diverse, task-directed dialog, and find successful collaborative strategies that minimize human effort, MICoBot makes decisions at three levels: (1) a meta-planner considers human dialog to formulate and code a high-level collaboration strategy, (2) a planner optimally allocates the remaining steps to either agent based on the robot's capabilities (measured by a simulation-pretrained affordance model) and the human's estimated availability to help, and (3) an action executor decides the low-level actions to perform or words to say to the human. In physical robot trials with 18 unique human participants, MICoBot significantly improves task success and user experience over a pure LLM baseline and standard agent allocation models. See additional videos and materials at https://robin-lab.cs.utexas.edu/MicoBot/.
comment: Project website at https://robin-lab.cs.utexas.edu/MicoBot/
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner's behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game-theoretic concepts like Nash, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.
City Editing: Hierarchical Agentic Execution for Dependency-Aware Urban Geospatial Modification
As cities evolve over time, challenges such as traffic congestion and functional imbalance increasingly necessitate urban renewal through efficient modification of existing plans, rather than complete re-planning. In practice, even minor urban changes require substantial manual effort to redraw geospatial layouts, slowing the iterative planning and decision-making procedure. Motivated by recent advances in agentic systems and multimodal reasoning, we formulate urban renewal as a machine-executable task that iteratively modifies existing urban plans represented in structured geospatial formats. More specifically, we represent urban layouts using GeoJSON and decompose natural-language editing instructions into hierarchical geometric intents spanning polygon-, line-, and point-level operations. To coordinate interdependent edits across spatial elements and abstraction levels, we propose a hierarchical agentic framework that jointly performs multi-level planning and execution with explicit propagation of intermediate spatial constraints. We further introduce an iterative execution-validation mechanism that mitigates error accumulation and enforces global spatial consistency during multi-step editing. Extensive experiments across diverse urban editing scenarios demonstrate significant improvements in efficiency, robustness, correctness, and spatial validity over existing baselines.
Emergent Coordination in Multi-Agent Language Models
When are multi-agent LLM systems merely a collection of individual agents versus an integrated collective with higher-order structure? We introduce an information-theoretic framework to test -- in a purely data-driven way -- whether multi-agent systems show signs of higher-order structure. This information decomposition lets us measure whether dynamical emergence is present in multi-agent LLM systems, localize it, and distinguish spurious temporal coupling from performance-relevant cross-agent synergy. We implement a practical criterion and an emergence capacity criterion operationalized as partial information decomposition of time-delayed mutual information (TDMI). We apply our framework to experiments using a simple guessing game without direct agent communication and minimal group-level feedback with three randomized interventions. Groups in the control condition exhibit strong temporal synergy but little coordinated alignment across agents. Assigning a persona to each agent introduces stable identity-linked differentiation. Combining personas with an instruction to ``think about what other agents might do'' shows identity-linked differentiation and goal-directed complementarity across agents. Taken together, our framework establishes that multi-agent LLM systems can be steered with prompt design from mere aggregates to higher-order collectives. Our results are robust across emergence measures and entropy estimators, and not explained by coordination-free baselines or temporal dynamics alone. Without attributing human-like cognition to the agents, the patterns of interaction we observe mirror well-established principles of collective intelligence in human groups: effective performance requires both alignment on shared objectives and complementary contributions across members.
Systems and Control (EESS)
Virtual Constraint for a Quadrotor UAV Enforcing a Body-Axis Pointing Direction
We propose a geometric control framework on $SE(3)$ for quadrotors that enforces pointing-driven missions without completing a full attitude reference. The mission is encoded through virtual constraints defining a task manifold and an associated set of admissible velocities, and invariance is achieved by a feedback law obtained from a linear system in selected inputs. Under a transversality condition with the effective actuation distribution, the invariance-enforcing input is uniquely defined, yielding a constructive control law and, for relevant tasks, closed-form expressions. We further derive a local off-manifold stabilization extension. As a case study, we lock a body axis to a prescribed line-of-sight direction while maintaining fixed altitude.
Observer-Based Estimation and Hydrostatic Inertia Modeling for Cooperative Transport of Variable-Inertia Loads with Quadrotors
We address load-parameter estimation in cooperative aerial transport with time-varying mass and inertia, as in fluid-carrying payloads. Using an intrinsic manifold model of the multi-quadrotor-load dynamics, we combine a geometric tracking controller with an observer for parameter identification. We estimate mass from measurable kinematics and commanded forces, and handle variable inertia via an inertia surrogate that reproduces the load's rotational dynamics for control and state propagation. Instead of real-time identification of the true inertia tensor, driven by high-dimensional internal fluid motion, we leverage known tank geometry and fluid-mechanical structure to pre-compute inertia tensors and update them through a lookup table indexed by fill level and attitude. The surrogate is justified via the incompressible Navier-Stokes equations in the translating/rotating load frame: when effective forcing is gravity-dominated (i.e., translational/rotational accelerations and especially jerk are limited), the fluid approaches hydrostatic equilibrium and the free surface is well approximated by a plane orthogonal to the body-frame gravity direction.
Curriculum-Based Soft Actor-Critic for Multi-Section R2R Tension Control
Precise tension control in roll-to-roll (R2R) manufacturing is difficult under varying operating conditions and process uncertainty. This paper presents a curriculum-based Soft Actor-Critic (SAC) controller for multi-section R2R tension control. The policy is trained in three phases with progressively wider reference ranges, from 27 to 33 N to the full operating envelope of 20 to 40 N, so it can generalize across nominal and disturbed conditions. On a three-section R2R benchmark, the learned controller achieves accurate tracking in nominal operation and handles large disturbances, including 20 N to 40 N step changes, with a single policy and no scenario-specific retuning. These results indicate that curriculum-trained SAC is a practical alternative to model-based control when system parameters vary and process uncertainty is significant.
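The three-phase curriculum described above could be sketched as follows. This is a minimal illustration, not the authors' training code: only the first range (27 to 33 N) and the final range (20 to 40 N) come from the abstract, while the intermediate range, the phase length, and the function names are assumptions.

```python
import random

# Reference-tension ranges in newtons for each curriculum phase.
# Phases 1 and 3 are from the abstract; phase 2 is an assumed intermediate range.
CURRICULUM = [(27.0, 33.0), (24.0, 36.0), (20.0, 40.0)]
EPISODES_PER_PHASE = 1000  # hypothetical phase length


def phase_for_episode(episode: int) -> int:
    """Return the curriculum phase index (0, 1, or 2) for a training episode."""
    return min(episode // EPISODES_PER_PHASE, len(CURRICULUM) - 1)


def sample_reference(episode: int, rng: random.Random) -> float:
    """Sample a tension setpoint from the range of the current phase."""
    low, high = CURRICULUM[phase_for_episode(episode)]
    return rng.uniform(low, high)
```

A SAC training loop would call `sample_reference` at the start of each episode, so that early episodes only see setpoints near nominal and later episodes cover the full envelope.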
FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System
Accurate fault detection and localization in electrical distribution systems is crucial, especially with the increasing integration of distributed energy resources (DERs), which inject greater variability and complexity into grid operations. This study proposes FaultXformer, a Transformer encoder-based architecture developed for automatic fault analysis using real-time current data obtained from phasor measurement units (PMUs). In Stage 1, the approach extracts rich temporal information from time-series current data, which is crucial for identifying the fault type and precisely determining its location across multiple nodes. In Stage 2, these extracted features are processed to differentiate among distinct fault types and identify the respective fault location within the distribution system. This dual-stage Transformer encoder pipeline enables high-fidelity representation learning, considerably boosting performance. The model was validated on a dataset generated from the IEEE 13-node test feeder, simulated with 20 separate fault locations and several DER integration scenarios, utilizing current measurements from four strategically located PMUs. For robust performance evaluation, stratified 10-fold cross-validation is performed. FaultXformer achieved average accuracies of 98.76% in fault type classification and 98.92% in fault location identification across cross-validation, consistently surpassing conventional deep learning baselines, namely convolutional neural network (CNN), recurrent neural network (RNN), and long short-term memory (LSTM), by 1.70%, 34.95%, and 2.04% in classification accuracy and by 10.82%, 40.89%, and 6.27% in location accuracy, respectively. These results demonstrate the efficacy of the proposed model with significant DER penetration.
Neural Luenberger state observer for nonautonomous nonlinear systems
This work proposes a method for model-free synthesis of a state observer for nonlinear systems with manipulated inputs, where the observer is trained offline using a historical or simulation dataset of state measurements. We use the structure of the Kazantzis-Kravaris/Luenberger (KKL) observer, extended to nonautonomous systems by adding an input-affine term to the linear time-invariant (LTI) observer-state dynamics, which determines a nonlinear injective mapping of the true states. Both this input-affine term and the nonlinear mapping from the observer states to the system states are learned from data using fully connected feedforward multi-layer perceptron neural networks. Furthermore, we theoretically prove that trained neural networks, when given new input-output data, can be used to observe the states with a guaranteed error bound. To validate the proposed observer synthesis method, case studies are performed on a bioreactor and a Williams-Otto reactor.
Data-Driven Linearization based Arc Fault Prediction in Medium Voltage Electrical Distribution System
High-impedance arc faults (HIAFs) in medium-voltage electrical distribution systems are difficult to detect due to their low fault current levels and nonlinear transient behavior. Traditional detection algorithms generally struggle with predictions under dynamic waveform scenarios. This research presents a unique data-driven linearization (DDL) framework for early prediction of HIAFs, offering both interpretability and scalability. The proposed method translates nonlinear current waveforms into a linearized space using coordinate embeddings and polynomial transformation, enabling precise modelling of fault precursors. The total duration of the test waveform is 0.5 seconds, within which the arc fault occurs between 0.2 seconds and 0.3 seconds. The DDL model, trained solely on the pre-fault healthy region (0.10 seconds to 0.18 seconds), captures certain invisible fault precursors and predicts the onset of the fault at 0.189 seconds, approximately 0.011 seconds (i.e., 11 milliseconds) earlier than the actual fault occurrence at 0.200 seconds, demonstrating substantial early warning capability. Performance evaluation comprises eigenvalue analysis, prediction error measures, error growth rate and waveform regeneration fidelity. Such early prediction shows that the model can foresee faults, which is especially helpful in preventing real-world faults and accidents. It confirms that our proposed approach reliably predicts arc faults in medium-voltage power distribution systems.
A 200 dB Dynamic Range Radiation-Hard Delta-Sigma Current Digitizer for Beam Loss Monitoring
This paper presents a radiation-hardened current-mode delta-sigma ADC fabricated in a standard 130 nm CMOS technology and qualified for total ionizing doses up to 100 Mrad. The converter is designed for beam loss monitoring applications in high-energy physics, where it must handle input currents spanning nine decades, from 1 mA down to 1 pA, while providing a fast 10 μs response time for machine protection. To meet these conflicting requirements, the architecture exploits the inherent trade-off between resolution and acquisition time: a first-order modulator sampled at 20 MHz delivers 11-bit effective resolution within the critical 10 μs window for the mA current range. Extended integration times of up to 100 s enable the sub-picoampere resolution required for beam alignment and background monitoring and provide an operational dynamic range exceeding 200 dB. The chip integrates two independent channels, consumes 25 mW from a 1.2 V supply, and includes radiation-hardening techniques such as triple-redundant digital logic and SEU-tolerant comparator banks. Post-irradiation measurements up to 100 Mrad show no performance degradation, and the uncalibrated integral nonlinearity remains within [+0.2%, -0.3%] of full scale over the 1 mA to 5 μA range. The converter's flexibility and radiation tolerance make it suitable not only for the HL-LHC beam loss monitoring upgrade but also for other precision current measurement applications in harsh environments.
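As a rough consistency check on the headline figure, the arithmetic below assumes the conventional definition of dynamic range for a current ratio. This definition is an assumption; the abstract does not state which one is used.

```latex
\mathrm{DR} = 20\log_{10}\!\left(\frac{I_{\max}}{I_{\min}}\right), \qquad
20\log_{10}\!\left(\frac{1\,\mathrm{mA}}{1\,\mathrm{pA}}\right) = 20 \times 9 = 180\ \mathrm{dB}, \qquad
20\log_{10}\!\left(\frac{1\,\mathrm{mA}}{100\,\mathrm{fA}}\right) = 20 \times 10 = 200\ \mathrm{dB}.
```

Under this definition, the nine-decade span from 1 mA to 1 pA corresponds to 180 dB, and exceeding 200 dB corresponds to resolving currents below about 100 fA, which is in line with the sub-picoampere resolution reported for long integration times.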
Physics-Embedded Neural ODEs for Learning Antagonistic Pneumatic Artificial Muscle Dynamics
Pneumatic artificial muscles (PAMs) enable compliant actuation for soft wearable, assistive, and interactive robots. When arranged antagonistically, PAMs can provide variable impedance through co-contraction but exhibit coupled, nonlinear, and hysteretic dynamics that challenge modeling and control. This paper presents a hybrid neural ordinary differential equation (Neural ODE) framework that embeds physical structure into a learned model of antagonistic PAM dynamics. The formulation combines parametric joint mechanics and pneumatic state dynamics with a neural network force component that captures antagonistic coupling and rate-dependent hysteresis. The forward model predicts joint motion and chamber pressures with a mean R$^2$ of 0.88 across 225 co-contraction conditions. An inverse formulation, derived from the learned dynamics, computes pressure commands offline for desired motion and stiffness profiles, tracked in closed loop during execution. Experimental validation demonstrates reliable stiffness control across 126-176 N/mm and consistent impedance behavior across operating velocities, in contrast to a static model, which shows degraded stiffness consistency at higher velocities.
PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents
Large language model (LLM) agents typically rely on reactive decision-making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, unstable reasoning, and high token consumption in complex long-horizon tasks involving branching, iteration, or multi-tool coordination. To address these limitations, this paper introduces PseudoAct, a novel framework for flexible planning and action control in LLM agents through pseudocode synthesis. Leveraging the ability of LLMs to express task-solving strategies as code, PseudoAct synthesizes a structured pseudocode plan that decomposes a task into subtasks and explicitly encodes control flow, including sequencing, conditionals, loops, parallel composition, and combinations of these logic primitives. Actions are then executed by following this global plan, making the decision logic explicit and temporally coherent. This design reduces redundant actions, prevents infinite loops, and avoids uninformative alternative exploration, enabling consistent and efficient long-horizon decision-making. Experiments on benchmark datasets show that our method significantly outperforms existing reactive agent approaches, achieving a 20.93% absolute gain in success rate on FEVER and setting a new state-of-the-art on HotpotQA.
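To illustrate what following a plan with explicit control flow might look like, here is a toy sketch. The plan format, the `run_plan` interpreter, the loop cap, and the tool names are all hypothetical and are not taken from PseudoAct; the bounded loop only stands in for the idea that explicit control flow can rule out infinite loops.

```python
from typing import Any, Callable, Dict, List

Tool = Callable[[Dict[str, Any]], None]

MAX_LOOP_ITERS = 5  # hypothetical cap so a loop can never run forever


def run_plan(plan: List[dict], tools: Dict[str, Tool], state: Dict[str, Any]) -> List[str]:
    """Execute nested 'call', 'if', and 'while' steps; return the trace of tool calls."""
    trace: List[str] = []
    for step in plan:
        kind = step["kind"]
        if kind == "call":
            tools[step["tool"]](state)
            trace.append(step["tool"])
        elif kind == "if":
            branch = step["then"] if step["cond"](state) else step.get("else", [])
            trace += run_plan(branch, tools, state)
        elif kind == "while":
            for _ in range(MAX_LOOP_ITERS):
                if not step["cond"](state):
                    break
                trace += run_plan(step["body"], tools, state)
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return trace


# Toy tools that only mutate shared state.
def search(state: Dict[str, Any]) -> None:
    state["hits"] = state.get("hits", 0) + 1


def answer(state: Dict[str, Any]) -> None:
    state["done"] = True


plan = [
    {"kind": "while", "cond": lambda s: s.get("hits", 0) < 2,
     "body": [{"kind": "call", "tool": "search"}]},
    {"kind": "if", "cond": lambda s: s["hits"] >= 2,
     "then": [{"kind": "call", "tool": "answer"}]},
]
trace = run_plan(plan, {"search": search, "answer": answer}, {})
```

In this toy run the trace is two `search` calls followed by one `answer` call. In a real agent the plan would be synthesized by the LLM rather than written by hand.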
MicroPush: A Simulator and Benchmark for Contact-Rich Cell Pushing and Assembly with a Magnetic Rolling Microrobot
Magnetic rolling microrobots enable gentle manipulation in confined microfluidic environments, yet autonomy for contact-rich behaviors such as cell pushing and multi-target assembly remains difficult to develop and evaluate reproducibly. We present MicroPush, an open-source simulator and benchmark suite for magnetic rolling microrobots in cluttered 2D scenes. MicroPush combines an overdamped interaction model with contact-aware stick--slip effects, lightweight near-field damping, optional Poiseuille background flow, and a calibrated mapping from actuation frequency to free-space rolling speed. On top of the simulator core, we provide a modular planning--control stack with a two-phase strategy for contact establishment and goal-directed pushing, together with a deterministic benchmark protocol with fixed tasks, staged execution, and unified CSV logging for single-object transport and hexagonal assembly. We report success, time, and tracking metrics, and an actuation-variation measure $E_{Δω}$. Results show that controller stability dominates performance under flow disturbances, while planner choice can influence command smoothness over long-horizon sequences via waypoint progression. MicroPush enables reproducible comparison and ablation of planning, control, and learning methods for microscale contact-rich micromanipulation.
comment: 13 pages, 8 figures
Verifier-Bound Communication for LLM Agents: Certified Bounds on Covert Signaling
Colluding language-model agents can hide coordination in messages that remain policy-compliant at the surface level. We present CLBC, a protocol where generation and admission are separated: a message is admitted to transcript state only if a small verifier accepts a proof-bound envelope under a pinned predicate $Π$. The predicate binds policy hash, public randomness schedule, transcript chaining, latent schema constraints, canonical metadata/tool fields, and deterministic rejection codes. We show how this protocol yields an upper bound on transcript leakage in terms of latent leakage plus explicit residual channels, derive adaptive composition guarantees, and state a semantic lower bound when policy-valid alternatives remain choosable. We report extensive empirically grounded evidence: aggregate evaluation satisfies all prespecified thresholds; strict lane decoder advantage is bounded at 0.0000 with MI proxy 0.0636; adaptive-colluder stress tests remain below attacker thresholds; and baseline separation shows large gaps between reject-by-default semantics and audit-only controls. We further quantify operational tradeoffs. Strict full-proof mode has median turn latency 27.53s (p95 28.08s), while sampled proving reduces non-proved-turn latency to 0.327ms. The central finding is that bottlenecks alone are insufficient: security claims depend on verifiable admission semantics that are online, deterministic, and fail-closed.
Vector Certificates for $ω$-regular Specifications
The recently introduced notions of ranking functions and closure certificates utilize well-foundedness arguments to facilitate the verification of dynamical systems against $ω$-regular properties. A ranking function and a closure certificate are real-valued functions defined over states and state pairs of a dynamical system whose zero superlevel sets are inductive state invariant and inductive transition invariant, respectively. The search for such certificates can be automated by fixing a specific template class, such as a polynomial of a fixed degree, and then using optimization techniques such as sum-of-squares (SOS) programming to find it. Unfortunately, such certificates may not be found for a fixed template. In such a case, one must change the template; for example, increase the degree of the polynomial. In this paper, we consider a notion of multiple functions in the form of vector certificates. Taking inspiration from the literature on vector barrier certificates as generalizations of standard barrier certificates for safety verification, we propose vector co-Büchi ranking functions and vector closure certificates as nontrivial generalizations of ranking functions and closure certificates, respectively. Both notions consist of a set of functions that jointly overapproximate an inductive invariant by considering each function to be a linear combination of the others. The advantage of such certificates is that they allow us to prove properties even when a single function for a fixed template cannot be found using standard approaches. We present an SOS programming approach to search for these functions and demonstrate the effectiveness of our proposed method in verifying $ω$-regular specifications in several case studies.
SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs
In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe (Stacked Parallelism with Adaptive Reordering), a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while maintaining near-constant computation overhead of only 2-3x, even under high redundancy where traditional replication would require linearly inflating overhead. We derive closed-form expressions for endurable failure count and computation overhead, validate them via SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize time-to-train. At extreme scale with up to 600k GPUs, SPARe reduces time-to-train by 40-50% compared to traditional replication.
Geometric Look-Angle Shaping Strategy for Enclosed Inspection
This paper introduces inspection through GLASS, a Geometric Look-Angle Shaping Strategy for enclosed regions using unmanned aerial vehicles. In doing so, the vehicle's guidance command is constructed through a bounded, geometry-consistent shaping of the look angle relative to a desired standoff path. By embedding a smooth, hyperbolic-tangent-type shaping function within a polar geometric framework, GLASS ensures global existence of the guidance dynamics. It avoids the far-field limitations inherent to conventional formulations. Lyapunov stability analysis establishes asymptotic convergence to a prescribed inspection standoff under explicit curvature feasibility conditions, along with analytical settling-time characteristics. The proposed strategy incorporates maximum turn-rate constraints without inducing singularities throughout the workspace. High-fidelity six-degree-of-freedom quadrotor simulations demonstrate the effectiveness of GLASS in representative enclosed inspection scenarios, highlighting a practically viable guidance framework for autonomous enclosed inspection missions.
comment: Preprint submitted to ICUAS 2026
Smart Prism with Tilt Compensation for CAN bus on Mobile Machinery Using Robotic Total Stations
Accurate reference trajectories are required to validate autonomous agricultural robots and highly automated off-road vehicles under real-world field conditions. In practice, robotic total stations provide millimeter-level prism center coordinates, but the point of interest on the vehicle is typically displaced by a lever arm, ranging from decimeters to multiple meters. Roll and pitch motions, as typically observed in off-road machinery, therefore introduce horizontal point of interest errors far exceeding the measurement accuracy of robotic total station observations. This paper presents the design, implementation, and validation of a Smart Prism prototype that augments a robotic total station prism with an inertial measurement unit to enable real-time tilt compensation. The prototype integrates an STM32H7 microcontroller and a Murata SCH16T-series IMU and estimates roll and pitch angles using an adaptive complementary filter. The tilt-compensated point of interest coordinates are obtained by transforming a calibrated lever arm from the body frame into the navigation frame and combining it with robotic total station prism positions. To support vehicle-side integration, the system can transmit prism and tilt-compensated point of interest coordinates on the Controller Area Network bus, allowing the point of interest to be treated as a virtual position sensor (e.g., co-located with a rear-axle reference point). Experiments with a fixed ground reference point, using a prism-to-point-of-interest lever arm of approximately 1.07 m and manual roll/pitch excursions of up to 60 deg, yield three-dimensional root-mean-square errors between 2.9 mm and 23.6 mm across five test series. The results demonstrate that IMU-based tilt compensation enables reference measurements suitable for validating centimeter-level navigation systems under dynamic field conditions.
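The lever-arm transformation at the core of the tilt compensation can be sketched as below. This is a minimal illustration under stated assumptions rather than the prototype's firmware: the ZYX Euler convention, a z-up body frame, an externally supplied yaw angle (defaulting to zero), and all function names are assumptions.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def rotate_body_to_nav(v: Sequence[float], roll: float, pitch: float, yaw: float) -> Vec3:
    """Rotate a body-frame vector into the navigation frame (R = Rz * Ry * Rx, radians)."""
    x, y, z = v
    # Roll about the x-axis.
    cr, sr = math.cos(roll), math.sin(roll)
    y, z = cr * y - sr * z, sr * y + cr * z
    # Pitch about the y-axis.
    cp, sp = math.cos(pitch), math.sin(pitch)
    x, z = cp * x + sp * z, -sp * x + cp * z
    # Yaw about the z-axis.
    cy, sy = math.cos(yaw), math.sin(yaw)
    x, y = cy * x - sy * y, sy * x + cy * y
    return (x, y, z)


def poi_position(prism: Sequence[float], lever_arm_body: Sequence[float],
                 roll: float, pitch: float, yaw: float = 0.0) -> Vec3:
    """Point of interest = prism position + rotated lever arm."""
    dx, dy, dz = rotate_body_to_nav(lever_arm_body, roll, pitch, yaw)
    return (prism[0] + dx, prism[1] + dy, prism[2] + dz)


# Point of interest 1.07 m below the prism along the body z-axis (assumed z-up).
lever = (0.0, 0.0, -1.07)
prism = (10.0, 20.0, 5.0)
upright = poi_position(prism, lever, 0.0, 0.0)
tilted = poi_position(prism, lever, math.radians(10.0), 0.0)
horizontal_shift = math.hypot(tilted[0] - upright[0], tilted[1] - upright[1])
```

With these assumed values, a 10 degree roll moves the point of interest horizontally by `1.07 * sin(10 deg)`, which is well above millimeter-level total station accuracy and shows why uncompensated tilt dominates the error budget.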
Modeling PWM-Time-SOC Interaction in a Simulated Robot
Accurate prediction of battery state of charge (SOC) is needed for autonomous robots to plan movements without using up all available power. This work develops a physics- and data-informed model from a simulation that predicts SOC depletion as a function of time and PWM duty cycle for a simulated 4-wheel Arduino robot. A forward-motion simulation incorporating motor electrical characteristics (resistance, inductance, back-EMF, torque constant) and mechanical dynamics (mass, drag, rolling resistance, wheel radius) was used to generate SOC time-series data across PWM values from 1% to 100%. Sparse Identification of Nonlinear Dynamics (SINDy), combined with least-squares regression, was applied to construct a unified nonlinear model that captures SOC(t, p). The framework allows for energy-aware planning for similar robots and can be extended to incorporate arbitrary initial SOC levels and environment-dependent parameters for real-world deployment.
Robust Adaptive MPC Under Nonlinear Time-Varying Uncertainties: An Uncertainty Compensation Approach
This paper introduces an uncertainty compensation-based robust adaptive model predictive control (MPC) framework for linear systems with nonlinear time-varying uncertainties. The framework integrates an L1 adaptive controller to compensate for the matched uncertainty and a robust feedback controller, designed using linear matrix inequalities, to mitigate the effect of unmatched uncertainty on target output channels. Uniform bounds on the errors between the system's states and control inputs and those of a nominal (i.e., uncertainty-free) system are derived. These error bounds are then used to tighten the actual system's state and input constraints, enabling the design of an MPC for the nominal system under these tightened constraints. Referred to as uncertainty compensation-based MPC (UC-MPC), this approach ensures constraint satisfaction while delivering enhanced performance compared to existing methods. Simulation results for a flight control example and a spacecraft landing on an asteroid demonstrate the effectiveness of the proposed framework.
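The constraint-tightening step can be illustrated for the simplest case of box constraints. This sketch assumes per-channel interval constraints and a per-channel uniform error bound, neither of which is specified in the abstract; `tighten_box` is a hypothetical helper name.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]


def tighten_box(bounds: Box, error_bound: List[float]) -> Box:
    """Shrink each interval [lo, hi] by the uniform error bound on that channel.

    If the nominal signal stays in the tightened interval and the actual signal
    deviates from it by at most the error bound, the actual signal stays in [lo, hi].
    """
    tightened = []
    for (lo, hi), rho in zip(bounds, error_bound):
        if rho < 0:
            raise ValueError("error bounds must be non-negative")
        if lo + rho > hi - rho:
            raise ValueError("error bound too large: tightened set is empty")
        tightened.append((lo + rho, hi - rho))
    return tightened


state_box = [(-1.0, 1.0), (-0.5, 0.5)]
tight = tighten_box(state_box, [0.1, 0.05])
```

The nominal MPC would then be solved against `tight` instead of `state_box`; smaller error bounds leave more room for the nominal controller, which is where uncertainty compensation pays off.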
GENAI WORKBENCH: AI-Assisted Analysis and Synthesis of Engineering Systems from Multimodal Engineering Data
Modern engineering design platforms excel at discipline-specific tasks such as CAD, CAM, and CAE, but often lack native systems engineering frameworks. This creates a disconnect where system-level requirements and architectures are managed separately from detailed component design, hindering holistic development and increasing integration risks. To address this, we present the conceptual framework for the GenAI Workbench, a Model-Based Systems Engineering (MBSE) environment that integrates systems engineering principles into the designer's workflow. Built on an open-source PLM platform, it establishes a unified digital thread by linking semantic data from documents, physical B-rep geometry, and relational system graphs. The workbench facilitates an AI-assisted workflow where a designer can ingest source documents, from which the system automatically extracts requirements and uses vision-language models to generate an initial system architecture, such as a Design Structure Matrix (DSM). This paper presents the conceptual architecture, proposed methodology, and anticipated impact of this work-in-progress framework, which aims to foster a more integrated, data-driven, and informed engineering design methodology.
comment: 7 pages, 3 figures, accepted to be presented at IISE Annual Conference 2026
The PenduMAV: A Six-Input Omnidirectional MAV without Internal Forces -- Design, Dynamics, and SE(3) Control
We introduce the PenduMAV, an exactly actuated (6-input) omnidirectional multirotor that structurally eliminates internal forces at equilibria. The vehicle features one actively-tilting propeller and three propellers mounted on passive pendulum links via universal joints. This architecture achieves full 6D wrench generation while avoiding the structural and energetic costs of input redundancy and internal forces. After deriving the full multibody dynamics, we demonstrate that a forced equilibrium exists for every main platform pose. To asymptotically stabilize the closed-loop system, we design a coordinate-invariant nonlinear controller based on dynamic feedback linearization and backstepping, utilizing the left-trivialized error on SE(3). System stability is formally guaranteed through Lyapunov analysis of the zero dynamics. Finally, Gazebo simulations (videos available at https://www.youtube.com/playlist?list=PL4N8pJgvqASQX6AWEpg3NCZ6QdGBPfbXq) validate the approach, showcasing fully decoupled attitude and translational tracking under parametric uncertainty and actuator noise.
Mixed formulation and structure-preserving discretization of Cosserat rod dynamics in a port-Hamiltonian framework
An energy-based modeling framework for the nonlinear dynamics of spatial Cosserat rods undergoing large displacements and rotations is proposed. The mixed formulation features independent displacement, velocity and stress variables and is further objective and locking-free. Finite rotations are represented using a director formulation that avoids singularities and yields a constant mass matrix. This results in an infinite-dimensional nonlinear port-Hamiltonian (PH) system governed by partial differential-algebraic equations with a quadratic energy functional. Using a time-differentiated compliance form of the stress-strain relations allows for the imposition of kinematic constraints, such as inextensibility or shear-rigidity. A structure-preserving finite element discretization leads to a finite-dimensional system with PH structure, thus facilitating the design of an energy-momentum consistent integration scheme. Dissipative material behavior (via the generalized-Maxwell model) and non-standard actuation approaches (via pneumatic chambers or tendons) integrate naturally into the framework. As illustrated by selected numerical examples, the present framework establishes a new approach to energy-momentum consistent formulations in computational mechanics involving finite rotations.
comment: 39 pages, 16 figures
Decentralized Parametric Stability Certificates for Grid-Forming Converter Control
We propose a decentralized framework to analytically guarantee the small-signal stability of future power systems with grid-forming converters. Our approach leverages dynamic loop-shifting techniques to compensate for the lack of passivity in the network dynamics and establishes decentralized parametric stability certificates, depending on the local device-level controls and incorporating the effects of the network. By following practical tuning rules, we are able to ensure plug-and-play operation without centralized coordination. Unlike prior works, our approach accommodates coupled frequency and voltage dynamics, incorporates network dynamics, and does not rely on specific network configurations or operating points, offering a general and scalable solution for the integration of power-electronics-based devices into future power systems. We validate our theoretical stability results through numerical case studies in a high-fidelity simulation model.
comment: 14 pages, 17 figures
Instantaneous Complex Phase and Frequency: Conceptual Clarification and Equivalence between Formulations
This letter seeks to clarify the different existing definitions of both instantaneous complex phase and frequency as well as their equivalence under standard modeling assumptions considered for transmission systems, i.e. balanced positive sequence operation, sole presence of electro-mechanical transient dynamics and absence of harmonics and interharmonics. To achieve this, the two fundamental definitions, i.e., those based on either the use of (i) analytic signals or (ii) space vectors, together with the premises used for their formulation, are presented and their relationship shown. Lastly, a unified notation and terminology to avoid confusion is proposed.
Model Predictive Control with Reference Learning for Soft Robotic Intracranial Pressure Waveform Modulation
This paper introduces a learning-based control framework for a soft robotic actuator system designed to modulate intracranial pressure (ICP) waveforms, which is essential for studying cerebrospinal fluid dynamics and pathological processes underlying neurological disorders. A two-layer framework is proposed to safely achieve a desired ICP waveform modulation. First, a model predictive controller (MPC) with a disturbance observer is used for offset-free tracking of the system's motor position reference trajectory under safety constraints. Second, to address the unknown nonlinear dependence of ICP on the motor position, we employ a Bayesian optimization (BO) algorithm for online learning of a motor position reference trajectory that yields the desired ICP modulation. The framework is experimentally validated using a test bench with a brain phantom that replicates realistic ICP dynamics in vitro. Compared to a previously employed proportional-integral-derivative controller, the MPC reduces mean and maximum motor position reference tracking errors by 83% and 73%, respectively. In less than 20 iterations, the BO algorithm learns a motor position reference trajectory that yields an ICP waveform with the desired mean and amplitude.
GenAI-Net: A Generative AI Framework for Automated Biomolecular Network Design
Biomolecular networks underpin emerging technologies in synthetic biology (from robust biomanufacturing and metabolic engineering to smart therapeutics and cell-based diagnostics) and also provide a mechanistic language for understanding complex dynamics in natural and ecological systems. Yet designing chemical reaction networks (CRNs) that implement a desired dynamical function remains largely manual: while a proposed network can be checked by simulation, the reverse problem of discovering a network from a behavioral specification is difficult, requiring substantial human insight to navigate a vast space of topologies and kinetic parameters with nonlinear and possibly stochastic dynamics. Here we introduce GenAI-Net, a generative AI framework that automates CRN design by coupling an agent that proposes reactions to simulation-based evaluation defined by a user-specified objective. GenAI-Net efficiently produces novel, topologically diverse solutions across multiple design tasks, including dose responses, complex logic gates, classifiers, oscillators, and robust perfect adaptation in deterministic and stochastic settings (including noise reduction). By turning specifications into families of circuit candidates and reusable motifs, GenAI-Net provides a general route to programmable biomolecular circuit design and accelerates the translation from desired function to implementable mechanisms.
The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective ICLR 2026
We study the sample complexity of online reinforcement learning in the general non-episodic setting of nonlinear dynamical systems with continuous state and action spaces. Our analysis accommodates a large class of dynamical systems ranging from a finite set of nonlinear candidate models to models with bounded and Lipschitz continuous dynamics, to systems that are parametrized by a compact and real-valued set of parameters. In the most general setting, our algorithm achieves a policy regret of $\mathcal{O}(N ε^2 + d_\mathrm{u}\mathrm{ln}(m(ε))/ε^2)$, where $N$ is the time horizon, $ε$ is a user-specified discretization width, $d_\mathrm{u}$ the input dimension, and $m(ε)$ measures the complexity of the function class under consideration via its packing number. In the special case where the dynamics are parametrized by a compact and real-valued set of parameters (such as neural networks, transformers, etc.), we prove a policy regret of $\mathcal{O}(\sqrt{d_\mathrm{u}N p})$, where $p$ denotes the number of parameters, recovering earlier sample-complexity results that were derived for linear time-invariant dynamical systems. While this article focuses on characterizing sample complexity, the proposed algorithms are likely to be useful in practice, due to their simplicity, their ability to incorporate prior knowledge, and their benign transient behaviors.
comment: accepted at ICLR 2026; 37 pages, 6 figures
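A heuristic way to read the general regret bound is to balance its two terms in the discretization width. The calculation below treats $m(\varepsilon)$ as fixed while choosing $\varepsilon$, which is a simplification for intuition and not the paper's argument.

```latex
N\varepsilon^2 = \frac{d_\mathrm{u}\ln m(\varepsilon)}{\varepsilon^2}
\;\Longrightarrow\;
\varepsilon^2 = \sqrt{\frac{d_\mathrm{u}\ln m(\varepsilon)}{N}}
\;\Longrightarrow\;
N\varepsilon^2 + \frac{d_\mathrm{u}\ln m(\varepsilon)}{\varepsilon^2}
= 2\sqrt{N\, d_\mathrm{u}\ln m(\varepsilon)}.
```

If $\ln m(\varepsilon)$ grows roughly linearly in the number of parameters $p$, as is typical for packing numbers of compact $p$-dimensional sets, the balanced bound has the same $\sqrt{d_\mathrm{u} N p}$ shape as the parametric result, up to logarithmic factors in $\varepsilon$.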
Federated Nonlinear System Identification
We consider federated learning of linearly-parameterized nonlinear systems. We establish theoretical guarantees on the effectiveness of federated nonlinear system identification compared to centralized approaches, demonstrating that the convergence rate improves as the number of clients increases. Although the convergence rates in the linear and nonlinear cases differ only by a constant, this constant depends on the feature map $φ$, which can be carefully chosen in the nonlinear setting to increase excitation and improve performance. We experimentally validate our theory in physical settings where client devices are driven by i.i.d. control inputs and control policies exhibiting i.i.d. random perturbations, ensuring non-active exploration. Experiments use trajectories from nonlinear dynamical systems characterized by real-analytic feature functions, including polynomial and trigonometric components, representative of physical systems including pendulum and quadrotor dynamics. We analyze the convergence behavior of the proposed method under varying noise levels and data distributions. Results show that federated learning consistently improves convergence of any individual client as the number of participating clients increases.
comment: 8 pages. Accepted at ACC 2026
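The benefit of pooling data across clients in a linearly-parameterized model can be illustrated with a small simulation. The sketch below is not the paper's algorithm: the abstract does not state the aggregation rule, so the code assumes each client shares least-squares sufficient statistics and the server solves the pooled normal equations. The scalar system, the feature map `features`, the noise level, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def features(x, u):
    """Illustrative feature map phi(x, u) with a trigonometric component."""
    return np.array([x, np.sin(x), u])

def simulate_client(theta, steps, rng, noise_std=0.1):
    """Roll out x_{t+1} = theta @ phi(x_t, u_t) + w_t under i.i.d. inputs.

    Returns the client's sufficient statistics: the Gram matrix
    sum(phi phi^T) and the cross term sum(phi * x_next).
    """
    x = 0.0
    gram = np.zeros((3, 3))
    cross = np.zeros(3)
    for _ in range(steps):
        u = rng.normal()                      # i.i.d. control input
        phi = features(x, u)
        x_next = theta @ phi + noise_std * rng.normal()
        gram += np.outer(phi, phi)
        cross += phi * x_next
        x = x_next
    return gram, cross

def federated_estimate(stats):
    """Server step (assumed): sum client statistics, solve pooled least squares."""
    gram = sum(g for g, _ in stats)
    cross = sum(c for _, c in stats)
    return np.linalg.solve(gram, cross)
```

For example, with a hypothetical true parameter `theta = np.array([0.5, 0.3, 1.0])`, calling `federated_estimate([simulate_client(theta, 200, rng) for _ in range(10)])` pools ten clients; passing a single client's statistics gives the local estimate for comparison.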
Robust Capacity Expansion Modelling for Renewable Energy Systems
Future greenhouse-gas-neutral energy systems will be dominated by renewable energy technologies providing variable supply subject to uncertain weather conditions. For this setting, we propose an algorithm for capacity expansion planning: we evaluate solutions optimised on a single year's data under different input weather years, and iteratively modify solutions whenever supply gaps are detected. These modifications lead to solutions with sufficient capacities to overcome periods of cold dark lulls and seasonal demand/supply fluctuations. A computational study on a German energy system model for 40 operating years shows that preventing supply gaps, i.e. finding a robust system, increases the total annual cost by 1.6-2.9%. In comparison, non-robust systems display loss of load close to 50% of total demand during some periods. Results underline the importance of assessing the feasibility of energy system models using atypical time-series, combining dark lull and cold period effects.
Joint Estimation of Sea State and Vessel Parameters Using a Mass-Spring-Damper Equivalence Model
Real-time sea state estimation is vital for applications like shipbuilding and maritime safety. Traditional methods rely on accurate wave-vessel transfer functions to estimate wave spectra from onboard sensors. In contrast, our approach jointly estimates sea state and vessel parameters without needing prior transfer function knowledge, which may be unavailable or variable. We model the wave-vessel system using pseudo mass-spring-dampers and develop a dynamic model for the system. This method allows for recursive modeling of wave excitation as a time-varying input, relaxing prior works' assumption of a constant input. We derive statistically consistent process noise covariance and implement a square root cubature Kalman filter for sensor data fusion. Further, we derive the Posterior Cramer-Rao lower bound to evaluate estimator performance. Extensive Monte Carlo simulations and data from a high-fidelity validated simulator confirm that the estimated wave spectrum matches methods assuming complete transfer function knowledge.
comment: Accepted to journal, Signal Processing
Generalized Momenta-Based Koopman Formalism for Robust Control of Euler-Lagrangian Systems
This paper presents a novel Koopman operator formulation for Euler-Lagrangian dynamics that employs an implicit generalized momentum-based state space representation, which decouples a known linear actuation channel from state-dependent dynamics and makes the system more amenable to linear Koopman modeling. By leveraging this structural separation, the proposed formulation requires learning only the unactuated dynamics rather than the complete actuation-dependent system, thereby significantly reducing the number of learnable parameters, improving data efficiency, and lowering overall model complexity. In contrast, conventional explicit formulations inherently couple inputs with the state-dependent terms in a nonlinear manner, making them more suitable for bilinear Koopman models, which are more computationally expensive to train and deploy. Notably, the proposed scheme enables the formulation of linear models that achieve superior prediction performance compared to conventional bilinear models while remaining substantially more efficient. To realize this framework, we present two neural network architectures that construct Koopman embeddings from actuated or unactuated data, enabling flexible and efficient modeling across different tasks. Robustness is ensured through the integration of a linear Generalized Extended State Observer (GESO), which explicitly estimates disturbances and compensates for them in real time. The combined momentum-based Koopman and GESO framework is validated through comprehensive trajectory tracking simulations and experiments on robotic manipulators, demonstrating superior accuracy, robustness, and learning efficiency relative to state-of-the-art alternatives.
Minimal Construction of Graphs with Maximum Robustness
The notions of $r$-robustness and $(r,s)$-robustness of a network were introduced earlier in the literature to achieve resilient consensus in the presence of misbehaving agents. However, while higher robustness levels enable networks to tolerate a higher number of misbehaving agents, they also require dense communication structures, which are not always desirable for systems with limited communication ranges, energy, and resources. Therefore, this paper studies the fundamental structures behind $r$-robustness and $(r,s)$-robustness properties in two ways. (a) We first establish tight necessary conditions on the number of edges that an undirected graph with an arbitrary number of nodes must have to achieve maximum $r$- and $(r,s)$-robustness. (b) We then use these conditions to construct two classes of undirected graphs, referred to as $γ$- and $(γ,γ)$-Minimal Edge Robust Graphs (MERGs), that provably achieve maximum robustness with minimal numbers of edges. We demonstrate the effectiveness of our method via comparison against existing robust graph structures and a set of simulations.
comment: 13 pages, 7 figures, under revision at IEEE Transactions on Automatic Control
Development of a Deep Learning-Driven Control Framework for Exoskeleton Robots
The purpose of this study is to develop a computationally efficient deep learning based control framework for high degree of freedom exoskeleton robots to address the real time computational limitations associated with conventional model based control. A parallel structured deep neural network was designed for a seven degree of freedom human lower extremity exoskeleton robot. The network consists of four layers with 49 densely connected neurons and was trained using physics based data generated from the analytical dynamic model. During real time implementation, the trained neural network predicts joint torque commands required for trajectory tracking, while a proportional derivative controller compensates for residual prediction errors. Stability of the proposed control scheme was analytically established, and robustness to parameter variations was evaluated using analysis of variance. Comparative simulations were conducted against computed torque, model reference computed torque, sliding mode, adaptive, and linear quadratic controllers under identical robot dynamics. Results demonstrate accurate trajectory tracking with torque profiles comparable to conventional nonlinear controllers while reducing computational burden. These findings suggest that the proposed deep learning based hybrid controller offers an efficient and robust alternative for controlling multi degree of freedom exoskeleton robots.
Provably Safe Generative Sampling with Constricting Barrier Functions
Flow-based generative models, such as diffusion models and flow matching models, have achieved remarkable success in learning complex data distributions. However, a critical gap remains for their deployment in safety-critical domains: the lack of formal guarantees that generated samples will satisfy hard constraints. We address this by proposing a safety filtering framework that acts as an online shield for any pre-trained generative model. Our key insight is to cooperate with the generative process rather than override it. We define a constricting safety tube that is relaxed at the initial noise distribution and progressively tightens to the target safe set at the final data distribution, mirroring the coarse-to-fine structure of the generative process itself. By characterizing this tube via Control Barrier Functions (CBFs), we synthesize a feedback control input through a convex Quadratic Program (QP) at each sampling step. As the tube is loosest when noise is high and intervention is cheapest in terms of control energy, most constraint enforcement occurs when it least disrupts the model's learned structure. We prove that this mechanism guarantees safe sampling while minimizing the distributional shift from the original model at each sampling step, as quantified by the KL divergence. Our framework applies to any pre-trained flow-based generative scheme requiring no retraining or architectural modifications. We validate the approach across constrained image generation, physically-consistent trajectory sampling, and safe robotic manipulation policies, achieving 100% constraint satisfaction while preserving semantic fidelity.
comment: 21 pages, 7 figures
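The idea of a tube that starts loose and tightens, enforced by a minimum-norm correction at each sampling step, can be sketched in a toy 2D setting. This is not the paper's implementation. The code assumes a ball-shaped safe set of radius `radius`, a tube margin `c0 * (1 - t)` that shrinks linearly to zero, a user-supplied `velocity` function standing in for the pre-trained model's flow, and plain Euler integration; all of these names and choices are hypothetical. With a single affine constraint, the QP has the closed-form solution used in `min_norm_correction`.

```python
import numpy as np

def min_norm_correction(a, b):
    """Closed-form solution of: minimize ||u||^2 subject to a @ u >= b."""
    if b <= 0.0:
        return np.zeros_like(a)       # u = 0 already satisfies the constraint
    return a * (b / (a @ a))          # project onto the constraint boundary

def safe_sample(x0, velocity, radius=1.0, c0=8.0, alpha=10.0, steps=1000):
    """Euler-integrate dx/dt = velocity(x, t) + u(x, t) from t = 0 to t = 1.

    Tube barrier (assumed form): h_t(x) = radius^2 - ||x||^2 + c0 * (1 - t),
    which is loose at t = 0 and equals the safe-set barrier at t = 1.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        h_t = radius**2 - x @ x + c0 * (1.0 - t)
        grad = -2.0 * x               # gradient of h_t with respect to x
        # CBF condition: grad @ (v + u) - c0 >= -alpha * h_t
        b = -alpha * h_t - grad @ v + c0
        u = min_norm_correction(grad, b)
        x = x + dt * (v + u)
    return x
```

Because the condition is derived in continuous time and integrated with Euler steps, the final sample can sit marginally outside the safe set; a smaller step size or a small safety margin on `radius` would tighten this in practice.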
Newton Methods in Generalized Nash Equilibrium Problems with Applications to Game-Theoretic Model Predictive Control
We prove input-to-state stability (ISS) of perturbed Newton-type methods for generalized equations arising from Nash equilibrium (NE) and generalized NE (GNE) problems. This ISS property allows the use of inexact computations in equilibrium-seeking to enable fast solution tracking in dynamic systems such as in model predictive control (MPC). For NE problems, we address the local convergence of perturbed Josephy-Newton methods from the variational inequality (VI) stability analysis, and establish the ISS result under less restrictive regularity conditions compared to the existing results established for nonlinear optimization. Agent-distributed algorithms are also developed. For GNE problems, since they cannot be reduced to VI problems in general, we use semismooth Newton methods to solve the semismooth equations arising from the Karush-Kuhn-Tucker (KKT) systems of the GNE problem and establish the ISS result under a quasi-regularity condition. To illustrate the use of the ISS in dynamic systems, applications to constrained game-theoretic MPC (CG-MPC) are studied with time-distributed solution-tracking for real-time implementation. Boundedness of tracking errors is proven. Numerical examples are reported.
Learning Mixtures of Linear Dynamical Systems via Hybrid Tensor-EM Method
Mixtures of linear dynamical systems (MoLDS) provide a path to model time-series data that exhibit diverse temporal dynamics across trajectories. However, their application remains challenging in complex and noisy settings, limiting their effectiveness for neural data analysis. Tensor-based moment methods can provide global identifiability guarantees for MoLDS, but their performance degrades under noise and complexity. Commonly used expectation-maximization (EM) methods offer flexibility in fitting latent models but are highly sensitive to initialization and prone to poor local minima. Here, we propose a tensor-based method that provides identifiability guarantees for learning MoLDS, which is followed by EM updates to combine the strengths of both approaches. The novelty in our approach lies in the construction of moment tensors using the input-output data to recover globally consistent estimates of mixture weights and system parameters. These estimates can then be refined through a Kalman EM algorithm, with closed-form updates for all LDS parameters. We validate our framework on synthetic benchmarks and real-world datasets. On synthetic data, the proposed Tensor-EM method achieves more reliable recovery and improved robustness compared to either pure tensor or randomly initialized EM methods. We then analyze neural recordings from the primate somatosensory cortex while a non-human primate performs reaches in different directions. Our method successfully models and clusters different conditions as separate subsystems, consistent with supervised single-LDS fits for each condition. Finally, we apply this approach to another neural dataset where monkeys perform a sequential reaching task. These results demonstrate that MoLDS provides an effective framework for modeling complex neural data, and that Tensor-EM is a reliable approach to MoLDS learning for these applications.
comment: 24 pages, 14 figures
Robotics
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Interface-Aware Trajectory Reconstruction of Limited Demonstrations for Robot Learning
Assistive robots offer agency to humans with severe motor impairments. Often, these users control high-DoF robots through low-dimensional interfaces, such as using a 1-D sip-and-puff interface to operate a 6-DoF robotic arm. This mismatch results in having access to only a subset of control dimensions at a given time, imposing unintended and artificial constraints on robot motion. As a result, interface-limited demonstrations embed suboptimal motions that reflect interface restrictions rather than user intent. To address this, we present a trajectory reconstruction algorithm that reasons about task, environment, and interface constraints to lift demonstrations into the robot's full control space. We evaluate our approach using real-world demonstrations of ADL-inspired tasks performed via a 2-D joystick and 1-D sip-and-puff control interface, teleoperating two distinct 7-DoF robotic arms. Analyses of the reconstructed demonstrations and derived control policies show that lifted trajectories are faster and more efficient than their interface-constrained counterparts while respecting user preferences.
comment: 13 pages, 8 figures, to appear in the proceedings of the 2026 Human-Robot Interaction (HRI) Conference
Simple Models, Real Swimming: Digital Twins for Tendon-Driven Underwater Robots
Mimicking the graceful motion of swimming animals remains a core challenge in soft robotics due to the complexity of fluid-structure interaction and the difficulty of controlling soft, biomimetic bodies. Existing modeling approaches are often computationally expensive and impractical for complex control or reinforcement learning needed for realistic motions to emerge in robotic systems. In this work, we present a tendon-driven fish robot modeled in an efficient underwater swimmer environment using a simplified, stateless hydrodynamics formulation implemented in the widespread robotics framework MuJoCo. With just two real-world swimming trajectories, we identify five fluid parameters that allow a matching to experimental behavior and generalize across a range of actuation frequencies. We show that this stateless fluid model can generalize to unseen actuation and outperform classical analytical models such as the elongated body theory. This simulation environment runs faster than real-time and can easily enable downstream learning algorithms such as reinforcement learning for target tracking, reaching a 93% success rate. Due to the simplicity and ease of use of the model and our open-source simulation environment, our results show that even simple, stateless models -- when carefully matched to physical data -- can serve as effective digital twins for soft underwater robots, opening up new directions for scalable learning and control in aquatic environments.
Physics Informed Viscous Value Representations
Offline goal-conditioned reinforcement learning (GCRL) learns goal-conditioned policies from static pre-collected datasets. However, accurate value estimation remains a challenge due to the limited coverage of the state-action space. Recent physics-informed approaches have sought to address this by imposing physical and geometric constraints on the value function through regularization defined over first-order partial differential equations (PDEs), such as the Eikonal equation. However, these formulations can often be ill-posed in complex, high-dimensional environments. In this work, we propose a physics-informed regularization derived from the viscosity solution of the Hamilton-Jacobi-Bellman (HJB) equation. By providing a physics-based inductive bias, our approach grounds the learning process in optimal control theory, explicitly regularizing and bounding updates during value iterations. Furthermore, we leverage the Feynman-Kac theorem to recast the PDE solution as an expectation, enabling a tractable Monte Carlo estimation of the objective that avoids numerical instability in higher-order gradients. Experiments demonstrate that our method improves geometric consistency, making it broadly applicable to navigation and high-dimensional, complex manipulation tasks. Open-source codes are available at https://github.com/HrishikeshVish/phys-fk-value-GCRL.
Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
SPARR: Simulation-based Policies with Asymmetric Real-world Residuals for Assembly
Robotic assembly presents a long-standing challenge due to its requirement for precise, contact-rich manipulation. While simulation-based learning has enabled the development of robust assembly policies, their performance often degrades when deployed in real-world settings due to the sim-to-real gap. Conversely, real-world reinforcement learning (RL) methods avoid the sim-to-real gap, but rely heavily on human supervision and lack generalization ability to environmental changes. In this work, we propose a hybrid approach that combines a simulation-trained base policy with a real-world residual policy to efficiently adapt to real-world variations. The base policy, trained in simulation using low-level state observations and dense rewards, provides strong priors for initial behavior. The residual policy, learned in the real world using visual observations and sparse rewards, compensates for discrepancies in dynamics and sensor noise. Extensive real-world experiments demonstrate that our method, SPARR, achieves near-perfect success rates across diverse two-part assembly tasks. Compared to the state-of-the-art zero-shot sim-to-real methods, SPARR improves success rates by 38.4% while reducing cycle time by 29.7%. Moreover, SPARR requires no human expertise, in contrast to the state-of-the-art real-world RL approaches that depend heavily on human supervision.
UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception
We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric-scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
Grasp, Slide, Roll: Comparative Analysis of Contact Modes for Tactile-Based Shape Reconstruction ICRA 2026
Tactile sensing allows robots to gather detailed geometric information about objects through physical interaction, complementing vision-based approaches. However, efficiently acquiring useful tactile data remains challenging due to the time-consuming nature of physical contact and the need to strategically choose contact locations that maximize information gain while minimizing physical interactions. This paper studies how different contact modes affect object shape reconstruction using a tactile-enabled dexterous gripper. We compare three contact interaction modes: grasp-releasing, sliding induced by finger-grazing, and palm-rolling. These contact modes are combined with an information-theoretic exploration framework that guides subsequent sampling locations using a shape completion model. Our results show that the improved tactile sensing efficiency of finger-grazing and palm-rolling translates into faster convergence in shape reconstruction, requiring 34% fewer physical interactions while improving reconstruction accuracy by 55%. We validate our approach using a UR5e robot arm equipped with an Inspire-Robots Dexterous Hand, showing robust performance across primitive object geometries.
comment: 8 pages, 11 figures, Accepted by ICRA 2026
Motion-aware Event Suppression for Event Cameras
In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking
Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at https://lags.cs.uni-freiburg.de/.
FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time
Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera's heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere ($S^2$) to estimate the camera's heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.
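The voting scheme described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the correspondences are given as unit bearing vectors `b1` and `b2` that have already been rotation-compensated, so that the heading `t` satisfies `t @ cross(b1, b2) = 0` for each inlier pair; the function names, the bin count, and the band half-width `tol` are hypothetical. The heading is recovered only up to sign here.

```python
import numpy as np

def fibonacci_lattice(n):
    """Return n roughly evenly spaced unit vectors on the sphere, shape (n, 3)."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n                      # uniform spacing in height
    phi = np.pi * (1.0 + 5.0 ** 0.5) * i       # golden-ratio azimuth steps
    r = np.sqrt(1.0 - z * z)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def vote_heading(b1, b2, bins, tol=0.03):
    """Hough-style vote for the translation direction on the sphere.

    Each correspondence defines a great circle with normal b1 x b2; a bin
    receives a vote when it lies within a thin band around that circle.
    Returns the winning bin direction and the vote count of every bin.
    """
    normals = np.cross(b1, b2)
    norms = np.linalg.norm(normals, axis=1, keepdims=True)
    keep = norms[:, 0] > 1e-9                  # drop degenerate pairs
    normals = normals[keep] / norms[keep]
    votes = (np.abs(bins @ normals.T) < tol).sum(axis=1)
    return bins[np.argmax(votes)], votes
```

With `bins = fibonacci_lattice(5000)`, the call `vote_heading(b1, b2, bins)` returns the best-supported direction. Outlier correspondences spread their votes over unrelated great circles, which is why the peak bin tends to remain stable in this kind of scheme.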
Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios
The sudden appearance of occluded pedestrians presents a critical safety challenge in autonomous driving. Conventional rule-based or purely data-driven approaches struggle with the inherent high uncertainty of these long-tail scenarios. To tackle this challenge, we propose a novel framework grounded in Active Inference, which endows the agent with a human-like, belief-driven mechanism. Our framework leverages a Rao-Blackwellized Particle Filter (RBPF) to efficiently estimate the pedestrian's hybrid state. To emulate human-like cognitive processes under uncertainty, we introduce a Conditional Belief Reset mechanism and a Hypothesis Injection technique to explicitly model beliefs about the pedestrian's multiple latent intentions. Planning is achieved via a Cross-Entropy Method (CEM) enhanced Model Predictive Path Integral (MPPI) controller, which synergizes the efficient, iterative search of CEM with the inherent robustness of MPPI. Simulation experiments demonstrate that our approach significantly reduces the collision rate compared to reactive, rule-based, and reinforcement learning (RL) baselines, while also exhibiting explainable and human-like driving behavior that reflects the agent's internal belief state.
comment: 14 pages, 6 figures, Proceedings of the 2026 ACM/IEEE International Conference on Human-Robot Interaction (HRI'26)
GeoWorld: Geometric World Models CVPR 2026
Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.
comment: Accepted to CVPR 2026
Marinarium: a New Arena to Bring Maritime Robotics Closer to Shore
This paper presents the Marinarium, a modular and stand-alone underwater research facility designed to provide a realistic testbed for maritime and space-analog robotic experimentation in a resource-efficient manner. The Marinarium combines a fully instrumented underwater and aerial operational volume, extendable via a retractable roof for real-weather conditions, a digital twin in the SMaRCSim simulator and tight integration with a space robotics laboratory. All of these result from design choices aimed at bridging simulation, laboratory validation, and field conditions. We compare the Marinarium to similar existing infrastructures and illustrate how its design enables a set of experiments in four open research areas within field robotics. First, we exploit high-fidelity dynamics data from the tank to demonstrate the potential of learning-based system identification approaches applied to underwater vehicles. We further highlight the versatility of the multi-domain operating volume via a rendezvous mission with a heterogeneous fleet of robots across underwater, surface, and air. We then illustrate how the presented digital twin can be utilized to reduce the reality gap in underwater simulation. Finally, we demonstrate the potential of underwater surrogates for spacecraft navigation validation by executing spatiotemporally identical inspection tasks on a planar space-robot emulator and a neutrally buoyant remotely operated vehicle (ROV). In this work, by sharing the insights obtained and rationale behind the design and construction of the Marinarium, we hope to provide the field robotics research community with a blueprint for bridging the gap between controlled and real offshore and space robotics experimentation.
An Empirical Analysis of Cooperative Perception for Occlusion Risk Mitigation
Occlusions present a significant challenge for connected and automated vehicles, as they can obscure critical road users from perception systems. Traditional risk metrics often fail to adequately capture the cumulative nature of these threats over time. In this paper, we propose a novel and universal risk assessment metric, the Risk of Tracking Loss (RTL), which aggregates instantaneous risk intensity throughout occluded periods. This provides a holistic risk profile that encompasses both high-intensity, short-term threats and prolonged exposure. Utilizing diverse and high-fidelity real-world datasets, a large-scale statistical analysis is conducted to characterize occlusion risk and validate the effectiveness of the proposed metric. The metric is applied to evaluate different vehicle-to-everything (V2X) deployment strategies. Our study shows that although full V2X penetration theoretically eliminates this risk, the reduction is highly nonlinear; a substantial statistical benefit requires a high penetration threshold of 75-90%. To overcome this limitation, we propose a novel asymmetric communication framework that allows even non-connected vehicles to receive warnings. Experimental results demonstrate that this paradigm achieves better risk mitigation performance. We found that our approach at 25% penetration outperforms the traditional symmetric model at 75%, and benefits saturate at only 50% penetration. This work provides a crucial risk assessment metric and a cost-effective, strategic roadmap for accelerating the safety benefits of V2X deployment.
comment: Accepted for publication in IEEE Internet of Things Journal (Regular Article), 2026. DOI: 10.1109/JIOT.2026.3668184
InCoM: Intent-Driven Perception and Structured Coordination for Whole-Body Mobile Manipulation
Whole-body mobile manipulation is a fundamental capability for general-purpose robotic agents, requiring both coordinated control of the mobile base and manipulator and robust perception under dynamically changing viewpoints. However, existing approaches face two key challenges: strong coupling between base and arm actions complicates whole-body control optimization, and perceptual attention is often poorly allocated as viewpoints shift during mobile manipulation. We propose InCoM, an intent-driven perception and structured coordination framework for whole-body mobile manipulation. InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that enhances multimodal correspondence. On the control side, we design a decoupled coordinated flow matching action decoder that explicitly models coordinated base-arm action generation, alleviating optimization difficulties caused by control coupling. Without access to privileged perceptual information, InCoM outperforms state-of-the-art methods on three ManiSkill-HAB scenarios by 28.2%, 26.1%, and 23.6% in success rate, demonstrating strong effectiveness for whole-body mobile manipulation.
comment: 16 pages, 9 figures
DigiArm: An Anthropomorphic 3D-Printed Prosthetic Hand with Enhanced Dexterity for Typing Tasks
Despite recent advancements, existing prosthetic limbs are unable to replicate the dexterity and intuitive control of the human hand. Current control systems for prosthetic hands are often limited to grasping, and commercial prosthetic hands lack the precision needed for dexterous manipulation or applications that require fine finger motions. Thus, there is a critical need for accessible and replicable prosthetic designs that enable individuals to interact with electronic devices and perform precise finger pressing, such as keyboard typing or piano playing, while preserving current prosthetic capabilities. This paper presents a low-cost, lightweight, 3D-printed robotic prosthetic hand, specifically engineered for enhanced dexterity with electronic devices such as a computer keyboard or piano, as well as general object manipulation. The robotic hand features a mechanism to adjust finger abduction/adduction spacing, a 2-D wrist with the inclusion of controlled ulnar/radial deviation optimized for typing, and control of independent finger pressing. We conducted a study to demonstrate how participants can use the robotic hand to perform keyboard typing and piano playing in real time, with different levels of finger and wrist motion. This supports the notion that our proposed design can allow for the execution of key typing motions more effectively than before, aiming to enhance the functionality of prosthetic hands.
A Perspective on Open Challenges in Deformable Object Manipulation
Deformable object manipulation (DOM) represents a critical challenge in robotics, with applications spanning healthcare, manufacturing, food processing, and beyond. Unlike rigid objects, deformable objects exhibit infinite dimensionality, dynamic shape changes, and complex interactions with their environment, posing significant hurdles for perception, modeling, and control. This paper reviews the state of the art in DOM, focusing on key challenges such as occlusion handling, task generalization, and scalable, real-time solutions. It highlights advancements in multimodal perception systems, including the integration of multi-camera setups, active vision, and tactile sensing, which collectively address occlusion and improve adaptability in unstructured environments. Cutting-edge developments in physically informed reinforcement learning (RL) and differentiable simulations are explored, showcasing their impact on efficiency, precision, and scalability. The review also emphasizes the potential of simulated expert demonstrations and generative neural networks to standardize task specifications and bridge the simulation-to-reality gap. Finally, future directions are proposed, including the adoption of graph neural networks for high-level decision-making and the creation of comprehensive datasets to enhance DOM's real-world applicability. By addressing these challenges, DOM research can pave the way for versatile robotic systems capable of handling diverse and dynamic tasks with deformable objects.
comment: 28 pages, 7 Figures
Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy
Percutaneous dilatational tracheostomy (PDT) is frequently performed on patients in intensive care units for prolonged mechanical ventilation. The needle puncture, as the most critical step of PDT, could lead to adverse consequences such as major bleeding and posterior tracheal wall perforation if performed inaccurately. Current practices of PDT puncture are all performed manually with no navigation assistance, which leads to large position and angular errors (5 mm and 30 degrees). To improve the accuracy and reduce the difficulty of the PDT procedure, we propose a system that automates the needle insertion using a velocity-controlled robotic manipulator. Guided by pose data from two electromagnetic sensors, one at the needle tip and the other inside the trachea, the robotic system uses an adaptive constrained controller to adapt the uncertain kinematic parameters online and avoid collisions with the patient's body and tissues near the target. Simulations were performed to validate the controller's implementation, and then four hundred PDT punctures were performed on a mannequin to evaluate the position and angular accuracy. The absolute median puncture position error was 1.7 mm (IQR: 1.9 mm) and midline deviation was 4.13 degrees (IQR: 4.55 degrees), measured by the sensor inside the trachea. The small deviations from the nominal puncture in a simulated experimental setup and formal guarantees of collision-free insertions suggest the feasibility of the robotic PDT puncture.
Considering Perspectives for Automated Driving Ethics: Collective Risk in Vehicular Motion Planning
Recent automated vehicle (AV) motion planning strategies revolve around minimizing risk in road traffic. However, they exclusively consider risk from the AV's perspective and, as such, do not address the ethicality of its decisions for other road users. We argue that this does not reduce the risk of each road user, as risk may be different from the perspective of each road user. Indeed, minimizing the risk from the AV's perspective may not imply that the risk from the perspective of other road users is also being minimized; in fact, it may even increase. To test this hypothesis, we propose an AV motion planning strategy that supports switching risk minimization strategies between all road user perspectives. We find that the risk from the perspective of other road users can generally be considered different to the risk from the AV's perspective. Taking a collective risk perspective, i.e., balancing the risks of all road users, we observe an AV that minimizes overall traffic risk the best, while putting itself at slightly higher risk for the benefit of others, which is consistent with human driving behavior. In addition, adopting a collective risk minimization strategy can also be beneficial to the AV's travel efficiency by acting assertively when other road users maintain a low risk estimate of the AV. Yet, the AV drives conservatively when its planned actions are less predictable to other road users, i.e., associated with high risk. We argue that such behavior is a form of self-reflection and a natural prerequisite for socially acceptable AV behavior. We conclude that to facilitate ethicality in road traffic that includes AVs, the risk-perspective of each road user must be considered in the decision-making of AVs.
comment: 17 pages, 6 figures, 2 tables
WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents
While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.
comment: 11 pages, 8 figures
Bayesian Preference Elicitation: Human-In-The-Loop Optimization of An Active Prosthesis
Tuning active prostheses for people with amputation is time-consuming and relies on metrics that may not fully reflect user needs. We introduce a human-in-the-loop optimization (HILO) approach that leverages direct user preferences to personalize a standard four-parameter prosthesis controller efficiently. Our method employs preference-based Multiobjective Bayesian Optimization that uses a state-of-the-art acquisition function especially designed for preference learning, and includes two algorithmic variants: a discrete version (EUBO-LineCoSpar), and a continuous version (BPE4Prost). Simulation results on benchmark functions and real-application trials demonstrate efficient convergence, robust preference elicitation, and measurable biomechanical improvements, illustrating the potential of preference-driven tuning for user-centered prosthesis control.
comment: 8 pages, 5 figures
DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while simultaneously reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available on https://github.com/PKU-SEC-Lab/DYSL_VLA.
comment: DAC 2026
GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with a grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct, from the intermediate representations, wrist-camera images with the graspness back-projected onto them. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.
comment: Accepted to CVPR 2026
Performance and Experimental Analysis of Strain-based Models for Continuum Robots
Although strain-based models have been widely adopted in robotics, no comparison beyond the uniform bending test is commonly recognized to assess their performance. In addition, the increasing effort in prototyping continuum robots highlights the need to assess the applicability of these models and the necessity of comprehensive performance evaluation. To address this gap, this work investigates the shape reconstruction abilities of a third-order strain interpolation method, examining its ability to capture both individual and combined deformation effects. These results are compared and discussed against the Geometric-Variable Strain approach. Subsequently, simulation results are experimentally verified by reshaping a slender rod while recording the resulting configurations using cameras. The rod configuration is imposed using a manipulator displacing one of its tips and extracted through reflective markers, without the aid of any other external sensor (i.e., strain gauges or wrench sensors placed along the rod). The experiments demonstrate good agreement between the model predictions and observed shapes, with an average error of 0.58% of the rod length and an average computational time of 0.32 s per configuration, outperforming existing models.
LeRobot: An Open-Source Library for End-to-End Robot Learning
Robotics is undergoing a significant transformation powered by advances in high-level control techniques based on machine learning, giving rise to the field of robot learning. Recent progress in robot learning has been accelerated by the increasing availability of affordable teleoperation systems, large-scale openly available datasets, and scalable learning-based methods. However, development in the field of robot learning is often slowed by fragmented, closed-source tools designed to only address specific sub-components within the robotics stack. In this paper, we present lerobot, an open-source library that integrates across the entire robot learning stack, from low-level middleware communication for motor controls to large-scale dataset collection, storage and streaming. The library is designed with a strong focus on real-world robotics, supporting accessible hardware platforms while remaining extensible to new embodiments. It also supports efficient implementations for various state-of-the-art robot learning algorithms from multiple prominent paradigms, as well as a generalized asynchronous inference stack. Unlike traditional pipelines which heavily rely on hand-crafted techniques, lerobot emphasizes scalable learning approaches that improve directly with more data and compute. Designed for accessibility, scalability, and openness, lerobot lowers the barrier to entry for researchers and practitioners to robotics while providing a platform for reproducible, state-of-the-art robot learning.
comment: https://github.com/huggingface/lerobot
Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving
Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.
Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera
To catch a thrown object, a robot must be able to perceive the object's motion and generate control actions in a timely manner. Rather than explicitly estimating the object's 3D position, this work focuses on a novel approach that recognizes object motion using pixel-level visual information extracted from a single RGB image. Such visual cues capture changes in the object's position and scale, allowing the policy to reason about the object's motion. Furthermore, to achieve stable learning in a high-DoF system composed of a robot arm equipped with a multi-fingered hand, we design a heterogeneous multi-agent reinforcement learning framework that defines the arm and hand as independent agents with distinct roles. Each agent is trained cooperatively using role-specific observations and rewards, and the learned policies are successfully transferred from simulation to the real world.
Sapling-NeRF: Geo-Localised Sapling Reconstruction in Forests for Ecological Monitoring
Saplings are key indicators of forest regeneration and overall forest health. However, their fine-scale architectural traits are difficult to capture with existing 3D sensing methods, which makes quantitative evaluation difficult. Terrestrial Laser Scanners (TLS), Mobile Laser Scanners (MLS), and traditional photogrammetry approaches poorly reconstruct thin branches and dense foliage, and lack the scale consistency needed for long-term monitoring. Implicit 3D reconstruction methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) are promising alternatives, but cannot recover the true scale of a scene and lack any means to be accurately geo-localised. In this paper, we present a pipeline which fuses NeRF, LiDAR SLAM, and GNSS to enable repeatable, geo-localised ecological monitoring of saplings. Our system proposes a three-level representation: (i) coarse Earth-frame localisation using GNSS, (ii) LiDAR-based SLAM for centimetre-accurate localisation and reconstruction, and (iii) NeRF-derived object-centric dense reconstruction of individual saplings. This approach enables repeatable quantitative evaluation and long-term monitoring of sapling traits. Our experiments in forest plots in Wytham Woods (Oxford, UK) and Evo (Finland) show that stem height, branching patterns, and leaf-to-wood ratios can be captured with increased accuracy as compared to TLS. We demonstrate that accurate stem skeletons and leaf distributions can be measured for saplings with heights between 0.5 m and 2 m in situ, giving ecologists access to richer structural and quantitative data for analysing forest dynamics.
Robust Helicopter Ship Deck Landing With Guaranteed Timing Using Shrinking-Horizon Model Predictive Control
We present a runtime-efficient algorithm for autonomous helicopter landings on moving ship decks based on Shrinking-Horizon Model Predictive Control (SHMPC). First, a suitable planning model capturing the relevant aspects of the full nonlinear helicopter dynamics is derived. Next, we use the SHMPC together with a touchdown controller stage to ensure a pre-specified maneuver time and an associated landing time window despite the presence of disturbances. A high disturbance rejection performance is achieved by designing an ancillary controller with disturbance feedback. Thus, given a target position and time, a safe landing with suitable terminal conditions is guaranteed if the initial optimization problem is feasible. The efficacy of our approach is shown in simulation, where all maneuvers achieve a high landing precision in strong winds while satisfying timing and operational constraints with maximum computation times in the millisecond range.
comment: This version was submitted to the American Control Conference 2026 and has been accepted
SCOPE: Skeleton Graph-Based Computation-Efficient Framework for Autonomous UAV Exploration
Autonomous exploration in unknown environments is key for mobile robots, helping them perceive, map, and make decisions in complex areas. However, current methods often rely on frequent global optimization, suffering from high computational latency and trajectory oscillation, especially on resource-constrained edge devices. To address these limitations, we propose SCOPE, a novel framework that incrementally constructs a real-time skeletal graph and introduces Implicit Unknown Region Analysis for efficient spatial reasoning. The planning layer adopts a hierarchical on-demand strategy: the Proximal Planner generates smooth, high-frequency local trajectories, while the Region-Sequence Planner is activated only when necessary to optimize global visitation order. Comparative evaluations in simulation demonstrate that SCOPE achieves competitive exploration performance comparable to state-of-the-art global planners, while reducing computational cost by an average of 86.9%. Real-world experiments further validate the system's robustness and low latency in practical scenarios.
comment: This paper has been accepted for publication in the IEEE ROBOTICS AND AUTOMATION LETTERS (RA-L). Please cite the paper using appropriate formats
Does the testing environment matter? Carsickness across on-road, test-track, and driving simulator conditions
Carsickness has gained significant attention with the rise of automated vehicles, prompting extensive research across on-road, test-track, and driving simulator environments to understand its occurrence and develop mitigation strategies. However, the lack of carsickness standardization complicates comparisons across studies and environments. Previous works demonstrate measurement validity between two setups at most (e.g., on-road vs. driving simulator), leaving gaps in multi-environment comparisons. This study investigates the recreation of an on-road motion sickness exposure - previously replicated on a test track - using a motion-based driving simulator. Twenty-eight participants performed an eyes-off-road non-driving task while reporting motion sickness using the Misery Scale during the experiment and the Motion Sickness Assessment Questionnaire afterward. Psychological factors known to influence motion sickness were also assessed. The results present subjective and objective measurements for motion sickness across the considered environments. In this paper, acceleration measurements, objective metrics and subjective motion sickness ratings across environments are compared, highlighting key differences in sickness occurrence for simulator-based research validity. Significantly lower motion sickness scores are reported in the simulator compared to on-road and test-track conditions, due to its limited working envelope to reproduce low-frequency (<0.5 Hz) motions, which are the most provocative for motion sickness.
Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline
Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm including post-training and fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.
comment: Accepted by ICRA 2026
Designing Robots for Families: In-Situ Prototyping for Contextual Reminders on Family Routines
Robots are increasingly entering the daily lives of families, yet their successful integration into domestic life remains a challenge. We explore family routines as a critical entry point for understanding how robots might find a sustainable role in everyday family settings. Together with each of the ten families, we co-designed robot interactions and behaviors, and a plan for the robot to support their chosen routines, accounting for contextual factors such as timing, participants, locations, and the activities in the environment. We then designed, prototyped, and deployed a mobile social robot as a four-day, in-home user study. Families welcomed the robot's reminders, with parents especially appreciating the offloading of some reminding tasks. At the same time, interviews revealed tensions around timing, authority, and family dynamics, highlighting the complexity of integrating robots into households beyond the immediate task of reminders. Based on these insights, we offer design implications for robot-facilitated contextual reminders and discuss broader considerations for designing robots for family settings.
comment: Proceedings of the 21st ACM/IEEE International Conference on Human Robot Interaction (HRI 2026)
Metamorphic Testing of Vision-Language Action-Enabled Robots
Vision-Language-Action (VLA) models are multimodal robotic task controllers that, given an instruction and visual inputs, produce a sequence of low-level control actions (or motor commands) enabling a robot to execute the requested task in the physical environment. These systems face the test oracle problem from multiple perspectives. On the one hand, a test oracle must be defined for each instruction prompt, which is a complex and non-generalizable approach. On the other hand, current state-of-the-art oracles typically capture symbolic representations of the world (e.g., robot and object states), enabling the correctness evaluation of a task, but fail to assess other critical aspects, such as the quality with which VLA-enabled robots perform a task. In this paper, we explore whether Metamorphic Testing (MT) can alleviate the test oracle problem in this context. To do so, we propose two metamorphic relation patterns and five metamorphic relations to assess whether changes to the test inputs impact the original trajectory of the VLA-enabled robots. An empirical study involving five VLA models, two simulated robots, and four robotic tasks shows that MT can effectively alleviate the test oracle problem by automatically detecting diverse types of failures, including, but not limited to, uncompleted tasks. More importantly, the proposed MRs are generalizable, making the proposed approach applicable across different VLA models, robots, and tasks, even in the absence of test oracles.
Relational Appliances: A Robot in the Refrigerator for Home-Based Health Promotion
Kitchen appliances are frequently used domestic artifacts situated at the point of everyday dietary decision making, making them a promising but underexplored site for health promotion. We explore the concept of relational appliances: everyday household devices designed as embodied social actors that engage users through ongoing, personalized interaction. We focus on the refrigerator, whose unique affordances, including a fixed, sensor-rich environment, private interaction space, and close coupling to food items, support contextualized, conversational engagement during snack choices. We present an initial exploration of this concept through a pilot study deploying an anthropomorphic robotic head inside a household refrigerator. In a home-lab apartment, participants repeatedly retrieved snacks during simulated TV "commercial breaks" while interacting with a human-sized robotic head. Participants were randomized to either a health-promotion condition, in which the robot made healthy snack recommendations, or a social-chat control condition. Outcomes included compliance with recommendations, nutritional quality of selected snacks, and psychosocial measures related to acceptance of the robot. Results suggest that participants found the robot persuasive, socially engaging, and increasingly natural over time, often describing it as helpful, aware, and companionable. Most participants reported greater awareness of their snack decisions and expressed interest in having such a robot in their own home. We discuss implications for designing relational appliances that leverage anthropomorphism, trust, and long-term human-technology relationships for home-based health promotion.
SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction. In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The proposed pipeline transforms continuous gesture streams into coherent language commands through geometric normalization, temporal smoothing, and lexical refinement, ensuring stable and consistent interaction. Furthermore, the framework is designed to support future integration of transformer-based gloss-free sign language models, enabling scalable word-level and sentence-level semantic understanding. Experimental results demonstrate the effectiveness of the proposed system in grounding sign-derived instructions into precise robotic actions under diverse interaction scenarios. These results highlight the potential of the framework to advance accessible, scalable, and multimodal embodied intelligence.
comment: 7 pages, 2 figures
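The temporal smoothing and lexical refinement stages described in the SignVLA abstract could look roughly like the sketch below. It assumes a per-frame letter classifier already exists and that "_" marks frames with no sign; the window size, the blank symbol, the vocabulary, and the use of `difflib` for refinement are illustrative assumptions, not details from the paper. Geometric normalization is not covered.

```python
from collections import Counter
from difflib import get_close_matches
from typing import List


def smooth(frames: List[str], window: int = 3) -> List[str]:
    """Replace each frame label with the majority label in a sliding window."""
    half = window // 2
    out = []
    for i in range(len(frames)):
        lo, hi = max(0, i - half), min(len(frames), i + half + 1)
        out.append(Counter(frames[lo:hi]).most_common(1)[0][0])
    return out


def collapse(frames: List[str], blank: str = "_") -> str:
    """Merge runs of identical labels and drop blank (no-sign) frames."""
    letters = []
    prev = None
    for label in frames:
        if label != prev and label != blank:
            letters.append(label)
        prev = label
    return "".join(letters)


def refine(word: str, vocabulary: List[str]) -> str:
    """Snap a spelled word to the closest command word, if one is close enough."""
    matches = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word


# Hypothetical command vocabulary and a noisy stream with one spurious "X" frame.
VOCABULARY = ["PICK", "PLACE", "PUSH"]
stream = list("PPXPP__III__CCC__KKK")
command = refine(collapse(smooth(stream)), VOCABULARY)
```

Requiring a blank between repeated letters (as in `"L_L"`) is one simple way to keep double letters distinguishable from a held sign.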
V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space
Reachability analysis has become increasingly important in robotics to distinguish safe from unsafe states. Unfortunately, existing reachability and safety analysis methods often fall short, as they typically require known system dynamics or large datasets to estimate accurate system models, are computationally expensive, and assume full state information. A recent method, called MORALS, aims to address these shortcomings by using topological tools to estimate Regions of Attraction (ROA) in a low-dimensional latent space. However, MORALS still relies on full state knowledge and has not been studied when only sensor measurements are available. This paper presents Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space (V-MORALS). V-MORALS takes in a dataset of image-based trajectories of a system under a given controller, and learns a latent space for reachability analysis. Using this learned latent space, our method is able to generate well-defined Morse Graphs, from which we can compute ROAs for various systems and controllers. V-MORALS provides capabilities similar to the original MORALS architecture without relying on state knowledge, and using only high-level sensor data. Our project website is at: https://v-morals.onrender.com.
TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving
Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real-world datasets make it difficult to evaluate models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations. Some appear to be tailored primarily to particular, limited sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks, and vision-language-action models. Furthermore, we demonstrate its versatility by training various models using our dataset. We also provide numerical rarity scores to indicate how rarely the current state occurs in the dataset.
Refining Almost-Safe Value Functions on the Fly
Control Barrier Functions (CBFs) are a powerful tool for ensuring robotic safety, but designing or learning valid CBFs for complex systems is a significant challenge. While Hamilton-Jacobi Reachability provides a formal method for synthesizing safe value functions, it scales poorly and is typically performed offline, limiting its applicability in dynamic environments. This paper bridges the gap between offline synthesis and online adaptation. We introduce refineCBF for refining an approximate CBF - whether analytically derived, learned, or even unsafe - via warm-started HJ reachability. We then present its computationally efficient successor, HJ-Patch, which accelerates this process through localized updates. Both methods guarantee the recovery of a safe value function and can ensure monotonic safety improvements during adaptation. Our experiments validate our framework's primary contribution: in-the-loop, real-time adaptation, in simulation (with detailed value function analysis) and on physical hardware. Our experiments on ground vehicles and quadcopters show that our framework can successfully adapt to sudden environmental changes, such as new obstacles and unmodeled wind disturbances, providing a practical path toward deploying formally guaranteed safety in real-world settings.
Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents' movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance: an edge with a high weight discourages agents from using it but does not prohibit them from traversing it. In this paper, we explore the need to incorporate edge-direction optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two separate phases. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.
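The difference between soft guidance (a high edge weight) and strict guidance (a removed edge direction) can be seen with a single-agent shortest-path query, sketched below. The three-vertex graph, its weights, and the plain Dijkstra search are assumptions chosen for illustration; waiting actions and the multi-agent planner from the paper are omitted.

```python
import heapq
from typing import Dict, List, Tuple

# graph[u][v] is the cost of the directed move u -> v; a missing entry means
# that direction is not allowed.
Graph = Dict[str, Dict[str, float]]


def shortest_path(graph: Graph, start: str, goal: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm; returns (cost, path), or (inf, []) if the goal is unreachable."""
    dist = {start: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return float("inf"), []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]


# Soft guidance: C -> B is expensive but still traversable.
soft = {"A": {"B": 1}, "B": {"A": 1, "C": 1}, "C": {"B": 10}}
# Strict guidance: the C -> B direction is removed entirely.
mixed = {"A": {"B": 1}, "B": {"A": 1, "C": 1}, "C": {}}

soft_result = shortest_path(soft, "C", "A")
mixed_result = shortest_path(mixed, "C", "A")
```

With the weighted graph the agent still takes the discouraged edge when it has no alternative, while the mixed graph makes that move impossible.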
Printed helicoids with embedded air channels make sensorized segments for soft continuum robots
Soft robots enable safe, adaptive interaction with complex environments but remain difficult to sense and control due to their highly deformable structures. Architected soft materials such as helicoid lattices offer tunable stiffness and strength but are challenging to instrument because of their sparse geometry. We introduce a fabrication method for embedding air channels into helicoid-based soft continuum robots. Multi-material segments fabricated via vision-controlled jetting in a single print interface with PCBs housing miniature pressure sensors and IMUs for distributed deformation sensing. We characterize the mechanical properties of four helicoid designs and validate the sensor response to fundamental deformation modes. To demonstrate the platform's scalability, we construct and mechanically evaluate a meter-scale, 14-DoF cable-driven soft arm capable of open-loop trajectory tracking and object grasping, with tactile-based stiffness detection demonstrated using the gripper sensors. This approach establishes a scalable fabrication strategy for sensorized architected materials in large-scale soft robotic systems.
comment: Accepted for publication in the proceedings of the 2026 IEEE 9th International Conference on Soft Robotics (RoboSoft)
Demystifying Action Space Design for Robotic Manipulation Policies
The specification of the action space plays a pivotal role in imitation-based robotic manipulation policy learning, fundamentally shaping the optimization landscape of policy learning. While recent advances have focused heavily on scaling training data and model capacity, the choice of action space remains guided by ad-hoc heuristics or legacy designs, leading to an ambiguous understanding of robotic policy design philosophies. To address this ambiguity, we conducted a large-scale and systematic empirical study, confirming that the action space does have significant and complex impacts on robotic policy learning. We dissect the action design space along temporal and spatial axes, facilitating a structured analysis of how these choices govern both policy learnability and control stability. Based on 13,000+ real-world rollouts on a bimanual robot and evaluation on 500+ trained models over four scenarios, we examine the trade-offs between absolute vs. delta representations, and joint-space vs. task-space parameterizations. Our large-scale results suggest that properly designing the policy to predict delta actions consistently improves performance, while joint-space and task-space representations offer complementary strengths, favoring control stability and generalization, respectively.
Cybersecurity of Teleoperated Quadruped Robots: A Systematic Survey of Vulnerabilities, Threats, and Open Defense Gaps
Teleoperated quadruped robots are increasingly deployed in safety-critical missions -- industrial inspection, military reconnaissance, and emergency response -- yet the security of their communication and control infrastructure remains insufficiently characterized. Quadrupeds present distinct security challenges arising from dynamic stability constraints, gait-dependent vulnerability windows, substantial kinetic energy, and elevated operator cognitive load. This survey synthesizes peer-reviewed literature and vulnerability disclosures (2019--2025) to provide comprehensive analysis of cybersecurity threats, consequences, and countermeasures for teleoperated quadruped systems. We contribute: (i) a six-layer attack taxonomy spanning perception manipulation, VR/AR operator targeting, communication disruption, control signal attacks, localization spoofing, and network intrusion; (ii) systematic attack-to-consequence mapping with timing characterization; (iii) Technology Readiness Level classification exposing critical maturity gaps between field-deployed communication protections (TRL 7--9) and experimental perception/operator-layer defenses (TRL 3--5); (iv) comparative security analysis of six commercial platforms; (v) pragmatic deployment guidance stratified by implementation timeline; and (vi) eight prioritized research gaps with implementation roadmaps. Limitations: Platform assessments rely on publicly available information. Attack success rates derive from cited studies under controlled conditions and require domain-specific validation.
comment: survey paper; 23 tables; 9 figures; 132 references
DropVLA: An Action-Level Backdoor Attack on Vision--Language--Action Models
Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.
comment: 8 pages, 6 tables, 3 figures. Under review
NMPCM: Nonlinear Model Predictive Control on Resource-Constrained Microcontrollers
Nonlinear Model Predictive Control (NMPC) is a powerful approach for controlling highly dynamic robotic systems, as it accounts for system dynamics and optimizes control inputs at each step. However, its high computational complexity makes implementation on resource-constrained microcontrollers impractical. While recent studies have demonstrated the feasibility of Model Predictive Control (MPC) with linearized dynamics on microcontrollers, applying full NMPC remains a significant challenge. This work presents an efficient solution for generating and deploying NMPC on microcontrollers (NMPCM) to control quadrotor UAVs. The proposed method optimizes computational efficiency while maintaining high control accuracy. Simulations in Gazebo/ROS and real-world experiments validate the effectiveness of the approach, demonstrating its capability to achieve high-frequency NMPC execution in real-time systems. The code is available at: https://github.com/aralab-unr/NMPCM.
PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting ICRA 2026
Accurately predicting how agents move in dynamic scenes is essential for safe autonomous driving. State-of-the-art motion forecasting models rely on datasets with manually annotated or post-processed trajectories. However, building these datasets is costly, generally manual, hard to scale, and lacks reproducibility. They also introduce domain gaps that limit generalization across environments. We introduce PPT (Pretraining with Pseudo-labeled Trajectories), a simple and scalable pretraining framework that uses unprocessed and diverse trajectories automatically generated from off-the-shelf 3D detectors and tracking. Unlike data annotation pipelines aiming for clean, single-label annotations, PPT is a pretraining framework embracing off-the-shelf trajectories as useful signals for learning robust representations. With optional finetuning on a small amount of labeled data, models pretrained with PPT achieve strong performance across standard benchmarks, particularly in low-data regimes, and in cross-domain, end-to-end, and multi-class settings. PPT is easy to implement and improves generalization in motion forecasting.
comment: 8 pages, 6 figures, accepted to ICRA 2026
Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones
Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.
Time-Varying Formation Tracking Control of Wheeled Mobile Robots With Region Constraint: A Generalized Udwadia-Kalaba Framework
In this article, the time-varying formation tracking control of wheeled mobile robots with region constraint is investigated from a generalized Udwadia-Kalaba framework. The communication network is modeled as a directed and weighted graph that has a spanning tree with the leader being the root. By reformulating the time-varying formation tracking control objective as an equality constrained equation and transforming the region constraint by a diffeomorphism, the time-varying formation tracking controller with the region constraint is designed under the generalized Udwadia-Kalaba framework. Compared with the existing works on time-varying formation tracking control, the region constraint is taken into account in this paper, which ensures the safety of the robots. Finally, the feasibility of the proposed control strategy is illustrated through some numerical simulations.
comment: 17 pages, 9 figures
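For orientation, the classical Udwadia-Kalaba equation is reproduced below in its standard textbook form. The generalized framework used in the article may differ in its details, and the symbols $M$, $Q$, $A$, and $b$ follow the usual convention rather than the article's own notation. For an unconstrained system $M(q,t)\,\ddot{q} = Q(q,\dot{q},t)$ with positive definite $M$, subject to constraints written at the acceleration level as $A(q,\dot{q},t)\,\ddot{q} = b(q,\dot{q},t)$, the constrained motion satisfies

$$
M\ddot{q} = Q + M^{1/2}\left(A M^{-1/2}\right)^{+}\left(b - A M^{-1} Q\right),
$$

where $(\cdot)^{+}$ denotes the Moore-Penrose pseudoinverse. The second term on the right is the constraint force; in a control setting such as formation tracking, it plays the role of the control input that enforces the equality constraint.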
Spatially anchored Tactile Awareness for Robust Dexterous Manipulation
Dexterous manipulation requires precise geometric reasoning, yet existing visuo-tactile learning methods struggle with sub-millimeter precision tasks that are routine for traditional model-based approaches. We identify a key limitation: while tactile sensors provide rich contact information, current learning frameworks fail to effectively leverage both the perceptual richness of tactile signals and their spatial relationship with hand kinematics. We believe an ideal tactile representation should explicitly ground contact measurements in a stable reference frame while preserving detailed sensory information, enabling policies to not only detect contact occurrence but also precisely infer object geometry in the hand's coordinate system. We introduce SaTA (Spatially-anchored Tactile Awareness for dexterous manipulation), an end-to-end policy framework that explicitly anchors tactile features to the hand's kinematic frame through forward kinematics, enabling accurate geometric reasoning without requiring object models or explicit pose estimation. Our key insight is that spatially grounded tactile representations allow policies to not only detect contact occurrence but also precisely infer object geometry in the hand's coordinate system. We validate SaTA on challenging dexterous manipulation tasks, including bimanual USB-C mating in free space, a task demanding sub-millimeter alignment precision, as well as light bulb installation requiring precise thread engagement and rotational control, and card sliding that demands delicate force modulation and angular precision. These tasks represent significant challenges for learning-based methods due to their stringent precision requirements. Across multiple benchmarks, SaTA significantly outperforms strong visuo-tactile baselines, improving success rates by up to 30% while reducing task completion times by 27%.
comment: 8 pages
ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting ICRA 2026
3D occupancy prediction is critical for comprehensive scene understanding in vision-centric autonomous driving. Recent advances have explored utilizing 3D semantic Gaussians to model occupancy while reducing computational overhead, but they remain constrained by insufficient multi-view spatial interaction and limited multi-frame temporal consistency. To overcome these issues, in this paper, we propose a novel Spatial-Temporal Gaussian Splatting (ST-GS) framework to enhance both spatial and temporal modeling in existing Gaussian-based pipelines. Specifically, we develop a guidance-informed spatial aggregation strategy within a dual-mode attention mechanism to strengthen spatial interaction in Gaussian representations. Furthermore, we introduce a geometry-aware temporal fusion scheme that effectively leverages historical context to improve temporal continuity in scene completion. Extensive experiments on the large-scale nuScenes occupancy prediction benchmark showcase that our proposed approach not only achieves state-of-the-art performance but also delivers markedly better temporal consistency compared to existing Gaussian-based methods.
comment: Accepted by ICRA 2026
VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play NeurIPS 2025
Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy RL methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves 69.5% win rate against the strongest baseline in the 3 vs 3 task, demonstrating its potential for tackling the complex interplay between low-level control and high-level strategy. To highlight VolleyBots' sim-to-real potential, we further demonstrate the zero-shot deployment of a policy trained entirely in simulation on real-world drones.
comment: Accepted by NeurIPS 2025
Sparse Imagination for Efficient Visual World Model Planning ICLR 2026
World model based planning has significantly improved decision-making in complex environments by enabling agents to simulate future states and make informed choices. However, the computational burden of such planning is particularly restrictive in robotics, where resources are severely constrained. To address this limitation, we propose Sparse Imagination for Efficient Visual World Model Planning, which enhances computational efficiency by reducing the number of tokens processed during forward prediction. Our method leverages a sparsely trained, transformer-based visual world model with a randomized grouped attention strategy, allowing the model to flexibly adjust the number of tokens processed based on the available computational resources. By enabling sparse imagination during latent rollout, our approach significantly accelerates planning while maintaining high control fidelity. Experimental results demonstrate that sparse imagination preserves task performance while dramatically improving inference efficiency. This general technique for visual planning is applicable from simple test-time trajectory optimization to complex real-world tasks with the latest VLAs, enabling the deployment of world models in real-time scenarios.
comment: Accepted to ICLR 2026; Project Page: https://nikriz1.github.io/sparse_imagination/
SplatSDF: Boosting SDF-NeRF via Architecture-Level Fusion with Gaussian Splats
Signed distance-radiance field (SDF-NeRF) is a promising environment representation that offers both photo-realistic rendering and geometric reasoning such as proximity queries for collision avoidance. However, the slow training speed and convergence of SDF-NeRF hinder their use in practical robotic systems. We propose SplatSDF, a novel SDF-NeRF architecture that accelerates convergence using 3D Gaussian splats (3DGS), which can be quickly pre-trained. Unlike prior approaches that introduce a consistency loss between separate 3DGS and SDF-NeRF models, SplatSDF directly fuses 3DGS at an architectural level by consuming it as an input to SDF-NeRF during training. This is achieved using a novel sparse 3DGS fusion strategy that injects neural embeddings of 3DGS into SDF-NeRF around the object surface, while also permitting inference without 3DGS for minimal operation. Experimental results show SplatSDF achieves 3X faster convergence to the same geometric accuracy than the best baseline, and outperforms state-of-the-art SDF-NeRF methods in terms of chamfer distance and peak signal to noise ratio, unlike consistency loss-based approaches that in fact provide limited gains. We also present computational techniques for accelerating gradient and Hessian steps by 3X. We expect these improvements will contribute to deploying SDF-NeRF on practical systems.
SignBot: Learning Human-to-Humanoid Sign Language Interaction ICRA 2026
Sign language is a natural and visual form of language that uses movements and expressions to convey meaning, serving as a crucial means of communication for individuals who are deaf or hard-of-hearing (DHH). However, the number of people proficient in sign language remains limited, highlighting the need for technological advancements to bridge communication gaps and foster interactions with minorities. Based on recent advancements in embodied humanoid robots, we propose SignBot, a novel framework for human-robot sign language interaction. SignBot integrates a cerebellum-inspired motion control component and a cerebral-oriented module for comprehension and interaction. Specifically, SignBot consists of: 1) Motion Retargeting, which converts human sign language datasets into robot-compatible kinematics; 2) Motion Control, which leverages a learning-based paradigm to develop a robust humanoid control policy for tracking sign language gestures; and 3) Generative Interaction, which incorporates a sign language translator, responder, and generator, thereby enabling natural and effective communication between robots and humans. Simulation and real-world experimental results demonstrate that SignBot can effectively facilitate human-robot interaction and perform sign language motions with diverse robots and datasets. SignBot represents a significant advancement in automatic sign language interaction on embodied humanoid robot platforms, providing a promising solution to improve communication accessibility for the DHH community.
comment: Accepted by ICRA 2026
Super LiDAR Intensity for Robotic Perception
Conventionally, human intuition defines vision as a modality of passive optical sensing, relying on ambient light to perceive the environment. However, active optical sensing, which involves emitting and receiving signals, offers unique advantages by capturing both radiometric and geometric properties of the environment, independent of external illumination conditions. This work focuses on advancing active optical sensing using Light Detection and Ranging (LiDAR), which captures intensity data, enabling the estimation of surface reflectance that remains invariant under varying illumination. Such properties are crucial for robotic perception tasks, including detection, recognition, segmentation, and Simultaneous Localization and Mapping (SLAM). A key challenge with low-cost LiDARs lies in the sparsity of scan data, which limits their broader application. To address this limitation, this work introduces an innovative framework for generating dense LiDAR intensity images from sparse data, leveraging the unique attributes of non-repeating scanning LiDAR (NRS-LiDAR). We tackle critical challenges, including intensity calibration and the transition from static to dynamic scene domains, facilitating the reconstruction of dense intensity images in real-world settings. The key contributions of this work include a comprehensive dataset for LiDAR intensity image densification, a densification network tailored for NRS-LiDAR, and diverse applications such as loop closure and traffic lane detection using the generated dense intensity images. Experimental results validate the efficacy of the proposed approach, which successfully integrates computer vision techniques with LiDAR data processing, enhancing the applicability of low-cost LiDAR systems and establishing a novel paradigm for robotic vision via active optical sensing--LiDAR as a Camera.
comment: IEEE Robotics and Automation Letters (RA-L), 2026 (https://ieeexplore.ieee.org/document/11395610). The dataset and code are available at: (https://github.com/IMRL/Super-LiDAR-Intensity)
A Pragmatic VLA Foundation Model
Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5-2.8$\times$ speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
comment: Project Webpage: https://technology.robbyant.com/lingbot-vla/, Code: https://github.com/Robbyant/lingbot-vla/, GM-100: https://huggingface.co/datasets/robbyant/lingbot-GM-100
From Prompts to Printable Models: Support-Effective 3D Generation via Offset Direct Preference Optimization
Current text-to-3D models prioritize visual fidelity but often neglect physical fabricability, resulting in geometries requiring excessive support structures. This paper introduces SEG (\textit{\underline{S}upport-\underline{E}ffective \underline{G}eneration}), a novel framework that integrates Direct Preference Optimization with an Offset (ODPO) into the 3D generation pipeline to directly optimize models for minimal support material usage. By incorporating support structure simulation into the training process, SEG encourages the generation of geometries that inherently require fewer supports, thus reducing material waste and production time. We demonstrate SEG's effectiveness through extensive experiments on two benchmark datasets, Thingi10k-Val and GPT-3DP-Val, showing that SEG significantly outperforms baseline models such as TRELLIS, DPO, and DRO in terms of support volume reduction and printability. Qualitative results further reveal that SEG maintains high fidelity to input prompts while minimizing the need for support structures. Our findings highlight the potential of SEG to transform 3D printing by directly optimizing models during the generative process, paving the way for more sustainable and efficient digital fabrication practices.
comment: Accepted by IEEE Robotics and Automation Letters 2026, preprint version by authors
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning ICRA
Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. 8 pages, 2 figures
DreamWaQ++: Obstacle-Aware Quadrupedal Locomotion With Resilient Multi-Modal Reinforcement Learning
Quadrupedal robots hold promising potential for applications in navigating cluttered environments with resilience akin to their animal counterparts. However, their floating base configuration makes them vulnerable to real-world uncertainties, yielding substantial challenges in their locomotion control. Deep reinforcement learning has become one of the plausible alternatives for realizing a robust locomotion controller. However, approaches that rely solely on proprioception sacrifice collision-free locomotion because they require front-feet contact to detect the presence of stairs and adapt the locomotion gait. Meanwhile, incorporating exteroception necessitates a precisely modeled map observed by exteroceptive sensors over a period of time. Therefore, this work proposes a novel method to fuse proprioception and exteroception featuring resilient multi-modal reinforcement learning. The proposed method yields a controller that showcases agile locomotion performance on a quadrupedal robot over a myriad of real-world courses, including rough terrains, steep slopes, and high-rise stairs, while retaining its robustness against out-of-distribution situations.
comment: IEEE Transactions on Robotics 2026. Project site is available at https://dreamwaqpp.github.io
A spherical amplitude-phase formulation for 3-D adaptive line-of-sight (ALOS) guidance with USGES stability guarantees
A recently proposed 3-D adaptive line-of-sight (ALOS) path-following algorithm addressed coupled motion dynamics of marine craft, aircraft and uncrewed vehicles under environmental disturbances such as wind, waves and ocean currents. Stability analysis established uniform semi-global exponential stability (USGES) using a body-velocity-based amplitude-phase representation of the North-East-Down kinematic differential equations. However, the analysis is limited to straight-line paths, and restrictive assumptions are needed to ensure convergence of the vertical crab angle estimation error to zero. In this paper, we revisit the ALOS framework and introduce a novel spherical amplitude-phase design model that uses an alternative definition of the vertical crab angle. Our proposed formulation enables a significantly simplified stability proof, while retaining the USGES property for straight-line paths, removing restrictive assumptions on constant altitude/depth or zero horizontal crab angle, and remaining valid for general 3-D motion with nonzero roll, pitch and flight-path angles. We also show that the USGES result extends to a class of curved 3-D paths.
comment: 5 pages, 2 figures
STL-Based Motion Planning and Uncertainty-Aware Risk Analysis for Human-Robot Collaboration with a Multi-Rotor Aerial Vehicle
This paper presents a novel approach to motion planning and risk analysis for enhancing human-robot collaboration using a Multi-Rotor Aerial Vehicle (MRAV). The proposed method uses Signal Temporal Logic (STL) to encode key mission objectives, such as safety, timing, and human preferences, with a strong focus on ergonomics and comfort. An optimization framework generates dynamically feasible trajectories while considering the MRAV's physical constraints. Given the nonlinear and non-convex nature of the problem, smooth approximations and gradient-based techniques assist in handling the problem's computational complexity. Additionally, an uncertainty-aware risk analysis is incorporated to assess potential deviations from the mission specifications, providing insights into the likelihood of mission success under uncertain conditions. Further, an event-triggered replanning strategy is implemented to respond to unforeseen events and external disturbances. The approach is validated through MATLAB and Gazebo simulations, using an object handover task in a mock-up environment inspired by power line maintenance scenarios. The results highlight the method's effectiveness in achieving safe, efficient, and resilient human-robot collaboration.
comment: 45 pages, 14 figures
Agentic Vehicles for Human-Centered Mobility
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Autonomous vehicles (AuVs) are therefore understood as systems that perceive their environment and execute pre-programmed tasks independently of external input, consistent with the SAE levels of automated driving. Yet recent research and real-world deployments have begun to showcase vehicles that exhibit behaviors outside the scope of this definition. These include natural language interaction with humans, goal adaptation, contextual reasoning, external tool use, and the handling of unforeseen ethical dilemmas, enabled in part by multimodal large language models (LLMs). These developments highlight not only a gap between technical autonomy and the broader cognitive and social capacities required for human-centered mobility, but also the emergence of a form of vehicle intelligence that currently lacks a clear designation. To address this gap, the paper introduces the concept of agentic vehicles (AgVs): vehicles that integrate agentic AI systems to reason, adapt, and interact within complex environments. It synthesizes recent advances in agentic systems and suggests how AgVs can complement and even reshape conventional autonomy to ensure mobility services are aligned with user and societal needs. The paper concludes by outlining key challenges in the development and governance of AgVs and their potential role in shaping future agentic transportation systems.
Multiagent Systems
ParamMem: Augmenting Language Agents with Parametric Reflective Memory
Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on a stronger external model, highlighting the potential of ParamMem as an effective component for enhancing language agents.
comment: 20 pages
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, \textbf{decision-theoretic view of steganography}. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the \textbf{steganographic gap} -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
comment: First two authors contributed equally
ClawMobile: Rethinking Smartphone-Native Agentic Systems
Smartphones represent a uniquely challenging environment for agentic systems. Unlike cloud or desktop settings, mobile devices combine constrained execution contexts, fragmented control interfaces, and rapidly changing application states. As large language models (LLMs) evolve from conversational assistants to action-oriented agents, achieving reliable smartphone-native autonomy requires rethinking how reasoning and control are composed. We introduce ClawMobile as a concrete exploration of this design space. ClawMobile adopts a hierarchical architecture that separates high-level language reasoning from structured, deterministic control pathways, improving execution stability and reproducibility on real devices. Using ClawMobile as a case study, we distill the design principles for mobile LLM runtimes and identify key challenges in efficiency, adaptability, and stability. We argue that building robust smartphone-native agentic systems demands principled coordination between probabilistic planning and deterministic system interfaces. The implementation is open-sourced at https://github.com/ClawMobile/ClawMobile to facilitate future exploration.
comment: 7 pages, 1 figure
Robust Information Design for Multi-Agent Systems with Complementarities: Smallest-Equilibrium Threshold Policies AAMAS 2026
We study information design in multi-agent systems (MAS) with binary actions and strategic complementarities, where an external designer influences behavior only through signals. Agents play the smallest equilibrium of the induced Bayesian game, reflecting conservative, coordination-averse behavior typical in distributed systems. We show that when utilities admit a convex potential and welfare is convex, the robustly implementable optimum has a remarkably simple form, namely perfect coordination at each state: either everyone acts or no one does. We provide a constructive threshold rule: compute a one-dimensional score for each state, sort states, and pick a single threshold (with a knife-edge lottery for at most one state). This rule is an explicit optimal vertex of a linear program (LP) characterized by feasibility and sequential obedience constraints. Empirically, in both vaccination and technology-adoption domains, our constructive policy matches LP optima, scales as $O(|\Theta|\log|\Theta|)$, and avoids the inflated welfare predicted by obedience-only designs that assume the designer can dictate the (best) equilibrium. The result is a general, scalable recipe for robust coordination in MAS with complementarities.
comment: This paper has been accepted for publication in Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026). The final published version will be available via the ACM Digital Library
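The sort-and-threshold recipe described in the abstract above can be illustrated with a minimal sketch. The abstract does not define the score or the constraint that fixes the threshold, so the per-state `scores`, `costs`, and the scalar `budget` below are hypothetical stand-ins (a fractional-knapsack-style constraint), not the paper's definitions. The sketch only shows the shape of the rule: sort states by score, switch "everyone acts" on from the top, and randomize in at most one marginal state.

```python
from typing import Dict


def threshold_policy(scores: Dict[str, float],
                     costs: Dict[str, float],
                     budget: float) -> Dict[str, float]:
    """Return P(everyone acts | state) for each state.

    Assumptions (hypothetical, for illustration only): every state has a
    nonnegative cost, and a single budget limits how many states can be
    switched on. States are processed in descending score order.
    """
    # Sorting dominates the running time: O(n log n) in the number of states.
    order = sorted(scores, key=lambda s: scores[s], reverse=True)
    policy = {s: 0.0 for s in scores}
    remaining = budget
    for s in order:
        c = costs[s]
        if c <= remaining:
            policy[s] = 1.0            # above the threshold: everyone acts
            remaining -= c
        else:
            policy[s] = remaining / c  # knife-edge lottery at the threshold
            break                      # all lower-scored states stay at 0
    return policy


if __name__ == "__main__":
    demo = threshold_policy(
        scores={"a": 3.0, "b": 2.0, "c": 1.0},
        costs={"a": 1.0, "b": 2.0, "c": 1.0},
        budget=2.0,
    )
    print(demo)
```

By construction, at most one state receives a probability strictly between 0 and 1, which mirrors the "lottery for at most one state" in the abstract.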
QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning ICAPS 2026
Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
comment: 19 pages, 15 figures, 7 tables. Accepted to the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)
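To make the idea of a similarity-weighted TD target concrete, here is a minimal sketch. The abstract does not specify how the near-greedy set, the similarity measure, or the weights are defined, so the choices below are assumptions made for illustration: the near-greedy set contains the greedy joint action plus every joint action that differs from it in exactly one agent's action, similarity is the fraction of agents matching the greedy action, and weights are a softmax over similarity with a hypothetical `temperature`. The function `q_next` is a hypothetical stand-in for the target network's joint Q-value.

```python
import math
from typing import Callable, List, Sequence, Tuple

JointAction = Tuple[int, ...]


def near_greedy_set(greedy: Sequence[int], n_actions: int) -> List[JointAction]:
    """Greedy joint action plus all single-agent deviations (an assumption)."""
    candidates: List[JointAction] = [tuple(greedy)]
    for i, a_star in enumerate(greedy):
        for a in range(n_actions):
            if a != a_star:
                alt = list(greedy)
                alt[i] = a
                candidates.append(tuple(alt))
    return candidates


def similarity(action: Sequence[int], greedy: Sequence[int]) -> float:
    """Fraction of agents whose action matches the greedy joint action."""
    return sum(x == y for x, y in zip(action, greedy)) / len(greedy)


def similarity_weighted_target(reward: float,
                               gamma: float,
                               q_next: Callable[[JointAction], float],
                               greedy: Sequence[int],
                               n_actions: int,
                               temperature: float = 0.1) -> float:
    """TD target using a similarity-weighted expectation instead of the max."""
    candidates = near_greedy_set(greedy, n_actions)
    logits = [similarity(a, greedy) / temperature for a in candidates]
    top = max(logits)                      # subtract max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    expected_q = sum(w / total * q_next(a) for w, a in zip(weights, candidates))
    return reward + gamma * expected_q


if __name__ == "__main__":
    # Toy joint Q-function: the sum of the agents' action indices.
    toy_q = lambda a: float(sum(a))
    print(similarity_weighted_target(0.0, 1.0, toy_q, (1, 1), n_actions=2))
```

When the greedy joint action has the largest Q-value in the set, the weighted expectation cannot exceed the usual max-based target, and lowering `temperature` moves the target back toward the greedy value.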
Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents
Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex's neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching policies. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder's adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at https://github.com/aishwaryyasarkar/rudder-llm-agent.
comment: Accepted to the 40th ACM International Conference on Supercomputing (ICS 2026)
Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding
Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents' movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance: an edge with a high weight discourages agents from using it but does not prohibit them from traversing it. In this paper, we explore the need to incorporate edge-direction optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two separate phases. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.
SkillNet: Create, Evaluate, and Connect AI Skills
Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently ``reinvent the wheel'', rediscovering solutions in isolated contexts without leveraging prior strategies. To overcome this limitation, we introduce SkillNet, an open infrastructure designed to create, evaluate, and organize AI skills at scale. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models. By formalizing skills as evolving, composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.
comment: http://skillnet.openkg.cn/
Causal Graph Dynamics and Kan Extensions
On the one hand, the formalism of Global Transformations comes with the claim of capturing any transformation of space that is local, synchronous and deterministic. The claim has been proven for different classes of models such as mesh refinements from computer graphics, Lindenmayer systems from morphogenesis modeling and cellular automata from biological, physical and parallel computation modeling. The Global Transformation formalism achieves this by using category theory for its genericity, and more precisely the notion of Kan extension to determine the global behaviors based on the local ones. On the other hand, Causal Graph Dynamics describe the transformation of port graphs in a synchronous and deterministic way and have not yet been tackled. In this paper, we show the precise sense in which the claim of Global Transformations holds for them as well. This is done by showing different ways in which they can be expressed as Kan extensions, each of them highlighting different features of Causal Graph Dynamics. Along the way, this work uncovers the interesting class of Monotonic Causal Graph Dynamics and their universality among General Causal Graph Dynamics.
Time-Varying Formation Tracking Control of Wheeled Mobile Robots With Region Constraint: A Generalized Udwadia-Kalaba Framework
In this article, the time-varying formation tracking control of wheeled mobile robots with region constraint is investigated from a generalized Udwadia-Kalaba framework. The communication network is modeled as a directed and weighted graph that has a spanning tree with the leader being the root. By reformulating the time-varying formation tracking control objective as an equality constrained equation and transforming the region constraint by a diffeomorphism, the time-varying formation tracking controller with the region constraint is designed under the generalized Udwadia-Kalaba framework. Compared with the existing works on time-varying formation tracking control, the region constraint is taken into account in this paper, which ensures the safety of the robots. Finally, the feasibility of the proposed control strategy is illustrated through some numerical simulations.
comment: 17 pages, 9 figures
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and thus fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.
Scaling Inference-Time Computation via Opponent Simulation: Enabling Online Strategic Adaptation in Repeated Negotiation
While large language models (LLMs) have emerged as powerful decision-makers across a wide range of single-agent and stationary environments, fewer efforts have been devoted to settings where LLMs must engage in \emph{repeated} and \emph{strategic} interactions with unknown or dynamic opponents. In such settings, recipes built upon \emph{offline} pre-training or fine-tuning, though robust against worst-case adversaries, do not fully exploit the capability of LLMs to adapt \emph{online} based on interaction feedback. Instead, we explore the more natural perspective of scaling inference-time computation as a mechanism for adaptation, embedding the principles of a classical game-theoretical learning dynamic, \emph{smooth Fictitious Play (sFP)}, into LLM inference: (i) for belief formation, we employ an auxiliary opponent model that in-context learns to imitate the time-averaged behavior of the opponent; (ii) for best response, we advance best-of-$N$ (BoN) sampling by simulating against the opponent model. Empirical evaluations on two distinct forms of repeated negotiation games demonstrate that our method enables significant performance improvement over repeated online interaction compared to various baselines, offering a scalable and principled approach to repeated strategic decision-making without any parameter updates.
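For readers unfamiliar with the classical dynamic that the abstract above builds on, the following is a minimal sketch of smooth Fictitious Play in a small matrix game. It is not the paper's LLM-based method: the belief here is a plain empirical frequency of the opponent's past actions, and the response is a softmax over expected payoffs. The payoff matrix, the `temperature` value, and the function names are illustrative assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def empirical_belief(history: Sequence[int], n_opp_actions: int) -> List[float]:
    """Time-averaged opponent behavior; uniform when nothing has been observed."""
    if not history:
        return [1.0 / n_opp_actions] * n_opp_actions
    counts = Counter(history)
    return [counts[a] / len(history) for a in range(n_opp_actions)]


def smooth_best_response(payoff: Sequence[Sequence[float]],
                         belief: Sequence[float],
                         temperature: float = 0.5) -> List[float]:
    """Softmax over expected payoffs against the belief (the 'smooth' step)."""
    expected = [sum(p * b for p, b in zip(row, belief)) for row in payoff]
    top = max(expected)                    # subtract max for numerical stability
    weights = [math.exp((e - top) / temperature) for e in expected]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Toy coordination game: payoff[i][j] is our payoff when we play i
    # and the opponent plays j.
    payoff = [[1.0, 0.0],
              [0.0, 1.0]]
    belief = empirical_belief([0, 0, 0, 1], n_opp_actions=2)
    print(belief, smooth_best_response(payoff, belief))
```

In the abstract's setting, the empirical belief is replaced by an in-context opponent model and the softmax response by best-of-$N$ sampling evaluated through simulation; the sketch only conveys the underlying belief-then-respond loop.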
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning ICRA
Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. 8 pages, 2 figures
HyperAgent: Leveraging Hypergraphs for Topology Optimization in Multi-Agent Communication
Recent advances in large language model-powered multi-agent systems have demonstrated remarkable collective intelligence through effective communication. However, existing approaches face two primary challenges: (i) \textit{Ineffective group collaboration modeling}, as they rely on pairwise edge representations in graph structures, limiting their ability to capture relationships among multiple agents; and (ii) \textit{Limited task-adaptiveness in communication topology design}, leading to excessive communication cost for simple tasks and insufficient coordination for complex scenarios. These issues restrict the scalability and practical deployment of adaptive collaboration frameworks. To address these challenges, we propose \textbf{HyperAgent}, a hypergraph-based framework that optimizes communication topologies and effectively captures group collaboration patterns using direct hyperedge representations. Unlike edge-based approaches, HyperAgent uses hyperedges to link multiple agents within the same subtask and employs hypergraph convolutional layers to achieve one-step information aggregation in collaboration groups. Additionally, it incorporates a variational autoencoder framework with sparsity regularization to dynamically adjust hypergraph topologies based on task complexity. Experiments highlight the superiority of HyperAgent in both performance and efficiency. For instance, on GSM8K, HyperAgent achieves 95.07\% accuracy while reducing token consumption by 25.33\%, demonstrating the potential of hypergraph-based optimization for multi-agent communication.
comment: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results
Systems and Control (EESS)
Millimeter-Wave RIS: Hardware Design and System-Level Considerations
Reconfigurable intelligent surfaces have emerged as a promising hardware platform for shaping wireless propagation environments at millimeter-wave (mm-Wave) frequencies and beyond. While many existing studies emphasize channel modeling and signal processing, practical RIS deployment is fundamentally governed by hardware design choices and their system-level implications. This paper presents a hardware-centric overview of recent mm-Wave RIS developments, covering wideband realizations, high-resolution phase-quantized designs, fully printed low-cost implementations, optically transparent surfaces, RIS-on-chip solutions, and emerging three-dimensional architectures. Key challenges including mutual coupling, calibration, multi-RIS interaction, and frequency-dependent phase control are discussed to bridge hardware realization with system-level optimization. This overview provides practical design insights and aims to guide future RIS research toward scalable, efficient, and practically deployable intelligent surface architectures.
Signal Temporal Logic Verification and Synthesis Using Deep Reachability Analysis and Layered Control Architecture
We propose a signal temporal logic (STL)-based framework that rigorously verifies the feasibility of a mission described in STL and synthesizes control to safely execute it. The proposed framework ensures safe and reliable operation through two phases. First, the proposed framework assesses the feasibility of STL by computing a backward reachable tube (BRT), which captures all states that can satisfy the given STL, regardless of the initial state. The proposed framework accommodates the multiple reach-avoid (MRA) problem to address more general STL specifications and leverages a deep neural network to alleviate the computation burden for reachability analysis, reducing the computation time by about 1000 times compared to a baseline method. We further propose a layered planning and control architecture that combines mixed-integer linear programming (MILP) for global planning with model predictive control (MPC) as a local controller for the verified STL. Consequently, the proposed framework can robustly handle unexpected behavior of obstacles that are not described in the environment information or STL, thereby providing reliable mission performance. Our numerical simulations demonstrate that the proposed framework can successfully compute BRT for a given STL and perform the mission.
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.
Learning-based Multi-agent Race Strategies in Formula 1
In Formula 1, race strategies are adapted according to evolving race conditions and competitors' actions. This paper proposes a reinforcement learning approach for multi-agent race strategy optimization. Agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions. Building on a pre-trained single-agent policy, we introduce an interaction module that accounts for the behavior of competitors. The combination of the interaction module and a self-play training scheme generates competitive policies, and agents are ranked based on their relative performance. Results show that the agents adapt pit timing, tire selection, and energy allocation in response to opponents, achieving robust and consistent race performance. Because the framework relies only on information available during real races, it can support race strategists' decisions before and during races.
Integrated Flight and Propulsion Control for Fixed-Wing UAVs via Thrust and Disturbance Compensation
This paper investigates the position-tracking control problem for fixed-wing unmanned aerial vehicles (UAVs) equipped with a turbojet engine via an integrated flight and propulsion control scheme. To this end, a hierarchical control framework with thrust and disturbance compensation is proposed. In particular, we first propose a perturbed fixed-wing UAV model with turbojet engine dynamics, accounting for both unmodeled dynamics and external disturbances. Second, a versatile extended observer is designed to handle both unmeasurable thrust dynamics and external disturbances. Third, a hierarchical control framework is implemented using three observer-based controllers to guarantee position-tracking performance. With the proposed control strategy, we prove that the closed-loop system asymptotically converges to the desired trajectory. Finally, a comparative simulation is performed to illustrate the proposed control strategy.
comment: 10 pages, 4 figures
Steady State Covariance Steering via Sparse Intervention
This paper addresses steady state covariance steering for linear dynamical systems via structural intervention on the system matrix. We formulate the covariance steering problem as the minimization of the Kullback-Leibler (KL) divergence between the steady state and target Gaussian distributions. To solve the problem, we develop a proximal gradient-based algorithm that promotes sparsity in the structural intervention by integrating the objective into a proximal gradient framework with L1 regularization. The main contribution of this paper lies in the analytical expression of the KL divergence gradient with respect to the intervention matrix: the gradient is characterized by the solutions to two Lyapunov equations related to the state covariance equation and its adjoint. Numerical simulations demonstrate that the proximal gradient-based algorithm effectively identifies sparse, structurally-constrained interventions to achieve precise covariance steering.
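The proximal gradient framework with L1 regularization mentioned in the abstract above can be illustrated with a minimal sketch. It shows only the generic update: a gradient step on the smooth term followed by entrywise soft-thresholding, which is the proximal operator of the L1 norm. The function names, the step size `step`, and the weight `lam` are assumptions for illustration, and `grad` is a placeholder supplied by the caller; the paper's analytical gradient, obtained from two Lyapunov equations, is not reproduced here.

```python
def soft_threshold(matrix, tau):
    """Entrywise proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    def shrink(x):
        if x > tau:
            return x - tau
        if x < -tau:
            return x + tau
        return 0.0
    return [[shrink(x) for x in row] for row in matrix]


def proximal_gradient_step(U, grad, step, lam):
    """One iteration: gradient step on the smooth objective, then the L1 prox.

    U    -- current intervention matrix (list of rows); hypothetical name
    grad -- gradient of the smooth objective at U, supplied by the caller
    """
    moved = [[u - step * g for u, g in zip(u_row, g_row)]
             for u_row, g_row in zip(U, grad)]
    return soft_threshold(moved, step * lam)
```

Entries whose magnitude stays below `step * lam` after the gradient step are set exactly to zero, which is how this kind of iteration produces sparse solutions.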
Transformer Actor-Critic for Efficient Freshness-Aware Resource Allocation ICML
Emerging applications such as autonomous driving and industrial automation demand ultra-reliable and low-latency communication (URLLC), where maintaining fresh and timely information is critical. A key performance metric in such systems is the age of information (AoI). This paper addresses AoI minimization in a multi-user uplink wireless network using non-orthogonal multiple access (NOMA), where users offload tasks to a base station. The system must handle user heterogeneity in task sizes, AoI thresholds, and penalty sensitivities, while adhering to NOMA constraints on user scheduling. We propose a deep reinforcement learning (DRL) framework based on proximal policy optimization (PPO), enhanced with a Transformer encoder. The attention mechanism allows the agent to focus on critical user states and capture inter-user dependencies, improving policy performance and scalability. Extensive simulations show that our method reduces average AoI compared to baselines. We also analyze the evolution of attention weights during training and observe that the model progressively learns to prioritize high-importance users. Attention maps reveal meaningful structure: early-stage policies exhibit uniform attention, while later stages show focused patterns aligned with user priority and NOMA constraints. These results highlight the promise of attention-driven DRL for intelligent, priority-aware resource allocation in next-generation wireless systems.
comment: © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication in the 2026 IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN)
Robust Helicopter Ship Deck Landing With Guaranteed Timing Using Shrinking-Horizon Model Predictive Control
We present a runtime-efficient algorithm for autonomous helicopter landings on moving ship decks based on Shrinking-Horizon Model Predictive Control (SHMPC). First, a suitable planning model capturing the relevant aspects of the full nonlinear helicopter dynamics is derived. Next, we use the SHMPC together with a touchdown controller stage to ensure a pre-specified maneuver time and an associated landing time window despite the presence of disturbances. A high disturbance rejection performance is achieved by designing an ancillary controller with disturbance feedback. Thus, given a target position and time, a safe landing with suitable terminal conditions is guaranteed if the initial optimization problem is feasible. The efficacy of our approach is shown in simulation, where all maneuvers achieve a high landing precision in strong winds while satisfying timing and operational constraints with maximum computation times in the millisecond range.
comment: This version was submitted to the American Control Conference 2026 and has been accepted
Opacity in Discrete Event Systems: A Perspective and Overview
Opacity has emerged as a central confidentiality notion for information-flow security in discrete event systems (DES), capturing the requirement that an external observer (intruder) should never be able to determine with certainty whether the system is, was, or will be in a secret state. This article provides a concise, newcomer-friendly overview of opacity in DES, emphasizing core definitions and the unifying estimation viewpoint behind major opacity notions. We summarize representative verification techniques and highlight how different observation models reshape both the problem formulation and algorithmic structure. We then review principal enforcement paradigms, ranging from opacity-enforcing supervisory control to sensor activation/information release optimization and obfuscation/editing mechanisms. Beyond finite automata, we outline how opacity has been studied in richer models such as stochastic systems, timed systems, Petri nets, and continuous/hybrid dynamics, and we briefly survey applications in robotics, location privacy, and information services. Finally, we discuss selected open challenges, including solvability under incomparable information, scalable methods beyond worst-case complexity, and opacity under intelligent or data-driven adversaries.
Toward Wireless Human-Machine Collaboration in the 6G Era
The next industrial revolution, Industry 5.0, will be driven by advanced technologies that foster human-machine collaboration (HMC). It will leverage human creativity, judgment, and dexterity with the machine's strength, precision, and speed to improve productivity, quality of life, and sustainability. Wireless communications, empowered by the emerging capabilities of sixth-generation (6G) wireless networks, will play a central role in enabling flexible, scalable, and low-cost deployment of geographically distributed HMC systems. In this article, we first introduce the generic architecture and key components of wireless HMC (WHMC). We then present the network topologies of WHMC and highlight impactful applications across various industry sectors. Driven by the prospective applications, we elaborate on new performance metrics that researchers and practitioners may consider during the exploration and implementation of WHMC and discuss new design methodologies. We then summarize the communication requirements and review promising state-of-the-art technologies that can support WHMC. Finally, we present a proof-of-concept case study and identify several open challenges.
comment: This work has been submitted to the IEEE for possible publication
HyperKKL: Enabling Non-Autonomous State Estimation through Dynamic Weight Conditioning ICLR 2026
This paper proposes HyperKKL, a novel learning approach for designing Kazantzis-Kravaris/Luenberger (KKL) observers for non-autonomous nonlinear systems. While KKL observers offer a rigorous theoretical framework by immersing nonlinear dynamics into a stable linear latent space, their practical realization relies on solving Partial Differential Equations (PDEs) that are analytically intractable. Existing learning-based approximations of the KKL observer are mostly designed for autonomous systems and fail to generalize to driven dynamics without expensive retraining or online gradient updates. HyperKKL addresses this by employing a hypernetwork architecture that encodes the exogenous input signal to instantaneously generate the parameters of the KKL observer, effectively learning a family of immersion maps parameterized by the external drive. We rigorously evaluate this approach against a curriculum learning strategy that attempts to generalize from autonomous regimes via training heuristics alone. The novel approach is illustrated in numerical simulations on four benchmark examples: the Duffing, Van der Pol, Lorenz, and Rössler systems.
comment: 18 pages, 6 figures, Under review in ICLR 2026 AI & PDE Workshop
Gradient Dominance in the Linear Quadratic Regulator: A Unified Analysis for Continuous-Time and Discrete-Time Systems
Despite its nonconvexity, policy optimization for the Linear Quadratic Regulator (LQR) admits a favorable structural property known as gradient dominance, which facilitates linear convergence of policy gradient methods to the globally optimal gain. While gradient dominance has been extensively studied, continuous-time and discrete-time LQRs have largely been analyzed separately, relying on slightly different assumptions, proof strategies, and resulting guarantees. In this paper, we present a unified gradient dominance property for both continuous-time and discrete-time LQRs under mild stabilizability and detectability assumptions. Our analysis is based on a convex reformulation derived from a common Lyapunov inequality representation and a unified change-of-variables procedure. This convex-lifting perspective yields a single proof framework applicable to both time models. The unified treatment clarifies how differences between continuous-time and discrete-time dynamics influence theoretical guarantees and reveals a deeper structural symmetry between the two formulations. Numerical examples illustrate and support the theoretical findings.
comment: 28 pages, 4 figures
SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction. In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The proposed pipeline transforms continuous gesture streams into coherent language commands through geometric normalization, temporal smoothing, and lexical refinement, ensuring stable and consistent interaction. Furthermore, the framework is designed to support future integration of transformer-based gloss-free sign language models, enabling scalable word-level and sentence-level semantic understanding. Experimental results demonstrate the effectiveness of the proposed system in grounding sign-derived instructions into precise robotic actions under diverse interaction scenarios. These results highlight the potential of the framework to advance accessible, scalable, and multimodal embodied intelligence.
comment: 7 pages, 2 figures
Small HVAC Control Demonstrations in Larger Buildings Often Overestimate Savings
How much energy, money, and emissions can advanced control of heating and cooling equipment save in real buildings? To address this question, researchers sometimes control a small number of thermal zones within a larger multi-zone building, then report savings for the controlled zones only. That approach can overestimate savings by neglecting heat transfer between controlled zones and adjacent zones. This paper mathematically characterizes the overestimation error when the dynamics are linear and the objectives are linear in the thermal load, as usually holds when optimizing energy efficiency, energy costs, or emissions. Overestimation errors can be large even in seemingly innocuous situations. For example, when controlling only interior zones that have no direct thermal contact with the outdoors, all perceived savings are fictitious. This paper provides an alternative estimation method based on the controlled and adjacent zones' temperature measurements. The new method does not require estimating how much energy the building would have used under baseline operations, so it removes the additional measurement and verification challenge of accurate baseline estimation.
Training with Hard Constraints: Learning Neural Certificates and Controllers for SDEs
Due to their expressive power, neural networks (NNs) are promising templates for functional optimization problems, particularly for reach-avoid certificate generation for systems governed by stochastic differential equations (SDEs). However, ensuring hard-constraint satisfaction remains a major challenge. In this work, we propose two constraint-driven training frameworks with guarantees for supermartingale-based neural certificate construction and controller synthesis for SDEs. The first approach enforces certificate inequalities via domain discretization and a bound-based loss, guaranteeing global validity once the loss reaches zero. We show that this method also enables joint NN controller-certificate synthesis with hard guarantees. For high-dimensional systems where discretization becomes prohibitive, we introduce a partition-free, scenario-based training method that provides arbitrarily tight PAC guarantees for certificate constraint satisfaction. Benchmarks demonstrate scalability of the bound-based method up to 5D, outperforming the state of the art, and scalability of the scenario-based approach to at least 10D with high-confidence guarantees.
comment: Under review
Refining Almost-Safe Value Functions on the Fly
Control Barrier Functions (CBFs) are a powerful tool for ensuring robotic safety, but designing or learning valid CBFs for complex systems is a significant challenge. While Hamilton-Jacobi Reachability provides a formal method for synthesizing safe value functions, it scales poorly and is typically performed offline, limiting its applicability in dynamic environments. This paper bridges the gap between offline synthesis and online adaptation. We introduce refineCBF for refining an approximate CBF - whether analytically derived, learned, or even unsafe - via warm-started HJ reachability. We then present its computationally efficient successor, HJ-Patch, which accelerates this process through localized updates. Both methods guarantee the recovery of a safe value function and can ensure monotonic safety improvements during adaptation. Our experiments validate our framework's primary contribution: in-the-loop, real-time adaptation, in simulation (with detailed value function analysis) and on physical hardware. Our experiments on ground vehicles and quadcopters show that our framework can successfully adapt to sudden environmental changes, such as new obstacles and unmodeled wind disturbances, providing a practical path toward deploying formally guaranteed safety in real-world settings.
Cybersecurity of Teleoperated Quadruped Robots: A Systematic Survey of Vulnerabilities, Threats, and Open Defense Gaps
Teleoperated quadruped robots are increasingly deployed in safety-critical missions -- industrial inspection, military reconnaissance, and emergency response -- yet the security of their communication and control infrastructure remains insufficiently characterized. Quadrupeds present distinct security challenges arising from dynamic stability constraints, gait-dependent vulnerability windows, substantial kinetic energy, and elevated operator cognitive load. This survey synthesizes peer-reviewed literature and vulnerability disclosures (2019--2025) to provide comprehensive analysis of cybersecurity threats, consequences, and countermeasures for teleoperated quadruped systems. We contribute: (i) a six-layer attack taxonomy spanning perception manipulation, VR/AR operator targeting, communication disruption, control signal attacks, localization spoofing, and network intrusion; (ii) systematic attack-to-consequence mapping with timing characterization; (iii) Technology Readiness Level classification exposing critical maturity gaps between field-deployed communication protections (TRL 7--9) and experimental perception/operator-layer defenses (TRL 3--5); (iv) comparative security analysis of six commercial platforms; (v) pragmatic deployment guidance stratified by implementation timeline; and (vi) eight prioritized research gaps with implementation roadmaps. Limitations: Platform assessments rely on publicly available information. Attack success rates derive from cited studies under controlled conditions and require domain-specific validation.
comment: survey paper; 23 tables; 9 figures; 132 references
Lifecycle-Integrated Security for AI-Cloud Convergence in Cyber-Physical Infrastructure
The convergence of Artificial Intelligence (AI) inference pipelines with cloud infrastructure creates a dual attack surface where cloud security standards and AI governance frameworks intersect without unified enforcement mechanisms. AI governance, cloud security, and industrial control system standards intersect without unified enforcement, leaving hybrid deployments exposed to cross-layer attacks that threaten safety-critical operations. This paper makes three primary contributions: (i) we synthesize these frameworks into a lifecycle-staged threat taxonomy structured around explicit attacker capability tiers, (ii) we propose a Unified Reference Architecture spanning a Secure Data Factory, a hardened model supply chain, and a runtime governance layer, (iii) we present a case study through Grid-Guard, a hybrid Transmission System Operator scenario in which coordinated defenses drawn from NIST AI RMF, MITRE ATLAS, OWASP AI Exchange and GenAI, CSA MAESTRO, and NERC CIP defeat a multi-tier physical-financial manipulation campaign without human intervention. Controls are mapped against all five frameworks and current NERC CIP standards to demonstrate that a single cloud-native architecture can simultaneously satisfy AI governance, adversarial robustness, agentic safety, and industrial regulatory compliance obligations.
Embedding Morphology into Transformers for Cross-Robot Policy Learning
Cross-robot policy learning -- training a single policy to perform well across multiple embodiments -- remains a central challenge in robot learning. Transformer-based policies, such as vision-language-action (VLA) models, are typically embodiment-agnostic and must infer kinematic structure purely from observations, which can reduce robustness across embodiments and even limit performance within a single embodiment. We propose an embodiment-aware transformer policy that injects morphology via three mechanisms: (1) kinematic tokens that factorize actions across joints and compress time through per-joint temporal chunking; (2) a topology-aware attention bias that encodes kinematic topology as an inductive bias in self-attention, encouraging message passing along kinematic edges; and (3) joint-attribute conditioning that augments topology with per-joint descriptors to capture semantics beyond connectivity. Across a range of embodiments, this structured integration consistently improves performance over a vanilla pi0.5 VLA baseline, indicating improved robustness both within an embodiment and across embodiments.
comment: 17 pages, 8 figures (including appendix)
Design Framework and Manufacturing of an Active Magnetic Bearing Spindle for Micro-Milling Applications
Micro-milling spindles require high rotational speeds where conventional rolling element bearings face limitations such as friction and thermal expansion. Active magnetic bearings (AMBs) address these challenges by providing non-contact and lubrication-free operation at ultra-high speeds with the ability to actively regulate spindle dynamics. The existing literature on AMB spindles has mainly reported specific prototype realizations or control system implementations for specific spindle dynamics. Consequently, design knowledge remains fragmented across isolated successful studies. This paper addresses this gap by presenting a systematic and iterative framework to design and manufacture a micro-milling AMB spindle. The process involves a multidisciplinary design flow with a focus on critical practical aspects of manufacturing. The realized spindle is reported as a case study.
Log-linear Dynamic Inversion for Thrusting Spacecraft on SE2(3)
We demonstrate that the error dynamics of a thrusting spacecraft are nearly group affine on the $SE_2(3)$ Lie group, and the nonlinearity can be bounded, or removed with the application of a dynamic inversion control law. A numerical example validates the results by showing agreement between the error predicted by the log-dynamics and the error obtained from classical integration of trajectories using Newtonian dynamics. The result clarifies how thrusting spacecraft dynamics fit within the invariant systems framework.
Cyber Attacks Detection, Prevention, and Source Localization in Digital Substation Communication using Hybrid Statistical-Deep Learning
The digital transformation of power systems is accelerating the adoption of the IEC 61850 standard. However, its communication protocols, including Sampled Values (SV), lack built-in security features such as authentication and encryption, making them vulnerable to malicious packet injection. Such cyber attacks can delay fault clearance or trigger unintended circuit breaker operations. While most existing research focuses on detecting cyber attacks in digital substations, intrusion prevention systems have been disregarded because of the risk of potential communication network disruptions. This paper proposes a novel method using hybrid statistical-deep learning for the detection, prevention, and source localization of IEC 61850 SV injection attacks. The method uses exponentially modified Gaussian distributions to model communication network latency, and long short-term memory and Elman recurrent neural networks to detect anomalous variations in the estimated probability distributions. It effectively discards malicious SV frames with minimal processing overhead and latency, maintains robustness against communication network latency variation and time-synchronization issues, and guarantees a near-zero false positive rate in non-attack scenarios. Comprehensive validation is conducted on three testbeds involving industrial-grade devices, hardware-in-the-loop simulations, virtualized intelligent electronic devices and merging units, and high-fidelity emulated communication networks. Results demonstrate the method's suitability for practical deployment in IEC 61850-compliant digital substations.
comment: 11 pages, 7 figures. This work has been submitted to the IEEE for possible publication
Gain-Scheduling Data-Enabled Predictive Control for Nonlinear Systems with Linearized Operating Regions
This paper presents a Gain-Scheduled Data-Enabled Predictive Control (GS-DeePC) framework for nonlinear systems based on multiple locally linear data representations. Instead of relying on a single global Hankel matrix, the operating range of a measurable scheduling variable is partitioned into regions, and regional Hankel matrices are constructed from persistently exciting data. To ensure smooth transitions between linearization regions and suppress region-induced chattering, composite regions are introduced, merging neighboring data sets and enabling a robust switching mechanism. The proposed method maintains the original DeePC problem structure and can achieve reduced computational complexity by requiring only short, locally informative data sequences. Extensive experiments on a nonlinear DC motor with an unbalanced disc demonstrate significantly improved control performance compared to standard DeePC.
comment: 8 pages, 3 figures, 2 tables
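For the GS-DeePC entry above, the idea of regional Hankel matrices indexed by a scheduling variable can be sketched as follows. This is a minimal illustration for a scalar signal: the function names, the region boundaries, and the plain threshold lookup are assumptions, and the composite regions and switching mechanism described in the abstract are not modeled.

```python
def hankel_matrix(signal, depth):
    """Hankel matrix with `depth` rows built from a scalar data sequence.

    Entry (i, j) is signal[i + j], so each column is a length-`depth` window.
    """
    if depth < 1 or depth > len(signal):
        raise ValueError("depth must be between 1 and len(signal)")
    columns = len(signal) - depth + 1
    return [[signal[i + j] for j in range(columns)] for i in range(depth)]


def select_region(scheduling_value, boundaries):
    """Index of the region containing scheduling_value.

    `boundaries` is an ascending list of hypothetical region edges; values
    below the first edge fall in region 0.
    """
    region = 0
    for edge in boundaries:
        if scheduling_value >= edge:
            region += 1
    return region
```

In such a scheme, one Hankel matrix would be built per region from data recorded in that region, and `select_region` would pick which matrix the predictive controller uses at the current operating point.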
Preference Analysis Using Random Spanning Trees: A Stochastic Sampling Approach to Inconsistent Pairwise Comparisons
Eliciting preferences from human judgements is inherently imprecise, yet most decision analysis methods force a single priority vector from pairwise comparisons, discarding the information embedded in inconsistencies. We instead leverage inconsistency to characterise preference uncertainty by examining all priority vectors consistent with the decision maker's judgements. Spanning tree analysis enumerates combinations of evaluation and weighting vectors from pairwise comparison subsets, each yielding a distinct preference vector and collectively defining a distribution over possible preference orderings. Since exponential growth renders complete enumeration prohibitive, we propose a stochastic random walk sampling approach with sample sizes formally established via statistical sampling theory. This enables two key metrics: Pairwise Winning Indices (PWIs), the probability one alternative is preferred to another, and Rank Acceptability Indices (RAIs), the probability an alternative attains a given rank. A notable advantage is applicability to incomplete pairwise comparisons, common in large-scale problems. We validate the methodology against complete enumeration on a didactic example, then demonstrate scalability on a telecommunications backbone infrastructure selection case study involving billions of spanning tree combinations. The approach yields probabilistic insights into preference robustness and ranking uncertainty, supporting informed decisions without the burden of exact enumeration.
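The sampling idea in the abstract above can be sketched in a few lines. The sketch assumes the Aldous-Broder random walk as the spanning tree sampler and a simple rule that propagates ratio judgements along tree edges; both are illustrative choices, not necessarily the authors' procedure. All names are hypothetical. Alternatives are numbered 0..n-1 with n at least 2, `neighbors[i]` lists the alternatives compared with i (the comparison graph must be connected), and `ratio[(i, j)]` is the judgement of how many times i is preferred to j.

```python
import random


def random_spanning_tree(neighbors, rng):
    """Aldous-Broder: walk at random until every node is visited, keeping first-entry edges."""
    nodes = list(neighbors)
    current = rng.choice(nodes)
    visited = {current}
    edges = []
    while len(visited) < len(nodes):
        nxt = rng.choice(neighbors[current])
        if nxt not in visited:
            visited.add(nxt)
            edges.append((current, nxt))
        current = nxt
    return edges


def priorities_from_tree(edges, ratio, n):
    """Propagate ratio judgements along the tree, then normalize to sum to one."""
    root = edges[0][0]
    weights = {root: 1.0}
    for parent, child in edges:  # discovery order, so the parent weight is already known
        weights[child] = weights[parent] / ratio[(parent, child)]
    total = sum(weights.values())
    return [weights[i] / total for i in range(n)]


def pairwise_winning_index(neighbors, ratio, i, j, samples, seed=0):
    """Fraction of sampled trees in which alternative i gets a higher priority than j."""
    rng = random.Random(seed)
    n = len(neighbors)
    wins = 0
    for _ in range(samples):
        tree = random_spanning_tree(neighbors, rng)
        w = priorities_from_tree(tree, ratio, n)
        if w[i] > w[j]:
            wins += 1
    return wins / samples
```

With perfectly consistent judgements every tree yields the same priority vector, so the index is 0 or 1; inconsistency spreads the sampled vectors and moves the index into the interior of that range.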
Efficient CNN Inference on Ultra-Low-Power MCUs via Saturation-Aware Convolution
Quantized CNN inference on ultra-low-power MCUs incurs unnecessary computations in neurons that produce saturated output values. These values are too extreme and are eventually clamped to the boundaries allowed by the neuron. Often, the neuron can save time by only producing a value that is extreme enough to lead to the clamped result, instead of completing the computation, without introducing any error. Based on this, we present saturation-aware convolution: an inference technique whereby we alter the order of computations in convolution kernels to induce earlier saturation, and insert value checks to omit unnecessary computations when the intermediate result is sufficiently extreme. Our experimental results show up to 24% inference time savings on a Cortex-M0+ MCU, with zero impact on accuracy.
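The early-exit idea described above can be illustrated with a single clamped dot product. This is a simplified sketch, not the authors' kernel: it assumes non-negative activations (for example, outputs of a preceding ReLU), applies the clamp directly to the integer accumulator without a requantization step, and uses one particular reordering (non-negative weights first, negative weights last). The function name and the bounds `lo` and `hi` are hypothetical.

```python
def saturating_dot(weights, activations, bias, lo=-128, hi=127):
    """Clamped dot product with an exact early exit.

    Assumes every activation is >= 0. Returns (clamped result, number of
    multiply-accumulates actually performed).
    """
    # Non-negative weights first (key False), negative weights last (key True).
    # On a real device this reordering would be done once, offline.
    pairs = sorted(zip(weights, activations), key=lambda p: p[0] < 0)
    acc = bias
    macs = 0
    for w, a in pairs:
        if w < 0 and acc <= lo:
            # Every remaining term is <= 0, so the sum can only drop further
            # and the clamped output is already known to be `lo`.
            break
        acc += w * a
        macs += 1
    return max(lo, min(hi, acc)), macs
```

For example, `saturating_dot([3, -5, 2, -4, -1], [10, 20, 5, 30, 7], 0)` returns `(-128, 4)`: the accumulator reaches -180 after four terms, so the fifth multiply-accumulate is skipped while the output matches the full computation (-187 clamped to -128).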
Linear viscoelastic rheological FrBD models
In [1], a new modeling paradigm for developing rate-and-state-dependent, control-oriented friction models was introduced. The framework, termed Friction with Bristle Dynamics (FrBD), combines nonlinear analytical expressions for the friction coefficient with constitutive equations for bristle-like elements. Within the FrBD framework, this letter introduces two novel formulations based on the two most general linear viscoelastic models for solids: the Generalized Maxwell (GM) and Generalized Kelvin-Voigt (GKV) elements. Both are analyzed in terms of boundedness and passivity, revealing that these properties are satisfied for any physically meaningful parametrization. An application of passivity for control design is also illustrated, considering an example from robotics. The findings of this letter systematically integrate rate-and-state dynamic friction models with linear viscoelasticity.
comment: 6 pages, 3 figures. Under review at IEEE LCSS
Informativity and Identifiability for Identification of Networks of Dynamical Systems
In this paper, we show how informativity and identifiability for networks of dynamical systems can be investigated using Gröbner bases. We provide a sufficient condition for informativity in terms of positive definiteness of the spectrum of external signals and full generic rank of the transfer function relating the external signals to the inputs of the predictor. Moreover, we show how generic local network identifiability can be investigated by computing the dimension of the fiber associated with the closed loop transfer function from external measurable signals to the measured outputs.
comment: Submitted to IEEE TAC
Optimization with Multi-sourced Information and Unknown Reliability: A Distributionally Robust Approach
In problems that involve input parameter information gathered from multiple data sources with varying reliability, incorporating decision makers' trust in different sources in optimization models can potentially improve solution performance. In this work, we propose a novel multi-reference distributionally robust optimization (MR-DRO) framework, where the model inputs are uncertain and their probability distributions can be statistically inferred from multiple information sources. Via nonparametric data fusion, we construct a Wasserstein ambiguity set to minimize the worst-case expected cost of a stochastic objective function, accounting for both uncertainty and unknown reliability of several given information sources. We reformulate the MR-DRO model as a linear program given linear objective and constraints in the original problem. We also incorporate a dynamic trust update mechanism that adjusts the trust for each source based on its performance over time. In addition, we introduce the concept of probability dominance to identify sources with dominant trust. Via computational studies using resource allocation and portfolio optimization instances, we demonstrate the effectiveness of the MR-DRO approach compared to traditional optimization frameworks relying on a single data source. Our results highlight the significance of integrating (dynamic) decision maker's trust in optimization under uncertainty, particularly when given diverse and potentially conflicting input data.
comment: 38 pages, 9 figures, 7 tables
Passive Beam Shaping via Binary-Coded Apertures
This paper presents a coded-aperture reflector for indoor mmWave coverage enhancement in obstructed or blocked LoS settings. We model the reflecting aperture using an equivalent array-factor formulation, where each passive reflecting cell contributes a reradiated field with phase set by the incident and departure directions. Building on this model, we develop two fabrication-friendly passive synthesis methods: (i) binary (1-bit) spatial coding that enables deterministic non-specular beam formation and multibeam patterns by selecting cell participation on a dense λ/2 lattice via an ON/OFF metallization mask, and (ii) diffraction-order (periodic) steering that exploits aperture periodicity to place selected diffraction orders at prescribed angles. We analytically characterize the proposed cosine-threshold quantization rule, including its asymptotic activation ratio and a distribution-free lower bound on non-specular gain relative to ideal continuous-phase control. To validate the proposed designs, we fabricate and metallize low-cost prototypes in-house using a copper-backed 3D-printed "inkwell" substrate with stencil-guided conductive ink deposition. 60 GHz over-the-air measurements show non-specular power enhancements on the order of +14-20 dB relative to passive, non-engineered (all-ON) reflector baselines. Results also demonstrate that fully passive, binary-coded apertures can deliver beam control with rapid in-lab manufacturability and offer a practical alternative to power-consuming reconfigurable surfaces for static indoor mmWave links.
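A rough numerical sketch of 1-bit ON/OFF coding on a half-wavelength lattice is given below. It rests on assumptions that are not stated in the abstract: a 1-D aperture, a per-cell phase toward the target measured relative to the specular direction, and a threshold of zero on the cosine of that phase. All names are hypothetical, and the paper's exact cosine-threshold rule may differ.

```python
import cmath
import math


def cell_phases(n_cells, spacing_wavelengths, sin_target, sin_specular):
    """Phase of each cell's contribution toward the target, relative to specular.

    Assumed convention: phase_n = k * x_n * (sin(target) - sin(specular)),
    with x_n = n * spacing and k = 2 * pi / wavelength.
    """
    k_d = 2.0 * math.pi * spacing_wavelengths
    return [k_d * n * (sin_target - sin_specular) for n in range(n_cells)]


def binary_mask(phases, threshold=0.0):
    """Keep (metallize) only the cells that add constructively toward the target."""
    return [1 if math.cos(p) > threshold else 0 for p in phases]


def field_magnitude(mask, phases):
    """Magnitude of the array-factor sum in the target direction."""
    return abs(sum(m * cmath.exp(1j * p) for m, p in zip(mask, phases)))


phases = cell_phases(64, 0.5, sin_target=0.3, sin_specular=0.0)
coded = binary_mask(phases)
all_on = [1] * len(phases)

gain_db = 20.0 * math.log10(field_magnitude(coded, phases) / field_magnitude(all_on, phases))
activation_ratio = sum(coded) / len(coded)
print(round(gain_db, 1), round(activation_ratio, 2))
```

With a zero threshold roughly half of the cells stay ON, and the coded aperture concentrates more field toward the non-specular target than the all-ON baseline in this idealized model. The numbers are illustrative only and are not the paper's measurements.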
DreamWaQ++: Obstacle-Aware Quadrupedal Locomotion With Resilient Multi-Modal Reinforcement Learning
Quadrupedal robots hold promising potential for applications in navigating cluttered environments with resilience akin to their animal counterparts. However, their floating base configuration makes them vulnerable to real-world uncertainties, yielding substantial challenges in their locomotion control. Deep reinforcement learning has become one of the plausible alternatives for realizing a robust locomotion controller. However, the approaches that rely solely on proprioception sacrifice collision-free locomotion because they require front-feet contact to detect the presence of stairs to adapt the locomotion gait. Meanwhile, incorporating exteroception necessitates a precisely modeled map observed by exteroceptive sensors over a period of time. Therefore, this work proposes a novel method to fuse proprioception and exteroception using resilient multi-modal reinforcement learning. The proposed method yields a controller that showcases agile locomotion performance on a quadrupedal robot over a myriad of real-world courses, including rough terrains, steep slopes, and high-rise stairs, while retaining its robustness against out-of-distribution situations.
comment: IEEE Transactions on Robotics 2026. Project site is available at https://dreamwaqpp.github.io
A spherical amplitude-phase formulation for 3-D adaptive line-of-sight (ALOS) guidance with USGES stability guarantees
A recently proposed 3-D adaptive line-of-sight (ALOS) path-following algorithm addressed coupled motion dynamics of marine craft, aircraft and uncrewed vehicles under environmental disturbances such as wind, waves and ocean currents. Stability analysis established uniform semi-global exponential stability (USGES) using a body-velocity-based amplitude-phase representation of the North-East-Down kinematic differential equations. However, the analysis is limited to straight-line paths, and restrictive assumptions are needed to ensure convergence of the vertical crab angle estimation error to zero. In this paper, we revisit the ALOS framework and introduce a novel spherical amplitude-phase design model that uses an alternative definition of the vertical crab angle. Our proposed formulation enables a significantly simplified stability proof, while retaining the USGES property for straight-line paths, removing restrictive assumptions on constant altitude/depth or zero horizontal crab angle, and remaining valid for general 3-D motion with nonzero roll, pitch and flight-path angles. We also show that the USGES result extends to a class of curved 3-D paths.
comment: 5 pages, 2 figures
Experimental Multi-site Testbed for Advanced Control and Optimization of Hybrid Energy Systems
This paper presents a hybrid energy system (HES) experimental testbed developed at the University of Vermont, featuring a dual-site architecture that integrates an on-campus laboratory facility with an off-campus solar and meteorological station. This supports the prototyping and validation of advanced HES control and optimization strategies. The platform integrates hardware-in-the-loop (HIL) simulations with a reconfigurable set of kVA-scale assets. A unified monitoring and communication architecture supports real-time data acquisition, model validation, and control implementation. The capabilities of the testbed are demonstrated through an HIL experiment in which a battery system participates in solar PV smoothing.
Robotics
Position-Based Flocking for Persistent Alignment without Velocity Sensing
Coordinated collective motion in bird flocks and fish schools inspires algorithms for cohesive swarm robotics. This paper presents a position-based flocking model that achieves persistent velocity alignment without velocity sensing. By approximating relative velocity differences from changes between current and initial relative positions and incorporating a time- and density-dependent alignment gain with a non-zero minimum threshold to maintain persistent alignment, the model sustains coherent collective motion over extended periods. Simulations with a collective of 50 agents demonstrate that the position-based flocking model attains faster and more sustained directional alignment and results in more compact formations than a velocity-alignment-based baseline. This position-based flocking model is particularly well-suited for real-world robotic swarms, where velocity measurements are unreliable, noisy, or unavailable. Experimental results using a team of nine real wheeled mobile robots are also presented.
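The position-only approximation of relative velocity mentioned in the abstract above can be sketched as a finite difference between the current and initial relative positions. Dividing by the elapsed time is an assumption made here so that the estimate has velocity units; the function name is hypothetical, and the paper's alignment gain is not modeled.

```python
def relative_velocity_estimate(p_i0, p_j0, p_i, p_j, elapsed):
    """Estimate v_j - v_i from positions only.

    Uses the change in relative position (p_j - p_i) since the initial
    time, averaged over the elapsed time.
    """
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return tuple(
        ((b - a) - (b0 - a0)) / elapsed
        for a0, b0, a, b in zip(p_i0, p_j0, p_i, p_j)
    )


# Two agents moving at constant velocity for 4 time units.
v_i, v_j = (1.0, 0.0), (0.5, 0.5)
p_i0, p_j0 = (0.0, 0.0), (2.0, 1.0)
t = 4.0
p_i = tuple(p + v * t for p, v in zip(p_i0, v_i))
p_j = tuple(p + v * t for p, v in zip(p_j0, v_j))

estimate = relative_velocity_estimate(p_i0, p_j0, p_i, p_j, t)
print(estimate)  # (-0.5, 0.5)
```

For constant velocities the estimate equals the true velocity difference; for varying velocities it is only the average over the elapsed interval.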
System Design of the Ultra Mobility Vehicle: A Driving, Balancing, and Jumping Bicycle Robot
Trials cyclists and mountain bike riders can hop, jump, balance, and drive on one or both wheels. This versatility allows them to achieve speed and energy-efficiency on smooth terrain and agility over rough terrain. Inspired by these athletes, we present the design and control of a robotic platform, Ultra Mobility Vehicle (UMV), which combines a bicycle and a reaction mass to move dynamically with minimal actuated degrees of freedom. We employ a simulation-driven design optimization process to synthesize a spatial linkage topology with a focus on vertical jump height and momentum-based balancing on a single wheel contact. Using a constrained Reinforcement Learning (RL) framework, we demonstrate zero-shot transfer of diverse athletic behaviors, including track-stands, jumps, wheelies, rear wheel hopping, and front flips. This 23.5 kg robot is capable of high speeds (8 m/s) and jumping on and over large obstacles (1 m tall, or 130% of the robot's nominal height).
comment: 19 Pages, 11 figures, 3 movies, 2 tables
Behavioral Cloning for Robotic Connector Assembly: An Empirical Study
Automating the assembly of wire harnesses is challenging in automotive, electrical cabinet, and aircraft production, particularly due to deformable cables and a high variance in connector geometries. In addition, connectors must be inserted with limited force to avoid damage, while their poses can vary significantly. While humans can do this task intuitively by combining visual and haptic feedback, programming an industrial robot for such a task in an adaptable manner remains difficult. This work presents an empirical study investigating the suitability of behavioral cloning for learning an action prediction model for connector insertion that fuses force-torque sensing with a fixed-position camera. We compare several network architectures and other design choices using a dataset of up to 300 successful human demonstrations collected via teleoperation of a UR5e robot with a SpaceMouse under varying connector poses. The resulting system is then evaluated against five different connector geometries under varying connector poses, achieving an overall insertion success rate of over 90%.
comment: 8 pages
Force Policy: Learning Hybrid Force-Position Control Policy under Interaction Frame for Contact-Rich Manipulation
Contact-rich manipulation demands human-like integration of perception and force feedback: vision should guide task progress, while high-frequency interaction control must stabilize contact under uncertainty. Existing learning-based policies often entangle these roles in a monolithic network, trading off global generalization against stable local refinement, while control-centric approaches typically assume a known task structure or learn only controller parameters rather than the structure itself. In this paper, we formalize a physically grounded interaction frame, an instantaneous local basis that decouples force regulation from motion execution, and propose a method to recover it from demonstrations. Based on this, we address both issues by proposing Force Policy, a global-local vision-force policy in which a global policy guides free-space actions using vision, and upon contact, a high-frequency local policy with force feedback estimates the interaction frame and executes hybrid force-position control for stable interaction. Real-world experiments across diverse contact-rich tasks show consistent gains over strong baselines, with more robust contact establishment, more accurate force regulation, and reliable generalization to novel objects with varied geometries and physical properties, ultimately improving both contact stability and execution quality. Project page: https://force-policy.github.io/
FlowCorrect: Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation
Generative manipulation policies can fail catastrophically under deployment-time distribution shift, yet many failures are near-misses: the robot reaches almost-correct poses and would succeed with a small corrective motion. We present FlowCorrect, a deployment-time correction framework that converts near-miss failures into successes using sparse human nudges, without full policy retraining. During execution, a human provides brief corrective pose nudges via a lightweight VR interface. FlowCorrect uses these sparse corrections to locally adapt the policy, improving actions without retraining the backbone while preserving the model performance on previously learned scenarios. We evaluate on a real-world robot across three tabletop tasks: pick-and-place, pouring, and cup uprighting. With a low correction budget, FlowCorrect improves success on hard cases by 85% while preserving performance on previously solved scenarios. The results demonstrate that FlowCorrect learns from very few demonstrations and enables fast, sample-efficient, incremental human-in-the-loop corrections of generative visuomotor policies at deployment time in real-world robotics.
comment: 8 pages, 5 figures
World Guidance: World Modeling in Condition Space for Action Generation
Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from substantial human manipulation videos. Extensive experiments across both simulation and real-world environments validate that our method significantly outperforms existing methods based on future prediction. Project page is available at: https://selen-suyue.github.io/WoGNet/
comment: Project Page: https://selen-suyue.github.io/WoGNet/
Parallel Continuous-Time Relative Localization with Augmented Clamped Non-Uniform B-Splines
Accurate relative localization is critical for multi-robot cooperation. In robot swarms, measurements from different robots arrive asynchronously and with clock time-offsets. Although Continuous-Time (CT) formulations have proved effective for handling asynchronous measurements in single-robot SLAM and calibration, extending CT methods to multi-robot settings faces great challenges in achieving high-accuracy, low-latency, and high-frequency performance. In particular, existing CT methods suffer from the inherent query-time delay of unclamped B-splines and high computational cost. This paper proposes CT-RIO, a novel Continuous-Time Relative-Inertial Odometry framework. We employ Clamped Non-Uniform B-splines (C-NUBS) to represent robot states for the first time, eliminating the query-time delay. We further augment C-NUBS with closed-form extension and shrinkage operations that preserve the spline shape, making it suitable for online estimation and enabling flexible knot management. This flexibility leads to a knot-keyknot strategy, which supports spline extension at high frequency while retaining sparse keyknots for adaptive relative-motion modeling. We then formulate a sliding-window relative localization problem that operates purely on relative kinematics and inter-robot constraints. To meet the demanding computation required at swarm scale, we decompose the tightly-coupled optimization into robot-wise sub-problems and solve them in parallel using incremental asynchronous block coordinate descent. Extensive experiments show that CT-RIO converges from time-offsets as large as 263 ms to sub-millisecond within 3 s, and achieves RMSEs of 0.046 m and 1.8°. It consistently outperforms state-of-the-art methods, with improvements of up to 60% under high-speed motion.
comment: 26 pages, 23 figures
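The role of clamping can be illustrated with a textbook de Boor evaluation, shown here as a sketch on a scalar spline. A clamped knot vector repeats the first and last knots degree + 1 times, so the curve is defined up to the last knot and passes through the last control point there; this is the standard property that lets the newest state be queried directly. The knot values and control points are made up, and this is not the CT-RIO implementation.

```python
def find_span(knots, degree, n_ctrl, t):
    """Index k with knots[k] <= t < knots[k + 1], clamped at the right end."""
    if t >= knots[n_ctrl]:
        return n_ctrl - 1
    k = degree
    while knots[k + 1] <= t:
        k += 1
    return k


def de_boor(knots, ctrl, degree, t):
    """Evaluate a B-spline at t with de Boor's algorithm."""
    k = find_span(knots, degree, len(ctrl), t)
    d = [ctrl[k - degree + j] for j in range(degree + 1)]
    for r in range(1, degree + 1):
        for j in range(degree, r - 1, -1):
            i = k - degree + j
            alpha = (t - knots[i]) / (knots[i + degree - r + 1] - knots[i])
            d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j]
    return d[degree]


degree = 3
ctrl = [0.0, 1.0, 3.0, 2.0, 4.0, 5.0]
# Clamped, non-uniform knots: end knots repeated degree + 1 times.
knots = [0.0, 0.0, 0.0, 0.0, 0.3, 0.7, 1.5, 1.5, 1.5, 1.5]

start_value = de_boor(knots, ctrl, degree, 0.0)
end_value = de_boor(knots, ctrl, degree, 1.5)
print(start_value, end_value)  # 0.0 5.0
```

With an unclamped knot vector the valid evaluation interval ends before the last knots, which is one way to understand the query-time delay described in the abstract.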
Are Foundation Models the Route to Full-Stack Transfer in Robotics?
In humans and robots alike, transfer learning occurs at different levels of abstraction, from high-level linguistic transfer to low-level transfer of motor skills. In this article, we provide an overview of the impact that foundation models and transformer networks have had on these different levels, bringing robots closer than ever to "full-stack transfer". Considering LLMs, VLMs and VLAs from a robotic transfer learning perspective allows us to highlight recurring concepts for transfer, beyond specific implementations. We also consider the challenges of data collection and transfer benchmarks for robotics in the age of foundation models. Are foundation models the route to full-stack transfer in robotics? Our expectation is that they will certainly stay on this route as a key technology.
comment: 12 pages, 4 figures
Humanizing Robot Gaze Shifts: A Framework for Natural Gaze Shifts in Humanoid Robots
Leveraging auditory and visual feedback for attention reorientation is essential for natural gaze shifts in social interaction. However, enabling humanoid robots to perform natural and context-appropriate gaze shifts in unconstrained human--robot interaction (HRI) remains challenging, as it requires the coupling of cognitive attention mechanisms and biomimetic motion generation. In this work, we propose the Robot Gaze-Shift (RGS) framework, which integrates these two components into a unified pipeline. First, RGS employs a vision--language model (VLM)-based gaze reasoning pipeline to infer context-appropriate gaze targets from multimodal interaction cues, ensuring consistency with human gaze-orienting regularities. Second, RGS introduces a conditional Vector Quantized-Variational Autoencoder (VQ-VAE) model for eye--head coordinated gaze-shift motion generation, producing diverse and human-like gaze-shift behaviors. Experiments validate that RGS effectively replicates human-like target selection and generates realistic, diverse gaze-shift motions.
comment: submitted to AIM 2026
Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments
In addition to the core tasks of simultaneous localization and mapping (SLAM), active SLAM involves generating robot actions that enable effective and efficient exploration of unknown environments. However, existing active SLAM pipelines are limited by three main factors. First, they inherit the restrictions of the underlying SLAM modules that they may be using. Second, their motion planning strategies are typically shortsighted and lack long-term vision. Third, most approaches struggle to handle dynamic scenes. To address these limitations, we propose a novel monocular active SLAM method, Dream-SLAM, which is based on dreaming cross-spatio-temporal images and semantically plausible structures of partially observed dynamic environments. The generated cross-spatio-temporal images are fused with real observations to mitigate noise and data incompleteness, leading to more accurate camera pose estimation and a more coherent 3D scene representation. Furthermore, we integrate dreamed and observed scene structures to enable long-horizon planning, producing farsighted trajectories that promote efficient and thorough exploration. Extensive experiments on both public and self-collected datasets demonstrate that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency. Source code will be publicly available upon paper acceptance.
The Swarm Intelligence Freeway-Urban Trajectories (SWIFTraj) Dataset - Part II: A Graph-Based Approach for Trajectory Connection
In Part I of this companion paper series, we introduced SWIFTraj, a new open-source vehicle trajectory dataset collected using an unmanned aerial vehicle (UAV) swarm. The dataset has two distinctive features. First, by connecting trajectories across consecutive UAV videos, it provides long-distance continuous trajectories, with the longest exceeding 4.5 km. Second, it covers an integrated traffic network consisting of both freeways and their connected urban roads. Obtaining such long-distance continuous trajectories from a UAV swarm is challenging, due to the need for accurate time alignment across multiple videos and the irregular spatial distribution of UAVs. To address these challenges, this paper proposes a novel graph-based approach for connecting vehicle trajectories captured by a UAV swarm. An undirected graph is constructed to represent flexible UAV layouts, and an automatic time alignment method based on trajectory matching cost minimization is developed to estimate optimal time offsets across videos. To associate trajectories of the same vehicle observed in different videos, a vehicle matching table is established using the Hungarian algorithm. The proposed approach is evaluated using both simulated and real-world data. Results from real-world experiments show that the time alignment error is within three video frames, corresponding to approximately 0.1 s, and that the vehicle matching achieves an F1-score of about 0.99. These results demonstrate the effectiveness of the proposed method in addressing key challenges in UAV-based trajectory connection and highlight its potential for large-scale vehicle trajectory collection.
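The combination of time alignment by matching-cost minimization and one-to-one vehicle association can be sketched on toy data as follows. The sketch makes several assumptions: tracks are dictionaries from frame index to a 1-D position along the road, the offset is found by a plain grid search, and the assignment is solved by exhaustive search over permutations as a small-scale stand-in for the Hungarian algorithm named in the abstract. All function names and data are hypothetical.

```python
from itertools import permutations


def track_cost(track_a, track_b, offset):
    """Mean position gap between two tracks after shifting track_b by `offset` frames."""
    gaps = [
        abs(x - track_b[f - offset])
        for f, x in track_a.items()
        if (f - offset) in track_b
    ]
    return sum(gaps) / len(gaps) if gaps else float("inf")


def best_assignment(tracks_a, tracks_b, offset):
    """Minimum-cost one-to-one matching by exhaustive search (small inputs only)."""
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(tracks_b)), len(tracks_a)):
        cost = sum(track_cost(tracks_a[i], tracks_b[j], offset) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


def align(tracks_a, tracks_b, candidate_offsets):
    """Pick the offset whose best matching has the lowest total cost."""
    results = [(*best_assignment(tracks_a, tracks_b, d), d) for d in candidate_offsets]
    return min(results, key=lambda r: r[0])


# Toy data: three vehicles at constant speed; video B's clock lags video A by 3 frames.
vehicles = [(0.0, 2.0), (10.0, 1.5), (25.0, 1.0)]  # (start position, speed per frame)
true_offset = 3
tracks_a = [{f: x0 + v * f for f in range(30)} for x0, v in vehicles]
shuffled = [vehicles[2], vehicles[0], vehicles[1]]
tracks_b = [{g: x0 + v * (g + true_offset) for g in range(30)} for x0, v in shuffled]

cost, matching, offset = align(tracks_a, tracks_b, range(-5, 6))
print(offset, matching)  # 3 (1, 2, 0)
```

Exhaustive search grows factorially with the number of vehicles, which is why a polynomial-time assignment solver is the practical choice at dataset scale.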
UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing ICCV
Accurate cone localization in 3D space is essential in autonomous racing for precise navigation around the track. Approaches that rely on traditional computer vision algorithms are sensitive to environmental variations, and neural networks are often trained on limited data and are infeasible to run in real time. We present a UNet-based neural network for keypoint detection on cones, leveraging the largest custom-labeled dataset we have assembled. Our approach enables accurate cone position estimation and the potential for color prediction. Our model achieves substantial improvements in keypoint accuracy over conventional methods. Furthermore, we leverage our predicted keypoints in the perception pipeline and evaluate the end-to-end autonomous system. Our results show high-quality performance across all metrics, highlighting the effectiveness of this approach and its potential for adoption in competitive autonomous racing systems.
comment: 8 pages, 9 figures. Accepted to ICCV End-to-End 3D Learning Workshop 2025 and presented as a poster; not included in the final proceedings due to a conference administrative error
Enhancing Cellular-enabled Collaborative Robots Planning through GNSS data for SAR Scenarios
Cellular-enabled collaborative robots are becoming paramount in Search-and-Rescue (SAR) and emergency response. Crucially dependent on resilient mobile network connectivity, they serve as invaluable assets for tasks like rapid victim localization and the exploration of hazardous, otherwise unreachable areas. However, their reliance on battery power and the need for persistent, low-latency communication limit operational time and mobility. To address this, and considering the evolving capabilities of 5G/6G networks, we propose a novel SAR framework that includes Mission Planning and Mission Execution phases and that optimizes robot deployment. By considering parameters such as the exploration area size, terrain elevation, robot fleet size, communication-influenced energy profiles, desired exploration rate, and target response time, our framework determines the minimum number of robots required and their optimal paths to ensure effective coverage and timely data backhaul over mobile networks. Our results demonstrate the trade-offs between number of robots, explored area, and response time for wheeled and quadruped robots. Further, we quantify the impact of terrain elevation data on mission time and energy consumption, showing the benefits of incorporating real-world environmental factors that might also affect mobile signal propagation and connectivity into SAR planning. This framework provides critical insights for leveraging next-generation mobile networks to enhance autonomous SAR operations.
comment: arXiv admin note: substantial text overlap with arXiv:2403.09177
Self-Curriculum Model-based Reinforcement Learning for Shape Control of Deformable Linear Objects
Precise shape control of Deformable Linear Objects (DLOs) is crucial in robotic applications in fields such as industry and medicine. However, existing methods face challenges in handling complex large deformation tasks, especially those involving opposite curvatures, and lack efficiency and precision. To address this, we propose a two-stage framework combining Reinforcement Learning (RL) and online visual servoing. In the large-deformation stage, a model-based reinforcement learning approach using an ensemble of dynamics models is introduced to significantly improve sample efficiency. Additionally, we design a self-curriculum goal generation mechanism that dynamically selects intermediate-difficulty goals with high diversity through imagined evaluations, thereby optimizing the policy learning process. In the small-deformation stage, a Jacobian-based visual servo controller is deployed to ensure high-precision convergence. Simulation results show that the proposed method enables efficient policy learning and significantly outperforms mainstream baselines in shape control success rate and precision. Furthermore, the framework effectively transfers the policy trained in simulation to real-world tasks with zero-shot adaptation. It successfully completes all 30 cases with diverse initial and target shapes across DLOs of different sizes and materials. The project website is available at: https://anonymous.4open.science/w/sc-mbrl-dlo-EB48/
DexRepNet++: Learning Dexterous Robotic Manipulation with Geometric and Spatial Hand-Object Representations
Robotic dexterous manipulation is a challenging problem due to high degrees of freedom (DoFs) and complex contacts of multi-fingered robotic hands. Many existing deep reinforcement learning (DRL) based methods aim at improving sample efficiency in high-dimensional output action spaces. However, existing works often overlook the role of representations in achieving generalization of a manipulation policy in the complex input space during the hand-object interaction. In this paper, we propose DexRep, a novel hand-object interaction representation to capture object surface features and spatial relations between hands and objects for dexterous manipulation skill learning. Based on DexRep, policies are learned for three dexterous manipulation tasks, namely grasping, in-hand reorientation, and bimanual handover, and extensive experiments are conducted to verify its effectiveness. In simulation, for grasping, the policy learned with 40 objects achieves a success rate of 87.9% on more than 5000 unseen objects of diverse categories, significantly surpassing existing work trained with thousands of objects; for the in-hand reorientation and handover tasks, the policies also boost the success rates and other metrics of existing hand-object representations by 20% to 40%. The grasp policies with DexRep are deployed to the real world under multi-camera and single-camera setups and demonstrate a small sim-to-real gap.
comment: Accepted by IEEE Transactions on Robotics (T-RO), 2026
Therapist-Robot-Patient Physical Interaction is Worth a Thousand Words: Enabling Intuitive Therapist Guidance via Remote Haptic Control
Robotic systems can enhance the amount and repeatability of physically guided motor training. Yet their real-world adoption is limited, partly due to non-intuitive trainer/therapist-trainee/patient interactions. To address this gap, we present a haptic teleoperation system for trainers to remotely guide and monitor the movements of a trainee wearing an arm exoskeleton. The trainer can physically interact with the exoskeleton through a commercial handheld haptic device via virtual contact points at the exoskeleton's elbow and wrist, allowing intuitive guidance. Thirty-two participants tested the system in a trainer-trainee paradigm, comparing our haptic demonstration system with conventional visual demonstration in guiding trainees in executing arm poses. Quantitative analyses showed that haptic demonstration significantly reduced movement completion time and improved smoothness, while speech analysis using large language models for automated transcription and categorization of verbal commands revealed fewer verbal instructions. The haptic demonstration did not result in higher reported mental and physical effort by trainers compared to the visual demonstration, while trainers reported greater competence and trainees lower physical demand. These findings support the feasibility of our proposed interface for effective remote human-robot physical interaction. Future work should assess its usability and efficacy for clinical populations in restoring clinicians' sense of agency during robot-assisted therapy.
comment: 14 pages, 5 figures, 3 tables
Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild CVPR2026
Despite progress, Vision-Language-Action models (VLAs) are limited by a scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely-labeled datasets and vast in-the-wild footage with unreliable hand tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual dynamic reconstruction and instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this approach with UniHand-Mix, a 7.5M video corpus (>2,000 hours) blending laboratory and in-the-wild footage. Experiments demonstrate that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, significantly improving downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.
comment: CVPR2026
LessMimic: Long-Horizon Humanoid Interaction with Unified Distance Field Representations
Humanoid robots that autonomously interact with physical environments over extended horizons represent a central goal of embodied intelligence. Existing approaches rely on reference motions or task-specific rewards, tightly coupling policies to particular object geometries and precluding multi-skill generalization within a single framework. A unified interaction representation enabling reference-free inference, geometric generalization, and long-horizon skill composition within one policy remains an open challenge. Here we show that Distance Field (DF) provides such a representation: LessMimic conditions a single whole-body policy on DF-derived geometric cues--surface distances, gradients, and velocity decompositions--removing the need for motion references, with interaction latents encoded via a Variational Auto-Encoder (VAE) and post-trained using Adversarial Interaction Priors (AIP) under Reinforcement Learning (RL). Through DAgger-style distillation that aligns DF latents with egocentric depth features, LessMimic further transfers seamlessly to vision-only deployment without motion capture (MoCap) infrastructure. A single LessMimic policy achieves 80--100% success across object scales from 0.4x to 1.6x on PickUp and SitStand where baselines degrade sharply, attains 62.1% success on trajectories of 5 task instances, and remains viable up to 40 sequentially composed tasks. By grounding interaction in local geometry rather than demonstrations, LessMimic offers a scalable path toward humanoid robots that generalize, compose skills, and recover from failures in unstructured environments.
Dual-Regime Hybrid Aerodynamic Modeling of Winged Blimps With Neural Mixing
Winged blimps operate across distinct aerodynamic regimes that cannot be adequately captured by a single model. At high speeds and small angles of attack, their dynamics exhibit strong coupling between lift and attitude, resembling fixed-wing aircraft behavior. At low speeds or large angles of attack, viscous effects and flow separation dominate, leading to drag-driven and damping-dominated dynamics. Accurately representing transitions between these regimes remains a fundamental challenge. This paper presents a hybrid aerodynamic modeling framework that integrates a fixed-wing Aerodynamic Coupling Model (ACM) and a Generalized Drag Model (GDM) using a learned neural network mixer with explicit physics-based regularization. The mixer enables smooth transitions between regimes while retaining explicit, physics-based aerodynamic representation. Model parameters are identified through a structured three-phase pipeline tailored for hybrid aerodynamic modeling. The proposed approach is validated on the RGBlimp platform through a large-scale experimental campaign comprising 1,320 real-world flight trajectories across 330 thruster and moving mass configurations, spanning a wide range of speeds and angles of attack. Experimental results demonstrate that the proposed hybrid model consistently outperforms single-model and predefined-mixer baselines, establishing a practical and robust aerodynamic modeling solution for winged blimps.
Trajectory Generation with Endpoint Regulation and Momentum-Aware Dynamics for Visually Impaired Scenarios
Trajectory generation for visually impaired scenarios requires smooth and temporally consistent states in structured, low-speed dynamic environments. However, traditional jerk-based heuristic trajectory sampling with independent segment generation and conventional smoothness penalties often leads to unstable terminal behavior and state discontinuities under frequent regeneration. This paper proposes a trajectory generation approach that integrates endpoint regulation to stabilize terminal states within each segment and momentum-aware dynamics to regularize the evolution of velocity and acceleration for segment consistency. Endpoint regulation is incorporated into trajectory sampling to stabilize terminal behavior, while momentum-aware dynamics enforce consistent velocity and acceleration evolution across consecutive trajectory segments. Experimental results demonstrate reduced acceleration peaks and lower jerk levels with decreased dispersion, smoother velocity and acceleration profiles, more stable endpoint distributions, and fewer infeasible trajectory candidates compared with a baseline planner.
comment: 9 pages, 7 figures
Primary-Fine Decoupling for Action Generation in Robotic Imitation ICLR
Multi-modal distributions in robotic manipulation action sequences pose critical challenges for imitation learning. To this end, existing approaches often model the action space as either a discrete set of tokens or a continuous, latent-variable distribution. However, both approaches present trade-offs: some methods discretize actions into tokens and therefore lose fine-grained action variations, while others that generate continuous actions in a single stage tend to produce unstable mode transitions. To address these limitations, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. First, we compress action chunks into a small set of discrete modes, enabling a lightweight policy to select consistent coarse modes and avoid mode bouncing. Second, a mode-conditioned MeanFlow policy is learned to generate high-fidelity continuous actions. Theoretically, we prove PF-DAG's two-stage design achieves a strictly lower MSE bound than single-stage generative policies. Empirically, PF-DAG outperforms state-of-the-art baselines across 56 tasks from Adroit, DexArt, and MetaWorld benchmarks. It further generalizes to real-world tactile dexterous manipulation tasks. Our work demonstrates that explicit mode-level decoupling enables both robust multi-modal modeling and reactive closed-loop control for robotic manipulation.
comment: The Fourteenth International Conference on Learning Representations (ICLR), 2026
SunnyParking: Multi-Shot Trajectory Generation and Motion State Awareness for Human-like Parking
Autonomous parking fundamentally differs from on-road driving due to its frequent direction changes and complex maneuvering requirements. However, existing End-to-End (E2E) planning methods often simplify the parking task into a geometric path regression problem, neglecting explicit modeling of the vehicle's kinematic state. This "dimensionality deficiency" easily leads to physically infeasible trajectories and deviates from real human driving behavior, particularly at critical gear-shift points in multi-shot parking scenarios. In this paper, we propose SunnyParking, a novel dual-branch E2E architecture that achieves motion state awareness by jointly predicting spatial trajectories and discrete motion state sequences (e.g., forward/reverse). Additionally, we introduce a Fourier feature-based representation of target parking slots to overcome the resolution limitations of traditional bird's-eye view (BEV) approaches, enabling high-precision target interactions. Experimental results demonstrate that our framework generates more robust and human-like trajectories in complex multi-shot parking scenarios, while significantly improving gear-shift point localization accuracy compared to state-of-the-art methods. We open-source a new parking dataset built on the CARLA simulator, specifically designed to evaluate full prediction capabilities under complex maneuvers.
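The abstract does not spell out the Fourier feature encoding, so the following is only a minimal sketch of the common sinusoidal form, in which a continuous 2D coordinate is mapped to sines and cosines at geometrically spaced frequencies. The function name `fourier_features`, the parameter `num_bands`, and the frequency schedule are assumptions for illustration, not the paper's implementation.

```python
import math


def fourier_features(x, y, num_bands=4):
    """Encode a 2D point with sin/cos at frequencies pi * 2**k (assumed schedule).

    Returns a flat list of length 4 * num_bands, ordered per band as
    [sin(f*x), cos(f*x), sin(f*y), cos(f*y)].
    """
    feats = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        for v in (x, y):
            feats.append(math.sin(freq * v))
            feats.append(math.cos(freq * v))
    return feats


# Hypothetical usage: encode a slot corner given in normalized coordinates.
corner_code = fourier_features(0.25, 0.75, num_bands=4)
```

Unlike a rasterized BEV grid, such an encoding takes the continuous coordinate directly, so its precision is not tied to a cell size.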
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning ICRA 2026
Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.
comment: Accepted to ICRA 2026. 8 pages, 2 figures
Biomechanical Comparisons Reveal Divergence of Human and Humanoid Gaits
It remains challenging to achieve human-like locomotion in legged robots due to fundamental discrepancies between biological and mechanical structures. Although imitation learning has emerged as a promising approach for generating natural robotic movements, simply replicating joint angle trajectories fails to capture the underlying principles of human motion. This study proposes a Gait Divergence Analysis Framework (GDAF), a unified biomechanical evaluation framework that systematically quantifies kinematic and kinetic discrepancies between humans and bipedal robots. We apply GDAF to systematically compare human and humanoid locomotion across 28 walking speeds. To enable reproducible analysis, we collect and release a speed-continuous humanoid locomotion dataset from a state-of-the-art humanoid controller. We further provide an open-source implementation of GDAF, including analysis, visualization, and MuJoCo-based tools, enabling quantitative, interpretable, and reproducible biomechanical analysis of humanoid locomotion. Results demonstrate that despite visually human-like motion generated by modern humanoid controllers, significant biomechanical divergence persists across speeds. Robots exhibit systematic deviations in gait symmetry, energy distribution, and joint coordination, indicating that substantial room remains for improving the biomechanical fidelity and energetic efficiency of humanoid locomotion. This work provides a quantitative benchmark for evaluating humanoid locomotion and offers data and versatile tools to support the development of more human-like and energetically efficient locomotion controllers. The data and code will be made publicly available upon acceptance of the paper.
DAGS-SLAM: Dynamic-Aware 3DGS SLAM via Spatiotemporal Motion Probability and Uncertainty-Aware Scheduling
Mobile robots and IoT devices demand real-time localization and dense reconstruction under tight compute and energy budgets. While 3D Gaussian Splatting (3DGS) enables efficient dense SLAM, dynamic objects and occlusions still degrade tracking and mapping. Existing dynamic 3DGS-SLAM often relies on heavy optical flow and per-frame segmentation, which is costly for mobile deployment and brittle under challenging illumination. We present DAGS-SLAM, a dynamic-aware 3DGS-SLAM system that maintains a spatiotemporal motion probability (MP) state per Gaussian and triggers semantics on demand via an uncertainty-aware scheduler. DAGS-SLAM fuses lightweight YOLO instance priors with geometric cues to estimate and temporally update MP, propagates MP to the front-end for dynamic-aware correspondence selection, and suppresses dynamic artifacts in the back-end via MP-guided optimization. Experiments on public dynamic RGB-D benchmarks show improved reconstruction and robust tracking while sustaining real-time throughput on a commodity GPU, demonstrating a practical speed-accuracy tradeoff with reduced semantic invocations toward mobile deployment.
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at https://github.com/Kisaragi0/SC-VLA.
Tacmap: Bridging the Tactile Sim-to-Real Gap via Geometry-Consistent Penetration Depth Map
Vision-Based Tactile Sensors (VBTS) are essential for achieving dexterous robotic manipulation, yet the tactile sim-to-real gap remains a fundamental bottleneck. Current tactile simulations suffer from a persistent dilemma: simplified geometric projections lack physical authenticity, while high-fidelity Finite Element Methods (FEM) are computationally prohibitive for large-scale reinforcement learning. In this work, we present Tacmap, a high-fidelity, computationally efficient tactile simulation framework anchored in volumetric penetration depth. Our key insight is to bridge the tactile sim-to-real gap by unifying both domains through a shared deform map representation. Specifically, we compute 3D intersection volumes as depth maps in simulation, while in the real world, we employ an automated data-collection rig to learn a robust mapping from raw tactile images to ground-truth depth maps. By aligning simulation and the real world in this unified geometric space, Tacmap minimizes domain shift while maintaining physical consistency. Quantitative evaluations across diverse contact scenarios demonstrate that Tacmap's deform maps closely mirror real-world measurements. Moreover, we validate the utility of Tacmap through an in-hand rotation task, where a policy trained exclusively in simulation achieves zero-shot transfer to a physical robot.
comment: 8 pages
ADM-DP: Adaptive Dynamic Modality Diffusion Policy through Vision-Tactile-Graph Fusion for Multi-Agent Manipulation ICRA 2026
Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12-25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.
comment: Accepted to IEEE International Conference on Robotics and Automation (ICRA 2026)
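For readers unfamiliar with FiLM, the operation itself is a per-channel affine transform: each feature is scaled by a coefficient gamma and shifted by a coefficient beta, where both are predicted from a conditioning input. The sketch below shows only that transform on plain lists. Which modality produces gamma and beta in ADM-DP is not stated in the abstract, so the usage values are hypothetical.

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: out[i] = gamma[i] * features[i] + beta[i]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have equal length")
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical usage: gamma and beta would normally come from a small network
# applied to the conditioning modality; here they are fixed numbers.
modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```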
Jumping Control for a Quadrupedal Wheeled-Legged Robot via NMPC and DE Optimization
Quadrupedal wheeled-legged robots combine the advantages of legged and wheeled locomotion to achieve superior mobility, but executing dynamic jumps remains a significant challenge due to the additional degrees of freedom introduced by wheeled legs. This paper develops a mini-sized wheeled-legged robot for agile motion and presents a novel motion control framework that integrates Nonlinear Model Predictive Control (NMPC) for locomotion and Differential Evolution (DE) based trajectory optimization for jumping in quadrupedal wheeled-legged robots. The proposed controller utilizes wheel motion and locomotion to enhance jumping performance, achieving versatile maneuvers such as vertical jumping, forward jumping, and backflips. Extensive simulations and real-world experiments validate the effectiveness of the framework, demonstrating a forward jump over a 0.12 m obstacle and a vertical jump reaching 0.5 m.
comment: 8 pages, 12 figures
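As background on the optimizer named above, the following is a minimal sketch of classic Differential Evolution (the DE/rand/1/bin variant) applied to a toy cost function. It is not the paper's jumping-trajectory formulation: the cost, bounds, and hyperparameters (`pop_size`, `f`, `cr`, `generations`) are illustrative assumptions.

```python
import random


def differential_evolution(cost, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=100, seed=0):
    """Minimize `cost` over box `bounds` with DE/rand/1/bin.

    Population members are replaced in place as soon as a trial is at least
    as good, so later individuals in a generation may use updated members.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [cost(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct members other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == j_rand:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)  # clip to the search box
                else:
                    v = pop[i][d]
                trial.append(v)
            trial_cost = cost(trial)
            if trial_cost <= costs[i]:  # greedy selection
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return pop[best], costs[best]


def sphere(p):
    """Toy cost: squared distance from the origin."""
    return sum(v * v for v in p)


best_point, best_cost = differential_evolution(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```

In a trajectory-optimization setting the decision vector would hold trajectory parameters and the cost would come from a dynamics rollout. DE only needs cost evaluations, not gradients.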
Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, the method of acquiring high-quality data through professional motion capture systems is constrained by costs, making it difficult to achieve large-scale scalability. To address these issues, we propose a closed-loop automated motion data generation and iterative framework. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.
SPOC: Safety-Aware Planning Under Partial Observability And Physical Constraints ICASSP 2026
Embodied Task Planning with large language models faces safety challenges in real-world environments, where partial observability and physical constraints must be respected. Existing benchmarks often overlook these critical factors, limiting their ability to evaluate both feasibility and safety. We introduce SPOC, a benchmark for safety-aware embodied task planning, which integrates strict partial observability, physical constraints, step-by-step planning, and goal-condition-based evaluation. Covering diverse household hazards such as fire, fluid, injury, object damage, and pollution, SPOC enables rigorous assessment through both state and constraint-based online metrics. Experiments with state-of-the-art LLMs reveal that current models struggle to ensure safety-aware planning, particularly under implicit constraints. Code and dataset are available at https://github.com/khm159/SPOC
comment: Accepted to IEEE ICASSP 2026
Learning Agile and Robust Omnidirectional Aerial Motion on Overactuated Tiltable-Quadrotors
Tilt-rotor aerial robots enable omnidirectional maneuvering through thrust vectoring, but introduce significant control challenges due to the strong coupling between joint and rotor dynamics. While model-based controllers can achieve high motion accuracy under nominal conditions, their robustness and responsiveness often degrade in the presence of disturbances and modeling uncertainties. This work investigates reinforcement learning for omnidirectional aerial motion control on over-actuated tiltable quadrotors that prioritizes robustness and agility. We present a learning-based control framework that enables efficient acquisition of coordinated rotor-joint behaviors for reaching target poses in the $SE(3)$ space. To achieve reliable sim-to-real transfer while preserving motion accuracy, we integrate system identification with minimal and physically consistent domain randomization. Compared with a state-of-the-art NMPC controller, the proposed method achieves comparable six-degree-of-freedom pose tracking accuracy, while demonstrating superior robustness and generalization across diverse tasks, enabling zero-shot deployment on real hardware.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
Constructive Vector Fields for Path Following in Fully-Actuated Systems on Matrix Lie Groups
This paper presents a novel vector field strategy for controlling fully-actuated systems on connected matrix Lie groups, ensuring convergence to and traversal along a curve defined on the group. Our approach generalizes our previous work (Rezende et al., 2022) and reduces to it when considering the Lie group of translations in Euclidean space. Since the proofs in Rezende et al. (2022) rely on key properties such as the orthogonality between the convergent and traversal components, we extend these results by leveraging Lie group properties. These properties also allow the control input to be non-redundant, meaning it matches the dimension of the Lie group, rather than the potentially larger dimension of the space in which the group is embedded. This can lead to more practical control inputs in certain scenarios. A particularly notable application of our strategy is in controlling systems on SE(3) -- in this case, the non-redundant input corresponds to the object's mechanical twist -- making it well-suited for controlling objects that can move and rotate freely, such as omnidirectional drones. In this case, we provide an efficient algorithm to compute the vector field. We experimentally validate the proposed method using a robotic manipulator to demonstrate its effectiveness.
When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering
Policy steering is an emerging way to adapt robot behaviors at deployment-time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well-calibrated. In practice, overconfident judgments from VLMs can degrade steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility, and selects an uncertainty resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low-level policy when it is deemed incapable of the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually but with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimize expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at https://jessie-yuan.github.io/ups/
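The abstract does not detail how conformal prediction is applied, so the sketch below shows only the generic split-conformal recipe: compute a threshold from held-out nonconformity scores, then keep every candidate whose score falls at or below it. The function names, the score values, and the idea of treating a one-element set as "confident" are illustrative assumptions, not the UPS procedure.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the
    requested coverage level 1 - alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(option_scores, qhat):
    """Keep every option whose nonconformity score is at most the threshold."""
    return [name for name, score in option_scores.items() if score <= qhat]


# Hypothetical calibration scores (lower means the verifier was more correct).
qhat = conformal_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], alpha=0.5)
options = prediction_set({"pick_red_cup": 0.3, "pick_blue_cup": 0.7}, qhat)
```

Under the usual exchangeability assumption, a set built this way contains the correct option with probability at least 1 - alpha. A single surviving option could then be read as a confident case and several as an ambiguous one.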
EgoAVFlow: Robot Policy Learning with Active Vision from Human Egocentric Videos via 3D Flow
Egocentric human videos provide a scalable source of manipulation demonstrations; however, deploying them on robots requires active viewpoint control to maintain task-critical visibility, which human viewpoint imitation often fails to provide due to human-specific priors. We propose EgoAVFlow, which learns manipulation and active vision from egocentric videos through a shared 3D flow representation that supports geometric visibility reasoning and transfers without robot demonstrations. EgoAVFlow uses diffusion models to predict robot actions, future 3D flow, and camera trajectories, and refines viewpoints at test time with reward-maximizing denoising under a visibility-aware reward computed from predicted motion and scene geometry. Real-world experiments under actively changing viewpoints show that EgoAVFlow consistently outperforms prior human-demo-based baselines, demonstrating effective visibility maintenance and robust manipulation without robot demonstrations.
Hierarchical Trajectory Planning of Floating-Base Multi-Link Robot for Maneuvering in Confined Environments
Floating-base multi-link robots can change their shape during flight, making them well-suited for applications in confined environments such as autonomous inspection and search and rescue. However, trajectory planning for such systems remains an open challenge because the problem lies in a high-dimensional, constraint-rich space where collision avoidance must be addressed together with kinematic limits and dynamic feasibility. This work introduces a hierarchical trajectory planning framework that integrates global guidance with configuration-aware local optimization. First, we exploit the dual nature of these robots - the root link as a rigid body for guidance and the articulated joints for flexibility - to generate global anchor states that decompose the planning problem into tractable segments. Second, we design a local trajectory planner that optimizes each segment in parallel with differentiable objectives and constraints, systematically enforcing kinematic feasibility and maintaining dynamic feasibility by avoiding control singularities. Third, we implement a complete system that directly processes point-cloud data, eliminating the need for handcrafted obstacle models. Extensive simulations and real-world experiments confirm that this framework enables an articulated aerial robot to exploit its morphology for maneuvering that rigid robots cannot achieve. To the best of our knowledge, this is the first planning framework for floating-base multi-link robots that has been demonstrated on a real robot to generate continuous, collision-free, and dynamically feasible trajectories directly from raw point-cloud inputs, without relying on handcrafted obstacle models.
comment: Accepted to IEEE T-ASE; DOI pending
CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines
A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine-tuning (SFT) to train action scorers, but SFT treats each candidate independently and does not explicitly teach the model to discriminate between actions that are physically correct and those that are subtly wrong. We propose the Contrastive World Model (CWM), which fine-tunes a large language model (LLM) as an action scorer using an InfoNCE contrastive objective with hard-mined negative examples. The key idea is to push valid actions away from invalid ones in scoring space, with special emphasis on hard negatives: semantically similar but physically incompatible candidates. We evaluate CWM on the ScienceWorld benchmark through two studies. First, an intrinsic affordance evaluation on 605 hard-negative test pairs shows that CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal-edit negatives -- cases where a single word changes the physical outcome -- and achieves a higher AUC-ROC (0.929 vs. 0.906). Second, a live filter characterisation study measures how well CWM ranks gold-path actions against all valid environment actions during task execution. Under out-of-distribution stress conditions, CWM maintains a significantly better safety margin (-2.39) than SFT (-3.96), indicating that the gold action is ranked closer to the top. These results support the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone.
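To make the contrastive objective concrete, the sketch below computes the standard InfoNCE loss for one valid action scored against a set of negatives: the negative log of the softmax probability assigned to the valid action. The scalar scores, the `temperature` parameter, and the function name are illustrative assumptions. The paper's scorer is a fine-tuned LLM, which is not reproduced here.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=1.0):
    """InfoNCE for one positive: -log softmax(pos) over [pos] + negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical scores: a hard negative scored close to the valid action
# yields a larger loss than an easy negative scored far below it.
loss_hard = info_nce_loss(2.0, [1.9])
loss_easy = info_nce_loss(2.0, [-3.0])
```

Because the loss depends on the gap between the valid action and each negative, near-miss candidates contribute the most, which matches the abstract's emphasis on hard negatives.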
Detection and Recognition: A Pairwise Interaction Framework for Mobile Service Robots
Autonomous mobile service robots, like lawnmowers or cleaning robots, operating in human-populated environments need to reason about local human-human interactions to support safe and socially aware navigation while fulfilling their tasks. For such robots, interaction understanding is not primarily a fine-grained recognition problem, but a perception problem under limited sensing quality and computational resources. Many existing approaches focus on holistic group activity recognition, which often requires complex and large models that may not be necessary for mobile service robots. Others use pairwise interaction methods, which commonly rely on skeletal representations, but their use in outdoor environments remains challenging. In this work, we argue that pairwise human interactions constitute a minimal yet sufficient perceptual unit for robot-centric social understanding. We study the problem of identifying interacting person pairs and classifying coarse-grained interaction behaviors sufficient for downstream group-level reasoning and service robot decision-making. To this end, we adopt a two-stage framework in which candidate interacting pairs are first identified based on lightweight geometric and motion cues, and interaction types are subsequently classified using a relation network. We evaluate the proposed approach on the JRDB dataset, where it achieves sufficient accuracy with reduced computational cost and model size compared to appearance-based methods. Additional experiments on the Collective Activity Dataset and a zero-shot test on a lawnmower-collected dataset further illustrate the generality of the proposed framework. These results suggest that pairwise geometric and motion cues provide a practical basis for interaction perception on mobile service robots, providing a promising method for integration into mobile robot navigation stacks in future work. Code will be released soon.
Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems
In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.
comment: 20 pages, 18 figures
A Distributional Treatment of Real2Sim2Real for Object-Centric Agent Adaptation in Vision-Driven Deformable Linear Object Manipulation
We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute posterior distributions over the physical parameters, with which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies (i.e. assuming only visual and proprioceptive sensing) for a DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e. without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.
Perception-Control Coupled Visual Servoing for Textureless Objects Using Keypoint-Based EKF
Visual servoing is fundamental to robotic applications, enabling precise positioning and control. However, applying it to textureless objects remains a challenge due to the absence of reliable visual features. Moreover, adverse visual conditions, such as occlusions, often corrupt visual feedback, leading to reduced accuracy and instability in visual servoing. In this work, we build upon learning-based keypoint detection for textureless objects and propose a method that enhances robustness by tightly integrating perception and control in a closed loop. Specifically, we employ an Extended Kalman Filter (EKF) that integrates per-frame keypoint measurements to estimate 6D object pose, which drives pose-based visual servoing (PBVS) for control. The resulting camera motion, in turn, enhances the tracking of subsequent keypoints, effectively closing the perception-control loop. Additionally, unlike standard PBVS, we propose a probabilistic control law that computes both camera velocity and its associated uncertainty, enabling uncertainty-aware control for safe and reliable operation. We validate our approach on real-world robotic platforms using quantitative metrics and grasping experiments, demonstrating that our method outperforms traditional visual servoing techniques in both accuracy and practical application.
Rod models in continuum and soft robot control: a review
Continuum and soft robots can transform diverse sectors, including healthcare, agriculture, marine, and space, thanks to their potential to adaptively interact with unstructured environments. These robots exhibit complex mechanics that pose diverse challenges in modeling and control. Among various models, continuum mechanical models based on rod theories can effectively capture the deformations of slender bodies in contact-rich scenarios. This structured review paper focuses on the role of rod models in continuum and soft robot control with a vertical approach. We provide a comprehensive summary of the mathematical background underlying the four main rod theories applied in soft robotics and their variants. Then, we review the literature on rod models applied to continuum and soft robots, providing a novel categorization in deformation classes. Finally, we survey recent model-based and learning-based control strategies leveraging rod models, highlighting their potential in real-world manipulation. We critically discuss the trends, advantages, limitations, research gaps, and possible future developments of rod models. This paper aims to guide researchers who intend to simulate and control new soft robots while providing feedback to the design and manufacturing community.
Lang2Lift: A Language-Guided Autonomous Forklift System for Outdoor Industrial Pallet Handling
Automating pallet handling in outdoor logistics and construction environments remains challenging due to unstructured scenes, variable pallet configurations, and changing environmental conditions. In this paper, we present Lang2Lift, an end-to-end language-guided autonomous forklift system designed to support practical pallet pick-up operations in real-world outdoor settings. The system enables operators to specify target pallets using natural language instructions, allowing flexible selection among multiple pallets with different loads and spatial arrangements. Lang2Lift integrates foundation-model-based perception modules with motion planning and control in a closed-loop autonomy pipeline. Language-grounded visual perception is used to identify and segment target pallets, followed by 6D pose estimation and geometric refinement to generate manipulation-feasible insertion poses. The resulting pose estimates are directly coupled with the forklift planning and control modules to execute fully autonomous pallet pick-up maneuvers. We deploy and evaluate the proposed system on the ADAPT autonomous outdoor forklift platform across diverse real-world scenarios, including cluttered scenes, variable lighting, and different payload configurations. Tolerance-based pose evaluation further indicates accuracy sufficient for successful fork insertion. Timing and failure analyses highlight key deployment trade-offs and practical limitations, providing insights into integrating language-guided perception within industrial automation systems. Video demonstrations are available at https://eric-nguyen1402.github.io/lang2lift.github.io/
comment: 8 pages, 7 figures
GeCo-SRT: Geometry-aware Continual Adaptation for Robotic Cross-Task Sim-to-Real Transfer CVPR 2026
Bridging the sim-to-real gap is important for applying low-cost simulation data to real-world robotic systems. However, previous methods are severely limited by treating each transfer as an isolated endeavor, demanding repeated, costly tuning and wasting prior transfer experience. To move beyond isolated sim-to-real, we build a continual cross-task sim-to-real transfer paradigm centered on knowledge accumulation across iterative transfers, thereby enabling effective and efficient adaptation to novel tasks. Thus, we propose GeCo-SRT, a geometry-aware continual adaptation method. It utilizes domain-invariant and task-invariant knowledge from local geometric features as a transferable foundation to accelerate adaptation during subsequent sim-to-real transfers. This method starts with a geometry-aware mixture-of-experts module, which dynamically activates experts to specialize in distinct geometric knowledge to bridge the observation sim-to-real gap. Further, the geometry-expert-guided prioritized experience replay module preferentially samples from underutilized experts, refreshing specialized knowledge to combat forgetting and maintain robust cross-task performance. Leveraging knowledge accumulated during iterative transfer, the GeCo-SRT method not only achieves a 52% average performance improvement over the baseline, but also demonstrates significant data efficiency for new task adaptation with only 1/6 of the data. We hope this work inspires approaches for efficient, low-cost cross-task sim-to-real transfer.
comment: Accepted by CVPR 2026
A study on the effects of mixed explicit and implicit communications in human-artificial-agent interactions
Communication between humans and artificial agents is essential for their interaction. This is often inspired by human communication, which uses gestures, facial expressions, gaze direction, and other explicit and implicit means. This work presents interaction experiments where humans and artificial agents interact through explicit and implicit communication to evaluate the effect of mixed explicit-implicit communication against purely explicit communication and the impact of the task difficulty in this evaluation. Results obtained using Bayesian parameter estimation show that the task execution time did not significantly change when mixed explicit and implicit communications were used in either of our experiments, which varied in the type of artificial agent (virtual agent and humanoid robot) used and task difficulty. The number of errors was affected by the communication only when the human was executing a more difficult task, and an impact on the perceived efficiency of the interaction was only observed in the interaction with the robot, for both easy and difficult tasks. In contrast, acceptance, sociability, and transparency of the artificial agent increased when using mixed communication modalities in both our experiments and task difficulty levels. This suggests that task-related measures, such as time, number of errors, and perceived efficiency of the interaction, as well as the impact of the communication on them, are more sensitive to the type of task and the difficulty level, whereas the combination of explicit and implicit communications more consistently improves human perceptions about artificial agents.
comment: Main paper with 28 pages, 14 figures, 4 tables. Supplementary material with 39 pages, 44 figures, 2 tables. Submitted to Intelligent Service Robotics
Adaptive Diffusion Constrained Sampling for Bimanual Robot Manipulation ICRA 2026
Coordinated multi-arm manipulation requires satisfying multiple simultaneous geometric constraints across high-dimensional configuration spaces, which poses a significant challenge for traditional planning and control methods. In this work, we propose Adaptive Diffusion Constrained Sampling (ADCS), a generative framework that flexibly integrates both equality (e.g., relative and absolute pose constraints) and structured inequality constraints (e.g., proximity to object surfaces) into an energy-based diffusion model. Equality constraints are modeled using dedicated energy networks trained on pose differences in Lie algebra space, while inequality constraints are represented via Signed Distance Functions (SDFs) and encoded into learned constraint embeddings, allowing the model to reason about complex spatial regions. A key innovation of our method is a Transformer-based architecture that learns to weight constraint-specific energy functions at inference time, enabling flexible and context-aware constraint integration. Moreover, we adopt a two-phase sampling strategy that improves precision and sample diversity by combining Langevin dynamics with resampling and density-aware re-weighting. Experimental results on dual-arm manipulation tasks show that ADCS significantly improves sample diversity and generalization across settings demanding precise coordination and adaptive constraint handling.
comment: Accepted by IEEE International Conference on Robotics and Automation 2026 (ICRA 2026)
ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning
Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP's visual and physical fidelity, with its applicability validated across imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research. Our project is at https://x-humanoid-artvip.github.io/ .
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI .
mjlab: A Lightweight Framework for GPU-Accelerated Robot Learning
We present mjlab, a lightweight, open-source framework for robot learning that combines GPU-accelerated simulation with composable environments and minimal setup friction. mjlab adopts the manager-based API introduced by Isaac Lab, where users compose modular building blocks for observations, rewards, and events, and pairs it with MuJoCo Warp for GPU-accelerated physics. The result is a framework installable with a single command, requiring minimal dependencies, and providing direct access to native MuJoCo data structures. mjlab ships with reference implementations of velocity tracking, motion imitation, and manipulation tasks.
comment: 11 pages; Code is available at https://github.com/mujocolab/mjlab ; Expanded sensor and domain randomization sections, added references, minor edits
Gauss-Newton accelerated MPPI Control
Model Predictive Path Integral (MPPI) control is a sampling-based optimization method that has recently attracted attention, particularly in the robotics and reinforcement learning communities. MPPI has been widely applied as a GPU-accelerated random search method to deterministic direct single-shooting optimal control problems arising in model predictive control (MPC) formulations. MPPI offers several key advantages, including flexibility, robustness, ease of implementation, and inherent parallelizability. However, its performance can deteriorate in high-dimensional settings since the optimal control problem is solved via Monte Carlo sampling. To address this limitation, this paper proposes an enhanced MPPI method that incorporates a Jacobian reconstruction technique and the second-order Generalized Gauss-Newton method. This novel approach is called Gauss-Newton accelerated MPPI. The numerical results show that the Gauss-Newton accelerated MPPI approach substantially improves MPPI scalability and computational efficiency while preserving the key benefits of the classical MPPI framework, making it a promising approach even for high-dimensional problems.
comment: 6 pages, 3 figures, submitted to the IFAC World Congress 2026, parts of this preprint are directly taken from Chapter 3 of the main author's PhD thesis with title "Optimal Control for Efficient Vessel Operation: From Theory to Real-World Applications"
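For readers unfamiliar with the classical MPPI baseline that the abstract above builds on, the following is a minimal sketch of one sampling-based update on a toy scalar system. It illustrates plain MPPI only and does not include the Jacobian reconstruction or Gauss-Newton step proposed in the paper; the dynamics, cost, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def rollout_cost(x0, controls):
    """Cost of a control sequence on the toy system x_{t+1} = x_t + u_t."""
    x, cost = x0, 0.0
    for u in controls:
        x = x + u
        cost += x * x + 0.01 * u * u
    return cost


def mppi_step(x0, nominal, num_samples=256, sigma=0.5, lam=1.0, rng=random):
    """One MPPI update: perturb, roll out, and average with softmin weights."""
    horizon = len(nominal)
    noises = [[rng.gauss(0.0, sigma) for _ in range(horizon)]
              for _ in range(num_samples)]
    costs = [rollout_cost(x0, [u + e for u, e in zip(nominal, eps)])
             for eps in noises]
    best = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # New nominal plan: old plan plus the weighted mean perturbation.
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(nominal)]


rng = random.Random(0)
plan = [0.0] * 5
for _ in range(20):
    plan = mppi_step(2.0, plan, rng=rng)
print(rollout_cost(2.0, [0.0] * 5), rollout_cost(2.0, plan))
```

Because every rollout is independent, the sampling loop parallelizes naturally, but the number of samples needed grows with the dimension of the control sequence, which is the scalability limitation the abstract refers to.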
Toward a Decision Support System for Energy-Efficient Ferry Operation on Lake Constance based on Optimal Control
The maritime sector is undergoing a disruptive technological change driven by three main factors: autonomy, decarbonization, and digital transformation. Addressing these factors necessitates a reassessment of inland vessel operations. This paper presents the design and development of a decision support system for ferry operations based on a shrinking-horizon optimal control framework. The problem formulation incorporates a mathematical model of the ferry's dynamics and environmental disturbances, specifically water currents and wind, which can significantly influence the dynamics. Real-world data and illustrative scenarios demonstrate the potential of the proposed system to effectively support ferry crews by providing real-time guidance. This enables enhanced operational efficiency while maintaining predefined maneuver durations. The findings suggest that optimal control applications hold substantial promise for advancing future ferry operations on inland waters. A video of the real-world ferry MS Insel Mainau operating on Lake Constance is available at: https://youtu.be/i1MjCdbEQyE
comment: 6 pages, 8 figures, parts of this preprint are directly taken from Chapter 6 of the main author's PhD thesis with title "Optimal Control for Efficient Vessel Operation: From Theory to Real-World Applications"
PD-VLA: Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding IROS 2025
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The performance of VLA models can be improved by integrating with action chunking, a critical technique for effective control. However, action chunking linearly scales up action dimensions in VLA models with increased chunking sizes. This reduces the inference efficiency. To tackle this problem, we propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This approach preserves model performance with mathematical guarantees while significantly improving decoding speed. In addition, it enables training-free acceleration without architectural changes, as well as seamless synergy with existing acceleration techniques. Extensive simulations validate that our PD-VLA maintains competitive success rates while achieving 2.52 times execution frequency on manipulators (with 7 degrees of freedom) compared with the fundamental VLA model. Furthermore, we experimentally identify the most effective settings for acceleration. Finally, real-world experiments validate its high applicability across different tasks.
comment: Accepted by IROS 2025, updated results on LIBERO
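The idea of recasting autoregressive decoding as a fixed-point problem, mentioned in the PD-VLA abstract above, can be sketched with a Jacobi-style iteration on a toy deterministic model. The `next_token` function here is a hypothetical stand-in for a greedy decoder, not the PD-VLA model, and the sketch makes no claim about the paper's actual implementation.

```python
def next_token(prefix):
    """Toy deterministic 'model': the next token depends on the whole prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 17


def sequential_decode(prompt, n):
    """Ordinary autoregressive decoding, one token per step."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def jacobi_decode(prompt, n):
    """Solve y_i = f(prompt, y_<i) for all i by fixed-point iteration."""
    guess = [0] * n
    iterations = 0
    while True:
        # Every position is refreshed from the previous guess; with a real
        # model these n evaluations could share one batched forward pass.
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        iterations += 1
        if new == guess:
            return new, iterations
        guess = new


tokens, iterations = jacobi_decode([3, 1, 4], 8)
print(tokens == sequential_decode([3, 1, 4], 8), iterations)
```

After iteration k the first k tokens already match the sequential result, so the loop stops after at most n + 1 iterations and returns the same output as sequential decoding. Any speed-up in practice comes from several positions converging in the same iteration.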
JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation ICLR 2026
Vision-and-Language Navigation requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models. However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain's semantic understanding and the right brain's spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research. Our project page: https://miv-xjtu.github.io/JanusVLN.github.io/.
comment: Accepted to ICLR 2026. Project page: https://miv-xjtu.github.io/JanusVLN.github.io/
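The fixed-size memory described in the JanusVLN abstract above keeps only the key-value entries of the initial tokens and of a sliding window of recent tokens. A minimal sketch of such a retention rule follows; the function name, the list-based cache, and the sizes are assumptions for illustration and are not taken from the paper's code.

```python
def retain_initial_and_window(cache, n_initial, window):
    """Keep the first n_initial entries plus the most recent `window` entries."""
    if len(cache) <= n_initial + window:
        return list(cache)
    # Slice from an explicit index so that window == 0 yields an empty tail.
    return list(cache[:n_initial]) + list(cache[len(cache) - window:])


cache = []
for step in range(10):
    cache.append(("k%d" % step, "v%d" % step))  # placeholder key/value pair
    cache = retain_initial_and_window(cache, n_initial=2, window=3)
print([key for key, _ in cache])
```

The cache never grows beyond `n_initial + window` entries, so memory and per-step attention cost stay bounded regardless of how long the episode runs.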
Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each primal step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.
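As background for the abstract above, the sketch below shows the classical sequential Riccati recursion for an unconstrained scalar LQR problem. It is the textbook recursion only, not the dual-regularized extension derived in the paper; the scalar system and all numeric values are assumptions chosen so the result is easy to check.

```python
def riccati_backward(a, b, q, r, q_final, horizon):
    """Backward pass for scalar LQR: x' = a*x + b*u, stage cost q*x^2 + r*u^2."""
    p = q_final
    gains = []
    for _ in range(horizon):
        s = r + b * p * b
        k = b * p * a / s               # feedback gain, u = -k * x
        p = q + a * p * a - k * s * k   # cost-to-go coefficient one step earlier
        gains.append(k)
    gains.reverse()                     # gains[0] now belongs to time step 0
    return gains, p


def closed_loop_cost(x0, a, b, q, r, q_final, gains):
    """Simulate u = -k_t * x_t and accumulate the total cost."""
    x, cost = x0, 0.0
    for k in gains:
        u = -k * x
        cost += q * x * x + r * u * u
        x = a * x + b * u
    return cost + q_final * x * x


gains, p0 = riccati_backward(1.0, 1.0, 1.0, 1.0, 1.0, horizon=50)
print(p0, closed_loop_cost(3.0, 1.0, 1.0, 1.0, 1.0, 1.0, gains))
```

The simulated closed-loop cost equals `p0 * x0**2`, which is the defining property of the cost-to-go coefficient. With a = b = q = r = 1, the coefficient converges to the positive root of p^2 = p + 1.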
EO-1: An Open Unified Embodied Foundation Model for General Robot Control
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models. Project Page: https://eo-robotics.ai/eo-1.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation ICLR 2026
Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming. This challenge intensifies for multi-step bimanual mobile manipulation, where humans must teleoperate both the mobile base and two high-DoF arms. Prior X-Gen works have developed automated data generation frameworks for static (bimanual) manipulation tasks, augmenting a few human demos in simulation with novel scene configurations to synthesize large-scale datasets. However, prior works fall short for bimanual mobile manipulation tasks for two major reasons: 1) a mobile base introduces the problem of how to place the robot base to enable downstream manipulation (reachability) and 2) an active camera introduces the problem of how to position the camera to generate data for a visuomotor policy (visibility). To address these challenges, MoMaGen formulates data generation as a constrained optimization problem that satisfies hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes across most existing automated data generation approaches and offers a principled foundation for developing future methods. We evaluate on four multi-step bimanual mobile manipulation tasks and find that MoMaGen enables the generation of much more diverse datasets than previous methods. As a result of the dataset diversity, we also show that the data generated by MoMaGen can be used to train successful imitation learning policies using a single source demo. Furthermore, the trained policy can be fine-tuned with a very small amount of real-world data (40 demos) to be successfully deployed on real robotic hardware. More details are on our project page: momagen.github.io.
comment: Project website: momagen.github.io. The first four authors contribute equally. Accepted to International Conference on Learning Representations (ICLR 2026)
SPACeR: Self-Play Anchoring with Centralized Reference Models ICLR 2026
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood rewards and KL divergence, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
comment: Accepted at ICLR 2026. Project page: https://spacer-ai.github.io/
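The SPACeR abstract above describes anchoring self-play policies to a reference model through likelihood rewards and a KL term. One way such a shaped reward could look for a discrete action space is sketched below; the additive form, the coefficients `alpha` and `beta`, and the function names are assumptions for illustration and should not be read as the paper's exact objective.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def anchored_reward(task_reward, action, policy_probs, reference_probs,
                    alpha=0.1, beta=0.1):
    """Task reward plus a reference log-likelihood bonus minus a KL penalty."""
    log_likelihood = math.log(reference_probs[action])
    kl = kl_divergence(policy_probs, reference_probs)
    return task_reward + alpha * log_likelihood - beta * kl


reference = [0.7, 0.2, 0.1]   # hypothetical human-like action distribution
matching = [0.7, 0.2, 0.1]    # policy that agrees with the reference
deviating = [0.1, 0.2, 0.7]   # policy that prefers a rare action
print(anchored_reward(1.0, 0, matching, reference),
      anchored_reward(1.0, 0, deviating, reference))
```

A policy that matches the reference pays no KL penalty, while one that drifts away is penalized even when the task reward is identical, which is the anchoring effect the abstract describes.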
HetroD: A High-Fidelity Drone Dataset and Benchmark for Autonomous Driving in Heterogeneous Traffic ICRA
We present HetroD, a dataset and benchmark for developing autonomous driving systems in heterogeneous environments. HetroD targets the critical challenge of navigating real-world heterogeneous traffic dominated by vulnerable road users (VRUs), including pedestrians, cyclists, and motorcyclists that interact with vehicles. These mixed agent types exhibit complex behaviors such as hook turns, lane splitting, and informal right-of-way negotiation. Such behaviors pose significant challenges for autonomous vehicles but remain underrepresented in existing datasets focused on structured, lane-disciplined traffic. To bridge the gap, we collect a large-scale drone-based dataset to provide a holistic observation of traffic scenes with centimeter-accurate annotations, HD maps, and traffic signal states. We further develop a modular toolkit for extracting per-agent scenarios to support downstream task development. In total, the dataset comprises over 65.4k high-fidelity agent trajectories, 70% of which are from VRUs. HetroD supports modeling of VRU behaviors in dense, heterogeneous traffic and provides standardized benchmarks for forecasting, planning, and simulation tasks. Evaluation results reveal that state-of-the-art prediction and planning models struggle with the challenges presented by our dataset: they fail to predict lateral VRU movements, cannot handle unstructured maneuvers, and exhibit limited performance in dense and multi-agent scenarios, highlighting the need for more robust approaches to heterogeneous traffic. See our project page for more examples: https://hetroddata.github.io/HetroD/
comment: IEEE International Conference on Robotics and Automation (ICRA) 2026
Learning Dexterous Manipulation Skills from Imperfect Simulations
Reinforcement learning and sim-to-real transfer have made significant progress in dexterous manipulation. However, progress remains limited by the difficulty of simulating complex contact dynamics and multisensory signals, especially tactile feedback. In this work, we propose a sim-to-real framework that addresses these limitations, and we demonstrate its effectiveness on nut-bolt fastening and screwdriving with multi-fingered hands. The framework has three stages. First, we train reinforcement learning policies in simulation using simplified object models that lead to the emergence of correct finger gaits. We then use the learned policy as a skill primitive within a teleoperation system to collect real-world demonstrations that contain tactile and proprioceptive information. Finally, we train a behavior cloning policy that incorporates tactile sensing and show that it generalizes to nuts and screwdrivers with diverse geometries. Experiments across both tasks show high task progress ratios compared to direct sim-to-real transfer and robust performance even on unseen object shapes and under external perturbations. Videos and code are available on https://dexscrew.github.io.
Multi-robot LiDAR SLAM: a practical case study in underground tunnel environments
Multi-robot SLAM aims at localizing and building a map with multiple robots, interacting with each other. In the work described in this article, we analyze the pipeline of a decentralized LiDAR SLAM system to study the current limitations of the state of the art, and we discover a significant source of failures, i.e., that the loop detection is the source of too many false positives. We therefore develop and propose a new heuristic to overcome these limitations. The environment taken as reference in this work is the highly challenging case of underground tunnels. We also highlight potential new research areas still under-explored.
comment: 14 pages, 14 figures
SCREP: Scene Coordinate Regression and Evidential Learning-based Perception-Aware Trajectory Generation
Autonomous flight in GPS-denied indoor spaces requires trajectories that keep visual-localization error tightly bounded across varied missions. Map-based visual localization methods such as feature matching require computationally intensive map reconstruction and have feature-storage scalability issues, especially for large environments. Scene coordinate regression (SCR) provides an efficient learning-based alternative that directly predicts 3D coordinates for every pixel, enabling absolute pose estimation with significant potential for onboard robotics applications. We present a perception-aware trajectory planner that couples an evidential learning-based SCR pose estimator with a receding-horizon trajectory optimizer. The optimizer steers the onboard camera toward reliable scene coordinates with low uncertainty, while a fixed-lag smoother fuses the low-rate SCR pose estimates with high-rate IMU data to provide a high-quality, high-rate pose estimate. In simulation, our planner reduces translation and rotation RMSE by at least 4.9% and 30.8% relative to baselines, respectively. Hardware-in-the-loop experiments validate the feasibility of our proposed trajectory planner under close-to-real deployment conditions.
Multiagent Systems
Using Feasible Action-Space Reduction by Groups to fill Causal Responsibility Gaps in Spatial Interactions
With the advent of autonomous vehicles and mobile robots that interact with humans, responsibility in spatial interaction is burgeoning as a research topic. Even though metrics of responsibility tailored to spatial interactions have been proposed, they are mostly focused on the responsibility of individual agents. Metrics of causal responsibility focusing on individuals fail in cases of causal overdeterminism -- when many actors simultaneously cause an outcome. To fill the gaps in causal responsibility left by individual-focused metrics, we formulate a metric for the causal responsibility of groups. To identify assertive agents that are causally responsible for the trajectory of an affected agent, we further formalise the types of assertive influences and propose a tiering algorithm for systematically identifying assertive agents. Finally, we use scenario-based simulations to illustrate the benefits of considering groups and how the emergence of group effects varies with interaction dynamics and the proximity of agents.
Hierarchical Lead Critic based Multi-Agent Reinforcement Learning
Cooperative Multi-Agent Reinforcement Learning (MARL) solves complex tasks that require coordination from multiple agents, but is often limited to either local (independent learning) or global (centralized learning) perspectives. In this paper, we introduce a novel sequential training scheme and MARL architecture, which learns from multiple perspectives on different hierarchy levels. We propose the Hierarchical Lead Critic (HLC), inspired by naturally emerging distributions in team structures, where following high-level objectives combines with low-level execution. HLC demonstrates that introducing multiple hierarchies, leveraging local and global perspectives, can lead to improved performance with high sample efficiency and robust policies. Experiments on cooperative, non-communicative, and partially observable MARL benchmarks demonstrate that HLC outperforms single-hierarchy baselines and scales robustly with increasing numbers of agents and difficulty.
comment: 16 pages, 10 Figures, Preprint
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning ICRA 2026
Multi-robot task planning requires decomposing natural-language instructions into executable actions for heterogeneous robot teams. Conventional Planning Domain Definition Language (PDDL) planners provide rigorous guarantees but struggle to handle ambiguous or long-horizon missions, while large language models (LLMs) can interpret instructions and propose plans but may hallucinate or produce infeasible actions. We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner. When plans fail, the system applies TextGrad-inspired textual-gradient updates to optimize each agent's prompt and thereby improve planning accuracy. In addition, meta-prompts are learned and shared across agents within the same layer, enabling efficient prompt optimization in multi-agent settings. On the MAT-THOR benchmark, our planner achieves success rates of 0.95 on compound tasks, 0.84 on complex tasks, and 0.60 on vague tasks, improving over the previous state-of-the-art LaMMA-P by 2, 7, and 15 percentage points respectively. An ablation study shows that the hierarchical structure, prompt optimization, and meta-prompt sharing contribute roughly +59, +37, and +4 percentage points to the overall success rate.
comment: Accepted to ICRA 2026. 8 pages, 2 figures
AgentLTV: An Agent-Based Unified Search-and-Evolution Framework for Automated Lifetime Value Prediction KDD 2026
Lifetime Value (LTV) prediction is critical in advertising, recommender systems, and e-commerce. In practice, LTV data patterns vary across decision scenarios. As a result, practitioners often build complex, scenario-specific pipelines and iterate over feature processing, objective design, and tuning. This process is expensive and hard to transfer. We propose AgentLTV, an agent-based unified search-and-evolution framework for automated LTV modeling. AgentLTV treats each candidate solution as an executable pipeline program. LLM-driven agents generate code, run and repair pipelines, and analyze execution feedback. Two decision agents coordinate a two-stage search. The Monte Carlo Tree Search (MCTS) stage explores a broad space of modeling choices under a fixed budget, guided by the Polynomial Upper Confidence bounds for Trees criterion and a Pareto-aware multi-metric value function. The Evolutionary Algorithm (EA) stage refines the best MCTS program via island-based evolution with crossover, mutation, and migration. Experiments on a large-scale proprietary dataset and a public benchmark show that AgentLTV consistently discovers strong models across ranking and error metrics. Online bucket-level analysis further indicates improved ranking consistency and value calibration, especially for high-value and negative-LTV segments. We summarize practitioner-oriented takeaways: use MCTS for rapid adaptation to new data patterns, use EA for stable refinement, and validate deployment readiness with bucket-level ranking and calibration diagnostics. The proposed AgentLTV has been successfully deployed online.
comment: 12 pages, 4 figures, submitted to KDD 2026: 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ADS Track
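The MCTS selection criterion named in the AgentLTV abstract can be illustrated with a minimal sketch. It assumes the commonly used AlphaZero-style PUCT form, score = Q + c * P * sqrt(N_parent) / (1 + N_child); the abstract does not state the exact formula, and the scalar `q` here is only a stand-in for the paper's Pareto-aware multi-metric value. The names `puct_score`, `select_child`, and `c_puct` are hypothetical.

```python
import math
from typing import Dict, List


def puct_score(q: float, prior: float, parent_visits: int,
               child_visits: int, c_puct: float = 1.0) -> float:
    """Exploitation term (q) plus a prior-weighted exploration bonus."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration


def select_child(children: List[Dict[str, float]], c_puct: float = 1.0) -> int:
    """Return the index of the child with the highest PUCT score.

    Assumption: the parent's visit count is approximated by the sum of
    its children's visit counts.
    """
    parent_visits = int(sum(ch["visits"] for ch in children))
    return max(
        range(len(children)),
        key=lambda i: puct_score(children[i]["q"], children[i]["prior"],
                                 parent_visits, int(children[i]["visits"]),
                                 c_puct),
    )


# Hypothetical candidates: one well-explored pipeline, one unexplored
# pipeline with a high prior.
children = [
    {"q": 0.5, "prior": 0.2, "visits": 10},
    {"q": 0.4, "prior": 0.8, "visits": 0},
]
best = select_child(children)
```

In this example the unexplored child wins despite its lower value estimate, which is how a fixed search budget gets spread over a broad space of modeling choices.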
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals. Unfortunately, existing approaches to learning policies for such collaborative problems produce brittle solutions that fail when paired with new partners. We attribute these failures to a combination of free-riding during training and a lack of strategic robustness. To address these problems, we study the concept of strategic risk aversion and interpret it as a principled inductive bias for generalizable cooperation with unseen partners. While strategically risk-averse players are robust to deviations in their partner's behavior by design, we show that, in collaborative games, they also (1) can have better equilibrium outcomes than those at classical game-theoretic concepts like Nash, and (2) exhibit less or no free-riding. Inspired by these insights, we develop a multi-agent reinforcement learning (MARL) algorithm that integrates strategic risk aversion into standard policy optimization methods. Our empirical results across collaborative benchmarks (including an LLM collaboration task) validate our theory and demonstrate that our approach consistently achieves reliable collaboration with heterogeneous and previously unseen partners across collaborative tasks.
Pancake: Hierarchical Memory System for Multi-Agent LLM Serving
In this work, we identify and address the core challenges of agentic memory management in LLM serving, where large-scale storage, frequent updates, and multiple coexisting agents jointly introduce complex and high-cost approximate nearest neighbor (ANN) searching problems. We present Pancake, a multi-tier agentic memory system that unifies three key techniques: (i) multi-level index caching for single agents, (ii) coordinated index management across multiple agents, and (iii) collaborative GPU-CPU acceleration. Pancake exposes an easy-to-use interface that can be integrated into memory-based agents like Mem-GPT, and is compatible with agentic frameworks such as LangChain and LlamaIndex. Experiments on realistic agent workloads show that Pancake substantially outperforms existing frameworks, achieving more than 4.29x end-to-end throughput improvement.
Sustainable Multi-Agent Crowdsourcing via Physics-Informed Bandits
Crowdsourcing platforms face a four-way tension between allocation quality, workforce sustainability, operational feasibility, and strategic contractor behaviour--a dilemma we formalise as the Cold-Start, Burnout, Utilisation, and Strategic Agency Dilemma. Existing methods resolve at most two of these tensions simultaneously: greedy heuristics and multi-criteria decision making (MCDM) methods achieve Day-1 quality but cause catastrophic burnout, while bandit algorithms eliminate burnout only through operationally infeasible 100% workforce utilisation. To address this, we introduce FORGE, a physics-grounded $K+1$ multi-agent simulator in which each contractor is a rational agent that declares its own load-acceptance threshold based on its fatigue state, converting the standard passive Restless Multi-Armed Bandit (RMAB) into a genuine Stackelberg game. Operating within FORGE, we propose a Neural-Linear UCB allocator that fuses a Two-Tower embedding network with a Physics-Informed Covariance Prior derived from offline simulator interactions. The prior simultaneously warm-starts skill-cluster geometry and UCB exploration landscape, providing a geometry-aware belief state from episode 1 that measurably reduces cold-start regret. Over $T = 200$ cold-start episodes, the proposed method achieves the highest reward of all non-oracle methods ($\text{LRew} = 0.555 \pm 0.041$) at only 7.6% workforce utilisation--a combination no conventional baseline achieves--while maintaining robustness to workforce turnover up to 50% and observation noise up to $\sigma = 0.20$.
Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents
Traditional software relies on contracts -- APIs, type systems, assertions -- to specify and enforce correct behavior. AI agents, by contrast, operate on prompts and natural language instructions with no formal behavioral specification. This gap is the root cause of drift, governance failures, and frequent project failures in agentic AI deployments. We introduce Agent Behavioral Contracts (ABC), a formal framework that brings Design-by-Contract principles to autonomous AI agents. An ABC contract C = (P, I, G, R) specifies Preconditions, Invariants, Governance policies, and Recovery mechanisms as first-class, runtime-enforceable components. We define (p, delta, k)-satisfaction -- a probabilistic notion of contract compliance that accounts for LLM non-determinism and recovery -- and prove a Drift Bounds Theorem showing that contracts with recovery rate gamma > alpha (the natural drift rate) bound behavioral drift to D* = alpha/gamma in expectation, with Gaussian concentration in the stochastic setting. We establish sufficient conditions for safe contract composition in multi-agent chains and derive probabilistic degradation bounds. We implement ABC in AgentAssert, a runtime enforcement library, and evaluate on AgentContract-Bench, a benchmark of 200 scenarios across 7 models from 6 vendors. Results across 1,980 sessions show that contracted agents detect 5.2-6.8 soft violations per session that uncontracted baselines miss entirely (p < 0.0001, Cohen's d = 6.7-33.8), achieve 88-100% hard constraint compliance, and bound behavioral drift to D* < 0.27 across extended sessions, with 100% recovery for frontier models and 17-100% across all models, at overhead < 10 ms per action.
comment: 71 pages, 7 figures, 14 tables. Patent pending. Also available on Zenodo: DOI 10.5281/zenodo.18775393
CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
Static analysis tools are widely used to detect bugs, vulnerabilities, and code smells. Traditionally, developers must resolve these warnings manually. Because this process is tedious, developers sometimes ignore warnings, leading to an accumulation of warnings and a degradation of code quality. This paper presents CodeCureAgent, an approach that harnesses LLM-based agents to automatically analyze, classify, and repair static analysis warnings. Unlike previous work, our method does not follow a predetermined algorithm. Instead, we adopt an agentic framework that iteratively invokes tools to gather additional information from the codebase (e.g., via code search) and edit the codebase to resolve the warning. CodeCureAgent detects and suppresses false positives, while fixing true positives when identified. We equip CodeCureAgent with a three-step heuristic to approve patches: (1) build the project, (2) verify that the warning disappears without introducing new warnings, and (3) run the test suite. We evaluate CodeCureAgent on a dataset of 1,000 SonarQube warnings found in 106 Java projects and covering 291 distinct rules. Our approach produces plausible fixes for 96.8% of the warnings, outperforming state-of-the-art baseline approaches by 29.2%-34.0% in plausible-fix rate. Manual inspection of 291 cases reveals a correct-fix rate of 86.3%, showing that CodeCureAgent can reliably repair static analysis warnings. The approach incurs LLM costs of about 2.9 cents (USD) and an end-to-end processing time of about four minutes per warning. We envision CodeCureAgent helping to clean existing codebases and being integrated into CI/CD pipelines to prevent the accumulation of static analysis warnings.
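The three-step patch approval heuristic described in the CodeCureAgent abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `approve_patch` and its callable parameters (`build`, `run_analysis`, `run_tests`) are hypothetical stand-ins for whatever build system, static analyzer, and test runner a project uses, and warnings are represented as plain string identifiers.

```python
from typing import Callable, Set


def approve_patch(build: Callable[[], bool],
                  run_analysis: Callable[[], Set[str]],
                  run_tests: Callable[[], bool],
                  target_warning: str,
                  warnings_before: Set[str]) -> bool:
    """Approve a patch only if all three checks pass, in order."""
    # Step 1: the patched project must still build.
    if not build():
        return False
    # Step 2: the target warning must be gone and no new warning may appear.
    warnings_after = run_analysis()
    if target_warning in warnings_after:
        return False
    if warnings_after - warnings_before:
        return False
    # Step 3: the test suite must pass.
    return run_tests()


# Hypothetical usage with stubbed tools; the warning IDs are made up.
before = {"W-100", "W-200"}
accepted = approve_patch(lambda: True, lambda: {"W-100"}, lambda: True,
                         "W-200", before)
rejected = approve_patch(lambda: True, lambda: {"W-100", "W-300"},
                         lambda: True, "W-200", before)
```

Ordering the checks from cheapest to most expensive means a patch that breaks the build is rejected before the analyzer or the test suite runs.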
Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
Life-cycle management of large-scale transportation systems requires determining a sequence of inspection and maintenance decisions to minimize long-term risks and costs while dealing with multiple uncertainties and constraints that lie in high-dimensional spaces. Traditional approaches have been widely applied but often suffer from limitations related to optimality, scalability, and the ability to properly handle uncertainty. Moreover, many existing methods rely on unconstrained formulations that overlook critical operational constraints. We address these issues in this work by casting the optimization problem within the framework of constrained Partially Observable Markov Decision Processes (POMDPs), which provide a robust mathematical foundation for stochastic sequential decision-making under observation uncertainties, in the presence of risk and resource limitations. To tackle the high dimensionality of state and action spaces, we propose DDMAC-CTDE, a Deep Decentralized Multi-Agent Actor-Critic (DDMAC) reinforcement learning architecture with Centralized Training and Decentralized Execution (CTDE). To demonstrate the utility of the proposed framework, we also develop a new comprehensive benchmark environment representing an existing transportation network in Virginia, U.S., with heterogeneous pavement and bridge assets undergoing nonstationary degradation. This environment incorporates multiple practical constraints related to budget limits, performance guidelines, traffic delays, and risk considerations. On this benchmark, DDMAC-CTDE consistently outperforms standard transportation management baselines, producing better policies. Together, the proposed framework and benchmark provide (i) a scalable, constraint-aware methodology, and (ii) a realistic, rigorous testbed for comprehensive evaluation of Deep Reinforcement Learning (DRL) for transportation infrastructure management.
Systems and Control (EESS)
Secure Semantic Communications via AI Defenses: Fundamentals, Solutions, and Future Directions
Semantic communication (SemCom) redefines wireless communication from reproducing symbols to transmitting task-relevant semantics. However, this AI-native architecture also introduces new vulnerabilities, as semantic failures may arise from adversarial perturbations to models, corrupted training data, desynchronized priors, or misaligned inference even when lower-layer transmission reliability and cryptographic protection remain intact. This survey provides a defense-centered and system-oriented synthesis of security in SemCom via AI defense. We analyze AI-centric threat models by consolidating existing studies and organizing attack surfaces across model-level, channel-realizable, knowledge-based, and networked inference vectors. Building on this foundation, we present a structured taxonomy of defense strategies organized by where semantic integrity can be compromised in SemCom systems despite correct symbol delivery, spanning semantic encoding, wireless transmission, knowledge integrity, and coordination among multiple agents. These categories correspond to distinct security failure modes, including representation fragility, channel-realizable manipulation, semantic prior poisoning or desynchronization, and adversarial propagation through distributed inference. We also examine security utility operating envelopes that capture tradeoffs among semantic fidelity, robustness, latency, and energy under realistic constraints, survey evaluation frameworks and representative applications, and identify open challenges in cross-layer composition and deployment-time certification. Overall, this survey offers a unified system-level perspective that enables readers to understand major threat and defense mechanisms in AI-native SemCom systems and to leverage emerging security techniques in the design and deployment of robust SemCom architectures for next-generation intelligent networks.
Tempered Christoffel-Weighted Polynomial Chaos Expansion for Resilience-Oriented Uncertainty Quantification
Accurate and efficient uncertainty quantification is essential for resilience assessment of modern power systems under high-impact, low-probability disturbances. Data-driven sparse polynomial chaos expansion (DD-SPCE) provides a computationally efficient surrogate framework but may suffer from ill-conditioned regression and loss of accuracy in the distribution tails that determine system risk. This paper studies the impact of regression weighting schemes on the stability and tail accuracy of DD-SPCE surrogates by introducing a tempered Christoffel-weighted least squares (T-CWLS) formulation that balances numerical stability and tail fidelity. The tempering exponent is treated as a hyperparameter whose influence is examined with respect to distributional accuracy compared with Monte Carlo simulations. Case studies on distribution system load shedding show that the proposed method reduces 95th percentile deviation by 16%, 5th percentile deviation by 6%, and improves the regression stability index by over 130%. The results demonstrate that controlling the weighting intensity directly influences both the stability index and the accuracy of tail prediction.
comment: Accepted to 2026 IEEE Power & Energy Society General Meeting
Aggressiveness-Aware Learning-based Control of Quadrotor UAVs with Safety Guarantees
This paper presents an aggressiveness-aware control framework for quadrotor UAVs that integrates learning-based oracles to mitigate the effects of unknown disturbances. Starting from a nominal tracking controller on $\mathrm{SE}(3)$, unmodeled generalized forces and moments are estimated using a learning-based oracle and compensated in the control inputs. An aggressiveness-aware gain scheduling mechanism adapts the feedback gains based on probabilistic model-error bounds, enabling reduced feedback-induced aggressiveness while guaranteeing a prescribed practical exponential tracking performance. The proposed approach makes explicit the trade-off between model accuracy, robustness, and control aggressiveness, and provides a principled way to exploit learning for safer and less aggressive quadrotor maneuvers.
Traffic-aware Hierarchical Integrated Thermal and Energy Management for Connected HEVs
The energy and thermal management systems of hybrid electric vehicles (HEVs) are inherently interdependent. With the ongoing deployment of intelligent transportation systems (ITSs) and increasing vehicle connectivity, the integration of traffic information has become crucial for improving both energy efficiency and thermal comfort in modern vehicles. To enhance fuel economy, this paper proposes a novel traffic-aware hierarchical integrated thermal and energy management (TA-ITEM) strategy for connected HEVs. In the upper layer, global reference trajectories for battery state of charge (SOC) and cabin temperature are planned using traffic flow speed information obtained from ITSs. In the lower layer, a real-time model predictive control (MPC)-based ITEM controller is developed, which incorporates a novel Transformer-based speed predictor with driving condition recognition (TF-DCR) to enable anticipatory tracking of the reference trajectories. Numerical simulations are conducted under various driving cycles and ambient temperature conditions. The results demonstrate that the proposed TA-ITEM approach outperforms conventional rule-based and MPC-SP approaches, with average fuel consumption reductions of 56.36% and 5.84%, respectively, while maintaining superior thermal regulation and cabin comfort. These findings confirm the effectiveness and strong generalization capability of TA-ITEM and underscore the advantages of incorporating traffic information.
On the airspace complexity metrics for predecessor-follower operations
This technical note proposes a novel airspace complexity metric that quantifies the air traffic controller workload and coordination effort for pairwise predecessor-follower aircraft operations in cruise. The pairwise dynamic workload (PDW) is proposed as a continuous function that depends on the relevant parameters of these operations, such as the aircraft separation and separation rate. A comparison of this metric with the dynamic density (DD) shows that it is capable of continuously evaluating the variation of airspace complexity over time and monitoring the aircraft parameters that might lead to conflicts. This metric can be used to support the implementation of autonomous and supervised aircraft procedures, to achieve a more structured and coordinated airspace.
comment: 3 pages, 2 figures
LightSim: A Lightweight Cell Transmission Model Simulator for Traffic Signal Control Research
Reinforcement learning for traffic signal control is bottlenecked by simulators: training in SUMO takes hours, reproducing results often requires days of platform-specific setup, and the slow iteration cycle discourages the multi-seed experiments that rigorous evaluation demands. Much of this cost is unnecessary, since for signal timing optimization the relevant dynamics are queue formation and discharge, which the Cell Transmission Model (CTM) captures as a macroscopic flow model. We introduce LightSim, a pure Python, pip-installable traffic simulator with Gymnasium and PettingZoo interfaces that runs over 20000 steps per second on a single CPU. Across cross-simulator experiments spanning single intersections, grid networks, arterial corridors, and six real-world city networks, LightSim preserves controller rankings from SUMO for both classical and reinforcement learning strategies while training 3 to 7 times faster. LightSim is released as an open-source benchmark with nineteen built-in scenarios, seven controllers, and full reinforcement learning pipelines, lowering the barrier to signal control research from days to minutes.
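The queue formation and discharge dynamics that the LightSim abstract attributes to the Cell Transmission Model can be illustrated with a minimal single-link sketch. It follows the textbook CTM update, where the flow across each cell boundary is the minimum of what the upstream cell can send and what the downstream cell can receive. It is not LightSim's code or API: the function `ctm_step` and all of its parameters are hypothetical.

```python
from typing import List


def ctm_step(n: List[float], inflow: float, q_max: float, n_max: float,
             delta: float, signal_cell: int, green: bool) -> List[float]:
    """Advance cell occupancies on one link by one time step.

    n           : vehicles currently in each cell
    inflow      : vehicles wanting to enter cell 0 this step
    q_max       : maximum vehicles crossing a cell boundary per step
    n_max       : jam capacity of a cell, in vehicles
    delta       : backward wave speed over free-flow speed (at most 1)
    signal_cell : index of the cell directly upstream of the stop line
    green       : whether the signal lets vehicles leave signal_cell
    """
    cells = len(n)
    # y[i] is the flow entering cell i; y[cells] is the flow leaving the link.
    y = [0.0] * (cells + 1)
    y[0] = min(inflow, q_max, delta * (n_max - n[0]))
    for i in range(1, cells):
        sending = min(n[i - 1], q_max)
        receiving = min(q_max, delta * (n_max - n[i]))
        y[i] = min(sending, receiving)
    y[cells] = min(n[-1], q_max)  # unobstructed discharge at the link end
    if not green:
        y[signal_cell + 1] = 0.0  # a red light blocks the stop line
    # Conservation: each cell gains its inflow and loses its outflow.
    return [n[i] + y[i] - y[i + 1] for i in range(cells)]


# Three cells with the stop line after the last one.
state = [5.0, 5.0, 5.0]
after_red = ctm_step(state, 2.0, 3.0, 10.0, 1.0, 2, False)
after_green = ctm_step(state, 2.0, 3.0, 10.0, 1.0, 2, True)
```

Under red the last cell accumulates a queue, and under green it discharges at the capacity `q_max`. Each step needs only a few arithmetic operations per cell, which is consistent with the abstract's argument that a macroscopic model can be much cheaper to simulate than a microscopic one.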
Learning-Based Geometric Leader-Follower Control for Cooperative Rigid-Payload Transport with Aerial Manipulators
This paper presents a learning-based tracking control framework for cooperative transport of a rigid payload by multiple aerial manipulators under rigid grasp constraints. A unified geometric model is developed, yielding a coupled agent--payload differential--algebraic system that explicitly captures contact wrenches, payload dynamics, and internal force redundancy. A leader--follower architecture is adopted in which a designated leader generates a desired payload wrench based on geometric tracking errors, while the remaining agents realize this wrench through constraint-consistent force allocation. Unknown disturbances and modeling uncertainties are compensated using Gaussian Process (GP) regression. High-probability bounds on the learning error are explicitly incorporated into the control design, combining GP feedforward compensation with geometric feedback. Lyapunov analysis establishes uniform ultimate boundedness of the payload tracking errors with high probability, with an ultimate bound that scales with the GP predictive uncertainty.
Pilot-Free Optimal Control over Wireless Networks: A Control-Aided Channel Prediction Approach
A recurring theme in optimal controller design for wireless networked control systems (WNCS) is the reliance on real-time channel state information (CSI). However, acquiring accurate CSI a priori is notoriously challenging due to the time-varying nature of wireless channels. In this work, we propose a pilot-free framework for optimal control over wireless channels in which control commands are generated from plant states together with control-aided channel prediction. For linear plants operating over an orthogonal frequency-division multiplexing (OFDM) architecture, channel prediction is performed via a Kalman filter (KF), and the optimal control policy is derived from the Bellman principle. To alleviate the curse of dimensionality in computing the optimal control policy, we approximate the solution using a coupled algebraic Riccati equation (CARE), which can be computed efficiently via a stochastic approximation (SA) algorithm. Rigorous performance guarantees are established by proving the stability of both the channel predictor and the closed-loop system under the resulting control policy, providing sufficient conditions for the existence and uniqueness of a stabilizing approximate CARE solution, and establishing convergence of the SA-based control algorithm. The framework is further extended to nonlinear plants under general wireless architectures by combining a KalmanNet-based predictor with a Markov-modulated deep deterministic policy gradient (MM-DDPG) controller. Numerical results show that the proposed pilot-free approach outperforms benchmark schemes in both control performance and channel prediction accuracy for linear and nonlinear scenarios.
Stability of Open Multi-agent Systems over Dynamic Signed Digraphs
We address the synchronization problem in open multi-agent systems (OMAS) containing both cooperative and antagonistic interactions. In these systems, agents can join or leave the network over time, and the interaction structure may evolve accordingly. To capture these dynamical structural changes, we represent the network as a switched system interconnected over a dynamic and directed signed graph. Additionally, the network may contain one or multiple leader groups that influence the behavior of the remaining agents. In general, we show that the OMAS exhibit a more general form of synchronization, including trivial consensus, bipartite consensus and containment. Our approach uses the signed edge-based agreement protocol, and constructs strict Lyapunov functions for signed networks described by signed edge-Laplacian matrices containing multiple zero eigenvalues. Numerical simulations validate our theoretical results.
Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach
The growing integration of distributed photovoltaics (PVs) into active distribution networks (ADNs) has exacerbated operational challenges, making it imperative to coordinate diverse equipment to mitigate voltage violations and enhance power quality. Although existing data-driven approaches have demonstrated effectiveness in the voltage control problem, they often require extensive trial-and-error exploration and struggle to incorporate heterogeneous information, such as day-ahead forecasts and semantic-based grid codes. Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement learning (RL) agent to achieve two-stage voltage control. In the day-ahead stage, the LLM agent receives coarse region-level forecasts and generates scheduling strategies for on-load tap changer (OLTC) and shunt capacitors (SCs) to regulate the overall voltage profile. Then in the intra-day stage, based on accurate node-level measurements, the RL agent refines terminal voltages by deriving reactive power generation strategies for PV inverters. On top of the LLM-RL collaboration framework, we further propose a self-evolution mechanism for the LLM agent and a pretrain-finetune pipeline for the RL agent, effectively enhancing and coordinating the policies for both agents. The proposed approach not only aligns more closely with practical operational characteristics but also effectively utilizes the inherent knowledge and reasoning capabilities of the LLM agent, significantly improving training efficiency and voltage control performance. Comprehensive comparisons and ablation studies demonstrate the effectiveness of the proposed method.
Geometry-Dependent Radiation of Pinching Antennas: Theory, Simulation, and Measurement
Most existing studies achieve beamforming by adjusting the positions of pinching antennas (PAs) and typically model PAs as isotropic radiators. However, under the dielectric scatterer model, the PA radiation pattern depends on its geometry. This letter investigates the radiation patterns of PAs with different geometries through full-wave simulations and measurements, and demonstrates how geometry influences the radiation directivity. In addition, an arc-shaped PA is introduced to enable transmit-direction control in PA systems. A PA system prototype consisting of a dielectric waveguide, waveguide transitions, and a PA element is proposed. Prototype measurements are used to validate the simulations and to characterize the directivity of square and triangular PAs, and the measurement procedure can be applied to obtain radiation patterns for PAs with general geometries. The simulation and measurement results jointly demonstrate that PA geometry is critical in PA systems because it influences the radiation characteristics significantly.
comment: The manuscript has been submitted to an IEEE letter/journal for possible publication
Asymmetry Demystified: Strict CLFs and Feedbacks for Predator-Prey Interconnections
The difficulty with control of population dynamics, besides the states being positive and the control having to also be positive, is the extreme difference in the dynamics near extinction and at overpopulated states. As hard as global stabilization is, even harder is finding CLFs that are strict, don't require LaSalle arguments, and permit quantification of convergence. Among the three canonical types of two-population dynamics (mutualism, which borders on trivial, predator-prey, and competition, which makes global stabilization with positive harvesting impossible), predator-prey is the "sweet spot" for the study of stabilization. Even when the predator-prey interaction is neutrally stable, global asymptotic stabilization with strict CLFs has proven very difficult, except by conservative, hard-to-gain-insight-from Matrosov-like techniques. In this little note we show directions for the design of clean, elegant, insight-bearing, majorization-free strict CLFs. They generalize the classical Volterra-style Lyapunov functions for population dynamics to non-separable Volterra-style constructions. As a bonus to strictification as an analysis activity, we provide examples of concurrent designs of feedback and CLFs, using customized versions of forwarding and backstepping (note that, in suitable coordinates, predator-prey is both strict-feedforward and strict-feedback), where the striking deviations from these methods' conventional forms are necessitated by the need to keep the predator-prey states and inputs positive.
Diagnosis-Driven Co-planning of Network Reinforcement and BESS for Distribution Grid with High Penetration of Electric Vehicles
While the rapid proliferation of electric vehicles (EVs) accelerates net-zero goals, uncoordinated charging activities impose severe operational challenges on distribution grids, including exacerbated peak loads, thermal overloading, and voltage violations. To overcome the computational intractability of jointly optimizing grid infrastructure reinforcements and battery energy storage system (BESS) installations, this paper proposes a novel three-stage diagnosis-driven co-planning (DDCP) framework. The methodology integrates a violation detection and quantification (VDQ) model to systematically identify system breaches, and a violation-mitigated BESS planning (VMBP) model for optimal BESS siting and sizing. Specifically, Stage I of the DDCP framework diagnoses critical bottleneck lines that render standalone BESS solutions infeasible. Stage II targets cable upgrades exclusively at the Top-N prioritized bottleneck lines and Stage III then executes the optimal BESS deployment using a network-enhanced VMBP model. Furthermore, this study quantifies the EV hosting capacity thresholds before and after BESS integration across varying EV adoption rates and base voltages. Finally, a comprehensive comparative analysis evaluates four mitigation approaches: the VDQ-driven cable upgrade (VCU) model, the VMBP model, system-wide voltage uprating, and the proposed DDCP framework. The results demonstrate that the DDCP framework not only resolves the complex joint-optimization hurdle but also achieves high techno-economic superiority in addressing high-EV-penetration challenges.
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: https://yy-gx.github.io/LiLo-VLA/.
Optimal Real-Time Fusion of Time-Series Data Under Rényi Differential Privacy
In this paper, we investigate the optimal real-time fusion of data collected by multiple sensors. In our set-up, the sensor measurements are considered to be private and are jointly correlated with an underlying process. A fusion center combines the private sensor measurements and releases its output to an honest-but-curious party, which is responsible for estimating the state of the underlying process based on the fusion center's output. The privacy leakage incurred by the fusion policy is quantified using Rényi differential privacy. We formulate the privacy-aware fusion design as a constrained finite-horizon optimization problem, in which the fusion policy and the state estimation are jointly optimized to minimize the state estimation error subject to a total privacy budget constraint. We derive the constrained optimality conditions for the proposed optimization problem and use them to characterize the structural properties of the optimal fusion policy. Unlike classical differential privacy mechanisms, the optimal fusion policy is shown to adaptively allocate the privacy budget and regulate the adversary's belief in a closed-loop manner. To reduce the computational burden of solving the resulting constrained optimality equations, we parameterize the fusion policy using a structured Gaussian distribution and show that the parameterized fusion policy satisfies the privacy constraint. We further develop a numerical algorithm to jointly optimize the fusion policy and state estimator. Finally, we demonstrate the effectiveness of the proposed fusion framework through a traffic density estimation case study.
Constructive Vector Fields for Path Following in Fully-Actuated Systems on Matrix Lie Groups
This paper presents a novel vector field strategy for controlling fully-actuated systems on connected matrix Lie groups, ensuring convergence to and traversal along a curve defined on the group. Our approach generalizes our previous work (Rezende et al., 2022) and reduces to it when considering the Lie group of translations in Euclidean space. Since the proofs in Rezende et al. (2022) rely on key properties such as the orthogonality between the convergent and traversal components, we extend these results by leveraging Lie group properties. These properties also allow the control input to be non-redundant, meaning it matches the dimension of the Lie group, rather than the potentially larger dimension of the space in which the group is embedded. This can lead to more practical control inputs in certain scenarios. A particularly notable application of our strategy is in controlling systems on SE(3) -- in this case, the non-redundant input corresponds to the object's mechanical twist -- making it well-suited for controlling objects that can move and rotate freely, such as omnidirectional drones. In this case, we provide an efficient algorithm to compute the vector field. We experimentally validate the proposed method using a robotic manipulator to demonstrate its effectiveness.
Hierarchical Trajectory Planning of Floating-Base Multi-Link Robot for Maneuvering in Confined Environments
Floating-base multi-link robots can change their shape during flight, making them well-suited for applications in confined environments such as autonomous inspection and search and rescue. However, trajectory planning for such systems remains an open challenge because the problem lies in a high-dimensional, constraint-rich space where collision avoidance must be addressed together with kinematic limits and dynamic feasibility. This work introduces a hierarchical trajectory planning framework that integrates global guidance with configuration-aware local optimization. First, we exploit the dual nature of these robots - the root link as a rigid body for guidance and the articulated joints for flexibility - to generate global anchor states that decompose the planning problem into tractable segments. Second, we design a local trajectory planner that optimizes each segment in parallel with differentiable objectives and constraints, systematically enforcing kinematic feasibility and maintaining dynamic feasibility by avoiding control singularities. Third, we implement a complete system that directly processes point-cloud data, eliminating the need for handcrafted obstacle models. Extensive simulations and real-world experiments confirm that this framework enables an articulated aerial robot to exploit its morphology for maneuvering that rigid robots cannot achieve. To the best of our knowledge, this is the first planning framework for floating-base multi-link robots that has been demonstrated on a real robot to generate continuous, collision-free, and dynamically feasible trajectories directly from raw point-cloud inputs, without relying on handcrafted obstacle models.
comment: Accepted to IEEE T-ASE; DOI pending
Differentially Private Data-Driven Markov Chain Modeling
Markov chains model a wide range of user behaviors. However, generating accurate Markov chain models requires substantial user data, and sharing these models without privacy protections may reveal sensitive information about the underlying user data. We introduce a method for protecting user data used to formulate a Markov chain model. First, we develop a method for privatizing database queries whose outputs are elements of the unit simplex, and we prove that this method is differentially private. We quantify its accuracy by bounding the expected KL divergence between private and non-private queries. We extend this method to privatize stochastic matrices whose rows are each a simplex-valued query of a database, which includes data-driven Markov chain models. To assess their accuracy, we analytically bound the change in the stationary distribution and the change in the convergence rate between a non-private Markov chain model and its private form. Simulations show that under a typical privacy implementation, our method yields less than 2% error in the stationary distribution, indicating that our approach to private modeling faithfully captures the behavior of the systems we study.
comment: 4 figures, 22 pages
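The accuracy measure mentioned in the abstract above, the change in the stationary distribution between a Markov chain and a perturbed copy, can be illustrated with a small sketch. This is not the paper's privacy mechanism: the perturbed matrix below is hand-picked for illustration, and the helper names `stationary_distribution` and `total_variation` are assumptions introduced here.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for pi = pi P on a row-stochastic matrix P (list of rows)."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def total_variation(p, q):
    """Total variation distance between two distributions on the same states."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Illustrative two-state chain and a hand-perturbed copy (NOT the output of
# any differentially private mechanism; rows still sum to one).
P_true = [[0.90, 0.10], [0.50, 0.50]]
P_perturbed = [[0.88, 0.12], [0.50, 0.50]]

pi_true = stationary_distribution(P_true)
pi_perturbed = stationary_distribution(P_perturbed)
error = total_variation(pi_true, pi_perturbed)
```

For this two-state example the exact stationary distribution of `P_true` is (5/6, 1/6), which gives a quick sanity check on the power iteration.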
AdapTBF: Decentralized Bandwidth Control via Adaptive Token Borrowing for HPC Storage
Modern high-performance computing (HPC) applications run on compute resources but share global storage systems. This design can cause problems when applications consume a disproportionate amount of storage bandwidth relative to their allocated compute resources. For example, an application running on a single compute node can issue many small, random writes and consume excessive I/O bandwidth from a storage server. This can hinder larger jobs that write to the same storage server and are allocated many compute nodes, resulting in significant resource waste. A straightforward solution is to limit each application's I/O bandwidth on storage servers in proportion to its allocated compute resources. This approach has been implemented in parallel file systems using Token Bucket Filter (TBF). However, strict proportional limits often reduce overall I/O efficiency because HPC applications generate short, bursty I/O. Limiting bandwidth can waste server capacity when applications are idle or prevent applications from temporarily using higher bandwidth during bursty phases. We argue that I/O control should maximize per-application performance and overall storage efficiency while ensuring fairness (e.g., preventing small jobs from blocking large-scale ones). We propose AdapTBF, which builds on TBF in modern parallel file systems (e.g., Lustre) and introduces a decentralized bandwidth control approach using adaptive borrowing and lending. We detail the algorithm, implement AdapTBF in Lustre, and evaluate it using synthetic workloads modeled after real-world scenarios. Results show that AdapTBF manages I/O bandwidth effectively while maintaining high storage utilization, even under extreme conditions.
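As background for the abstract above, a plain token bucket with a simple token transfer between buckets can be sketched as follows. This is a generic illustration under stated assumptions, not the AdapTBF algorithm or the Lustre TBF implementation: the class `TokenBucket` and the helper `lend` are hypothetical names, and the lending rule (move spare tokens up to the borrower's free capacity) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Plain token bucket: `rate` tokens per second, at most `capacity` stored."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = 0.0  # time of the last refill, in seconds

    def refill(self, now: float) -> None:
        # Add tokens for the elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_consume(self, now: float, cost: float = 1.0) -> bool:
        # Admit one request of size `cost` if enough tokens are available.
        self.refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def lend(lender: TokenBucket, borrower: TokenBucket, amount: float) -> float:
    """Hypothetical helper: move up to `amount` spare tokens between buckets."""
    moved = max(0.0, min(amount, lender.tokens, borrower.capacity - borrower.tokens))
    lender.tokens -= moved
    borrower.tokens += moved
    return moved
```

A strict limiter would use `try_consume` alone; the `lend` step shows, in the simplest possible form, how an idle application's unused tokens could serve a bursty one without raising the total admitted rate.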
A Mission Engineering Framework for Uncrewed Aerial Vehicle Design in GNSS-Denied Environments for Intelligence, Surveillance, and Reconnaissance Mission Sets
Small, low-size, weight, power, and cost (SWaP-C) uncrewed aerial vehicles (UAVs) are increasingly used for intelligence, surveillance, and reconnaissance (ISR) missions due to their affordability, attritability, and suitability for distributed operations. However, their design poses challenges including limited endurance, constrained payload capacity, and reliance on simple sensing modalities such as fixed-field-of-view, bearing-only cameras. Traditional platform-centric methods cannot capture the coupled performance, cost, and coordination trade-offs that emerge at the system-of-systems level. This paper presents a mission engineering framework for early-phase design of low-SWaP-C UAV ISR architectures. The framework integrates design of experiments, multi-objective optimization, and high-fidelity simulation into a closed-loop process linking design variables to estimator-informed performance and mission cost. Candidate architectures are explored via Latin hypercube sampling and refined using a genetic algorithm, with performance evaluated through Monte Carlo trials of a federated Kalman filter benchmarked against the posterior Cramer-Rao lower bound. Validation follows the Validation Square methodology, combining theoretical, empirical, and structural assessments. A case study on man-overboard localization in a GNSS-denied maritime environment shows that localization accuracy saturates at sub-meter levels, while higher-cost configurations primarily add redundancy and resilience. The framework thus quantifies mission trade-offs between performance, affordability, and robustness, providing a scalable decision-support tool for contested, resource-constrained ISR missions.
comment: 12 pages, 8 figures, submitted to IEEE Systems for publication
Adaptive RIS Control for Mobile mmWave NLoS Communication Using Single-Bit Feedback
Reconfigurable intelligent surfaces (RISs) are emerging as key enablers of reliable industrial automation in the millimeter-wave (mmWave) band, particularly in environments with frequent line-of-sight (LoS) blockage. While prior works have largely focused on theoretical aspects, real-time validation under user mobility remains underexplored. In this work, we propose and experimentally evaluate an adaptive beamforming algorithm that enables RIS reconfiguration via a low-rate feedback link from the mobile user equipment (UE) to the RIS controller, operating without requiring UE position knowledge. The algorithm maintains the received signal power above a predefined threshold using only a single-bit comparison of received power levels. To analyze the algorithm's performance, we establish a simulation-based Monte Carlo (MC) optimization benchmark that assumes full UE position knowledge, accounts for practical hardware constraints, and serves as an upper bound for performance evaluation. Using a hexagonal RIS with 127 elements and 1-bit phase quantization at 23.8 GHz, we validate the proposed approach in a semi-anechoic environment over a 60 cm by 92 cm area. The results demonstrate that the single-bit feedback-driven algorithm closes much of the performance gap to the MC upper bound while achieving up to 24 dB gain in received power compared to an inactive RIS baseline. These findings highlight the practical potential of feedback-based adaptive RIS control for robust mmWave non-line-of-sight (NLoS) communication with mobile users.
comment: Accepted to IEEE WCNC 2026 Workshops, Kuala Lumpur, Malaysia, April 2026
Learning to Pursue AC Optimal Power Flow Solutions with Feasibility Guarantees
This paper focuses on an AC optimal power flow (OPF) problem for distribution feeders equipped with controllable distributed energy resources (DERs). We consider a solution method that is based on a continuous approximation of the projected gradient flow - referred to as the safe gradient flow - that incorporates voltage and current information obtained either through real-time measurements or power flow computations. These two setups enable both online and offline implementations. The safe gradient flow involves the solution of convex quadratic programs (QPs). To enhance computational efficiency, we propose a novel framework that employs a neural network approximation of the optimal solution map of the QP. The resulting method has two key features: (a) it ensures that the DERs' setpoints are practically feasible, even for an online implementation or when an offline algorithm has an early termination; (b) it ensures convergence to a neighborhood of a strict local optimizer of the AC OPF. The proposed method is tested on a 93-node distribution system with realistic loads and renewable generation. The test shows that our method successfully regulates voltages within limits during periods with high renewable generation.
comment: Revised version with improved theoretical analysis and additional numerical results
High-Altitude Platforms in the Low-Altitude Economy: Bridging Communication, Computing, and Regulation
The Low-Altitude Economy (LAE) is rapidly emerging as a new technological and industrial frontier, with unmanned aerial vehicles (UAVs), electric vertical takeoff and landing (eVTOL) aircraft, and aerial swarms increasingly deployed in logistics, infrastructure inspection, security, and emergency response. However, the large-scale development of the LAE demands a reliable aerial foundation that ensures not only real-time connectivity and computational support, but also navigation integrity and safe airspace management for safety-critical operations. High-Altitude Platforms (HAPs), positioned at around 20 km, provide a unique balance between wide-area coverage and low-latency responsiveness. Compared with low earth orbit (LEO) satellites, HAPs are closer to end users and thus capable of delivering millisecond-level connectivity, fine-grained regulatory oversight, and powerful onboard computing and caching resources. Beyond connectivity and computation, HAPs-assisted sensing and regulation further enable navigation integrity and airspace trust, which are essential for safety-critical UAV and eVTOL operations in the LAE. This article proposes a five-stage evolutionary roadmap for HAPs in the LAE: from serving as aerial infrastructure bases, to becoming super back-ends for UAVs, to acting as frontline support for ground users, further enabling swarm-scale UAV coordination, and ultimately advancing toward edge-air-cloud closed-loop autonomy. In parallel, HAPs complement LEO satellites and cloud infrastructures to form a global-regional-local three-tier architecture. Looking forward, HAPs are expected to evolve from simple platforms into intelligent hubs, emerging as pivotal nodes for air traffic management, intelligent logistics, and emergency response. By doing so, they will accelerate the transition of the LAE toward large-scale deployment, autonomy, and sustainable growth.
MPC of Uncertain Nonlinear Systems with Meta-Learning for Fast Adaptation of Neural Predictive Models
In this paper, we consider the problem of reference tracking in uncertain nonlinear systems. A neural State-Space Model (NSSM) is used to approximate the nonlinear system, where a deep encoder network learns the nonlinearity from data, and a state-space component captures the temporal relationship. This transforms the nonlinear system into a linear system in a latent space, enabling the application of model predictive control (MPC) to determine effective control actions. Our objective is to design the optimal controller using limited data from the \textit{target system} (the system of interest). To this end, we employ an implicit model-agnostic meta-learning (iMAML) framework that leverages information from \textit{source systems} (systems that share similarities with the target system) to expedite training in the target system and enhance its control performance. The framework consists of two phases: the (offline) meta-training phase learns an aggregated NSSM using data from source systems, and the (online) meta-inference phase quickly adapts this aggregated model to the target system using only a few data points and few online training iterations, based on local loss function gradients. The iMAML algorithm exploits the implicit function theorem to exactly compute the gradient during training, without relying on the entire optimization path. By focusing solely on the optimal solution, rather than the path, we can meta-train with less storage complexity and fewer approximations than other contemporary meta-learning algorithms. We demonstrate through numerical examples that our proposed method can yield accurate predictive models by adaptation, resulting in a downstream MPC that outperforms several baselines.
Gauss-Newton accelerated MPPI Control
Model Predictive Path Integral (MPPI) control is a sampling-based optimization method that has recently attracted attention, particularly in the robotics and reinforcement learning communities. MPPI has been widely applied as a GPU-accelerated random search method to deterministic direct single-shooting optimal control problems arising in model predictive control (MPC) formulations. MPPI offers several key advantages, including flexibility, robustness, ease of implementation, and inherent parallelizability. However, its performance can deteriorate in high-dimensional settings since the optimal control problem is solved via Monte Carlo sampling. To address this limitation, this paper proposes an enhanced MPPI method that incorporates a Jacobian reconstruction technique and the second-order Generalized Gauss-Newton method. This novel approach is called \textit{Gauss-Newton accelerated MPPI}. The numerical results show that the Gauss-Newton accelerated MPPI approach substantially improves MPPI scalability and computational efficiency while preserving the key benefits of the classical MPPI framework, making it a promising approach even for high-dimensional problems.
comment: 6 pages, 3 figures, submitted to the IFAC World Congress 2026, parts of this preprint are directly taken from Chapter 3 of the main author's PhD thesis with title "Optimal Control for Efficient Vessel Operation: From Theory to Real-World Applications"
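For readers unfamiliar with the baseline that the abstract above builds on, one update of classical MPPI can be sketched as follows. This shows only the standard sampling-and-reweighting step, not the Gauss-Newton accelerated variant or the Jacobian reconstruction proposed in the paper. The toy single-integrator cost, the function names, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def mppi_step(u, rollout_cost, num_samples=256, sigma=0.5, lam=1.0, rng=None):
    """One classical MPPI update of the control sequence `u` (list of floats)."""
    rng = rng or random.Random(0)
    # Sample Gaussian perturbations of the whole control sequence.
    noises = [[rng.gauss(0.0, sigma) for _ in u] for _ in range(num_samples)]
    costs = [rollout_cost([ui + ei for ui, ei in zip(u, eps)]) for eps in noises]
    best = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # New controls: old controls plus the weighted average of the perturbations.
    return [
        ui + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
        for t, ui in enumerate(u)
    ]


def toy_rollout_cost(controls, x0=0.0, target=1.0, dt=0.1):
    """Toy single integrator x' = u; penalise distance to target and effort."""
    x, cost = x0, 0.0
    for uk in controls:
        x += dt * uk
        cost += (x - target) ** 2 + 0.01 * uk ** 2
    return cost


def run_mppi(iterations=30, horizon=10, seed=0):
    """Repeat the update with a shared random generator so samples differ."""
    rng = random.Random(seed)
    u = [0.0] * horizon
    for _ in range(iterations):
        u = mppi_step(u, toy_rollout_cost, rng=rng)
    return u
```

Passing one shared `rng` across iterations, as `run_mppi` does, avoids reusing the same noise samples at every step. The update relies purely on Monte Carlo samples of the cost, which is the property the abstract identifies as limiting in high-dimensional settings.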
Toward a Decision Support System for Energy-Efficient Ferry Operation on Lake Constance based on Optimal Control
The maritime sector is undergoing a disruptive technological change driven by three main factors: autonomy, decarbonization, and digital transformation. Addressing these factors necessitates a reassessment of inland vessel operations. This paper presents the design and development of a decision support system for ferry operations based on a shrinking-horizon optimal control framework. The problem formulation incorporates a mathematical model of the ferry's dynamics and environmental disturbances, specifically water currents and wind, which can significantly influence the dynamics. Real-world data and illustrative scenarios demonstrate the potential of the proposed system to effectively support ferry crews by providing real-time guidance. This enables enhanced operational efficiency while maintaining predefined maneuver durations. The findings suggest that optimal control applications hold substantial promise for advancing future ferry operations on inland waters. A video of the real-world ferry MS Insel Mainau operating on Lake Constance is available at: https://youtu.be/i1MjCdbEQyE
comment: 6 pages, 8 figures, parts of this preprint are directly taken from Chapter 6 of the main author's PhD thesis with title "Optimal Control for Efficient Vessel Operation: From Theory to Real-World Applications"
DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations
Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag causes a lack of timely, effective control policies, which may lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms, i.e., a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that conducts parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, where a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, enabling zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.
comment: Accepted as a full paper at HSCC/ICCPS 2026
Development of a Scaled Setup for Experimental Study of the Effect of Lateral Dynamics on Energy Consumption in Electric Vehicles: An Extension
Most of the existing state-of-the-art approaches for energy consumption analysis do not account for the effect of lateral dynamics on energy consumption in electric vehicles (EVs) during vehicle maneuvers. This paper aims to validate this effect through an experimental study. We develop a scaled model using a radio-controlled (RC) car, modified to achieve dynamic similitude with on-road vehicles, to conduct scaled experiments. The experimental results confirm the impact of lateral dynamics on both energy demand and driving range in electric vehicles, aligning with our previous findings [1], and emphasize the need to incorporate these factors into energy consumption models. This is an extended version of a paper accepted at IEEE ITEC 2025. It includes additional results and analysis.
Dual-Regularized Riccati Recursions for Interior-Point Optimal Control
We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each primal step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.
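As a point of reference for the abstract above, the textbook (unregularized, sequential) Riccati recursion for a finite-horizon discrete-time LQR problem can be sketched in the scalar case. This is not the dual-regularized or parallel recursion derived in the paper, and it is unrelated to the authors' C++ and JAX implementations; the function name and the scalar restriction are assumptions made to keep the example dependency-free.

```python
def lqr_riccati_scalar(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for x[k+1] = a*x[k] + b*u[k] (scalar case).

    Cost: sum_k (q*x[k]**2 + r*u[k]**2) + q_final*x[N]**2.
    Returns feedback gains K[0..N-1], with u[k] = -K[k]*x[k], and the
    cost-to-go coefficients P[0..N].
    """
    p = [0.0] * (horizon + 1)
    gains = [0.0] * horizon
    p[horizon] = q_final
    for k in range(horizon - 1, -1, -1):
        # K = (R + B'PB)^{-1} B'PA, written out for scalars.
        gains[k] = (b * p[k + 1] * a) / (r + b * p[k + 1] * b)
        # P = Q + A'PA - A'PB K.
        p[k] = q + a * p[k + 1] * a - a * p[k + 1] * b * gains[k]
    return gains, p


# One-step example: a = b = q = r = q_final = 1.
gains, p = lqr_riccati_scalar(1.0, 1.0, 1.0, 1.0, 1.0, horizon=1)
```

With `a = b = q = r = 1`, the stationary solution satisfies P^2 - P - 1 = 0, so `p[0]` approaches (1 + sqrt(5))/2 as the horizon grows, which gives a simple correctness check.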
Integrating Conductor Health into Dynamic Line Rating and Unit Commitment under Uncertainty
Dynamic line rating (DLR) enables greater utilization of existing transmission lines by leveraging real-time weather data. However, the elevated temperature operation (ETO) of conductors under DLR is often overlooked, despite its long-term impact on conductor health. This paper addresses this issue by 1) quantifying risk-based depreciation costs associated with ETO and 2) proposing a Conductor Health-Aware Unit Commitment (CHA-UC) that internalizes these costs in operational decisions. CHA-UC incorporates a robust linear approximation of conductor temperature and integration of expected depreciation costs due to hourly ETO into the objective function. Case studies on the Texas 123-bus backbone test system using NOAA weather data demonstrate that the proposed CHA-UC model reduces the total cost by 0.74\% and renewable curtailment by 85\% compared to static line rating (SLR) and outperforms quantile regression forest-based methods, while conventional DLR operation without risk consideration resulted in higher costs due to excessive ETO. Further analysis of the commitment decisions and the line temperature statistics confirms that the CHA-UC achieves safer line flows by shifting generator commitments. Finally, we examine the emergent correlation behaviors arising between wind generation and DLR forecast errors, and show that CHA-UC adaptively manages this effect by relaxing flows for risk-hedging conditions while tightening flows for risk-amplifying ones.
Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
Life-cycle management of large-scale transportation systems requires determining a sequence of inspection and maintenance decisions to minimize long-term risks and costs while dealing with multiple uncertainties and constraints that lie in high-dimensional spaces. Traditional approaches have been widely applied but often suffer from limitations related to optimality, scalability, and the ability to properly handle uncertainty. Moreover, many existing methods rely on unconstrained formulations that overlook critical operational constraints. We address these issues in this work by casting the optimization problem within the framework of constrained Partially Observable Markov Decision Processes (POMDPs), which provide a robust mathematical foundation for stochastic sequential decision-making under observation uncertainties, in the presence of risk and resource limitations. To tackle the high dimensionality of state and action spaces, we propose DDMAC-CTDE, a Deep Decentralized Multi-Agent Actor-Critic (DDMAC) reinforcement learning architecture with Centralized Training and Decentralized Execution (CTDE). To demonstrate the utility of the proposed framework, we also develop a new comprehensive benchmark environment representing an existing transportation network in Virginia, U.S., with heterogeneous pavement and bridge assets undergoing nonstationary degradation. This environment incorporates multiple practical constraints related to budget limits, performance guidelines, traffic delays, and risk considerations. On this benchmark, DDMAC-CTDE consistently outperforms standard transportation management baselines, producing better policies. Together, the proposed framework and benchmark provide (i) a scalable, constraint-aware methodology, and (ii) a realistic, rigorous testbed for comprehensive evaluation of Deep Reinforcement Learning (DRL) for transportation infrastructure management.
Stealthy Sensor Attacks Against Direct Data-Driven Controllers
This paper investigates the vulnerability of discrete-time linear time-invariant systems to stealthy sensor attacks during the learning phase. In particular, we demonstrate that a data-driven adversary, without access to the system model, can inject attacks that mislead the operator into learning an unstable state-feedback controller. We further analyze attacks that degrade the performance of data-driven $H_2$ controllers, while ensuring that the operator can always compute a feasible controller. Potential mitigation strategies are also discussed. Numerical examples illustrate the effectiveness of the proposed attacks and underscore the importance of accounting for adversarial manipulations in data-driven controller design.
comment: Conference submission